ARTICLE
Communicated by Mikhail Tsodyks
Memory Maintenance via Neuronal Regulation

David Horn, Nir Levy
School of Physics and Astronomy, Tel-Aviv University, Tel Aviv 69978, Israel

Eytan Ruppin
Departments of Computer Science and Physiology, Tel-Aviv University, Tel Aviv 69978, Israel
Since their conception half a century ago, Hebbian cell assemblies have become a basic term in the neurosciences, and the idea that learning takes place through synaptic modifications has been accepted as a fundamental paradigm. As synapses undergo continuous metabolic turnover, adopting the stance that memories are engraved in the synaptic matrix raises a fundamental problem: How can memories be maintained for very long time periods? We present a novel solution to this long-standing question, based on biological evidence of neuronal regulation mechanisms that act to maintain neuronal activity. Our mechanism is developed within the framework of a neural model of associative memory. It is operative in conjunction with random activation of the memory system and is able to counterbalance degradation of synaptic weights and normalize the basins of attraction of all memories. Over long time periods, when the variance of the degradation process becomes important, the memory system stabilizes if its synapses are appropriately bounded. Thus, the remnant memory system is obtained by a dynamic process of synaptic selection and growth driven by neuronal regulatory mechanisms. Our model is a specific realization of dynamic stabilization of neural circuitry, which is often assumed to take place during sleep.

1 Introduction

Memories can be maintained for very long periods of time, even for a complete lifetime. A fundamental dogma in the neurosciences is that memories are engraved in the brain via specific, long-term alterations in synaptic efficacies. However, synaptic turnover is relatively widespread in the mature nervous system (Goelet, Castellucci, Schacher, & Kandel, 1986; Lisman, 1994; Wolff, Laskawi, Spatz, & Missler, 1995). How, then, are memories maintained for very long periods? Clearly memories can be maintained if synaptic weights can be kept fixed, which is the purpose of several mechanisms suggested in the literature. An interesting alternative, which we
explore here, is maintaining memories with altered synaptic values; that is, synapses change dynamically and still encode the original memories.

Several ideas for synaptic maintenance were put forward in the literature. The first maintains that genomic changes are involved in long-term memory storage (Bailey, Montarolo, Chen, Kandel, & Schacher, 1992) and is based on studies showing that inhibitors of protein synthesis prevent long-term memory. However, several recent studies suggest that protein synthesis is not required for memory storage itself, but only for the expression of memory (Lisman, 1994). The second solution postulates that there exist synaptic regulatory mechanisms that can stabilize long-term synaptic changes (Crick, 1984; Lynch, 1993). A leading hypothesis is that these synaptic maintenance processes are regulated at the level of each individual synapse via an autophosphorylation process, where a specific calcium/calmodulin-modulated (CAM) kinase enzyme serves as a form of molecular memory (Lisman, 1994). However, although there is little doubt that CAM kinase is involved in long-term potentiation induction, additional investigations are required to determine its role in long-term storage.

In contradistinction to such mechanisms that rely solely on the synaptic structures, there exists the approach of dynamic stabilization, implying mechanisms that maintain synapses following their activation through the neural memory system (Kavanau, 1994). One may be tempted to think that ongoing memory recall together with Hebbian memory encoding could do this job. However, this approach may lead to pathologic attractors—the development of a configuration where few cell assemblies overshadow all others (Hasselmo, 1993; Ruppin, Reggia, & Horn, 1996).

We present a novel mechanism that belongs to the dynamic stabilization category. It separates Hebbian learning, or memory consolidation, from memory maintenance, which is carried out on the neuronal level and compensates for synaptic degradation. In addition to leading to the required homeostasis, we show that it also prevents the formation of pathologic neural assemblies. In fact, it has the interesting property of normalizing basins of attraction. The neurons in our model can regulate their overall level of synaptic inputs (i.e., average postsynaptic potential) by activating neuronal regulatory (NR) processes that jointly modify all the incoming synapses of the neuron by a common factor. Our mechanism separates naturally into two temporal domains, according to the level of variance in the synaptic degradation process. On a long time scale, it leads to a stable memory system provided the synapses are appropriately bounded. The resulting synaptic weights of the preserved memories are different from the original, memory-embedding values.

Our proposal is biologically motivated by the extensive experimental evidence of homeostasis mechanisms that act to maintain neuronal activity (see van Ooyen, 1994, for a comprehensive review). These include receptor up-regulation and down-regulation, activity-dependent regulation of membranal ion channels, and activity-dependent structural changes that
reversibly enhance or suppress neuritic outgrowth. The role of NR in regulating overall synaptic efficacies gains support from several experimental observations that point to the important role of neuronally based processes in synaptic turnover. These include the involvement of axonal transport in synaptic maintenance, the compensatory increase of the synaptic junctional area in response to synaptic loss, the involvement of immediate early genes, and the global effect on synaptic density of certain trophic factors (see Baudry & Lynch, 1993; Wolff, Laskawi, Spatz, & Missler, 1995, for a comprehensive review). We propose that neuronal regulation is a distinct process, complementing the Hebbian synaptic changes that occur during learning.

In the next section, we present the associative memory model used to study NR computationally, describe the implementation of synaptic turnover, and present the NR mechanism we employ. In section 3, we describe and analyze several computational studies of NR, without and with synaptic bounds. Finally, the biological significance of our results is discussed in section 4.

2 Methods

2.1 The Model. We study NR in the framework of an excitatory-inhibitory associative memory network (Tsodyks, 1989), having M memory patterns, N excitatory neurons, and sparse coding level p ≪ 1. The initial synaptic efficacy J_ij(t = 0) between the jth (presynaptic) neuron and the ith (postsynaptic) neuron is chosen in the Hebbian manner,

$$J_{ij}(t=0) = \frac{1}{Np}\sum_{\mu=1}^{M}\eta_i^{\mu}\eta_j^{\mu}, \qquad (2.1)$$

where the η^μ are the stored memory patterns. The updating rule for the activity state V_i of the ith binary neuron is given by

$$V_i(t' + \Delta t') = S\big(h_i(t') - T\big), \qquad (2.2)$$

where t' denotes the fast time scale of the updating of the network in a single retrieval trial and T is the threshold. S(x) is a stochastic sigmoid function, taking the value 1 with probability $(1 + e^{-x})^{-1}$ and 0 otherwise.

$$h_i(t') = h_i^{e}(t') - \gamma\,Q(t') + I_i \qquad (2.3)$$

is the local field, or membrane potential. It includes the excitatory Hebbian coupling of all other excitatory neurons,

$$h_i^{e}(t') = \sum_{j \neq i}^{N} J_{ij}\,V_j(t'), \qquad (2.4)$$
an external input I_i, and inhibition that is proportional to the total activity of the excitatory neurons,

$$Q(t') = \frac{1}{Np}\sum_{j}^{N} V_j(t'). \qquad (2.5)$$

As long as the inhibition strength obeys γ ≥ Mp², the network performs well. Performance is measured by assessing the average recall of all memories. The retrieval quality at each trial is measured by the overlap function m^μ, which denotes the similarity between the final state V that the network converges to and the memory pattern η^μ that is cued in each trial, defined by

$$m^{\mu}(t') = \frac{1}{p(1-p)N}\sum_{i=1}^{N}\left(\eta_i^{\mu} - p\right)V_i(t'). \qquad (2.6)$$
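To make the model concrete, here is a minimal simulation sketch of equations 2.1 through 2.6. It is our illustration rather than the authors' code; the threshold value T and the iteration counts are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters as in the simulations of Figure 1; T is a placeholder choice.
N, M, p = 1000, 50, 0.05
T = 0.7
gamma = M * p**2                     # inhibition strength, satisfying gamma >= M p^2

# Hebbian initialization, equation 2.1.
eta = (rng.random((M, N)) < p).astype(float)        # stored {0,1} memory patterns
J = eta.T @ eta / (N * p)
np.fill_diagonal(J, 0.0)                            # the sum in eq. 2.4 excludes j = i

def retrieval_step(V, I_ext=0.0):
    """One fast-time update, equations 2.2-2.5."""
    Q = V.sum() / (N * p)                           # total excitatory activity, eq. 2.5
    h = J @ V - gamma * Q + I_ext                   # local field, eq. 2.3
    prob = 1.0 / (1.0 + np.exp(-(h - T)))           # stochastic sigmoid S(h - T)
    return (rng.random(N) < prob).astype(float)     # eq. 2.2

def overlap(V, mu):
    """Retrieval quality m^mu, equation 2.6."""
    return (eta[mu] - p) @ V / (p * (1 - p) * N)

# Usage: cue memory 0 and let the network settle into its attractor.
V = eta[0].copy()
for _ in range(30):
    V = retrieval_step(V)
print("overlap with memory 0:", overlap(V, 0))
```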
2.2 Synaptic Degradation and NR. Synaptic weakening due to metabolic turnover, or synaptic degradation, is modeled by

$$J_{ij}(t + \Delta t) \rightarrow (1 - \epsilon_{ij})\,J_{ij}(t), \qquad (2.7)$$

where the time t changes slowly compared to t' and denotes the number of degradation and maintenance steps, or epochs. For the sake of the analytic calculations presented in the Appendix, we choose ln(1 - ε_ij) to be normally distributed with mean -ε and variance σ_ε². Synaptic strengthening resulting from NR is represented by

$$J_{ij}(t + \Delta t) \rightarrow c_i\,J_{ij}(t), \qquad (2.8)$$

in which the regulation factors c_i correct the values of all excitatory synaptic connections projecting on neuron i,

$$c_i = 1 + \tau \tanh\left[\kappa\left(1 - \frac{\langle h_i^{e}(t)\rangle}{H_i^{e}}\right)\right], \qquad (2.9)$$

where H_i^e = ⟨h_i^e(t = 0)⟩ and κ and τ are rate constants. (An alternative implementation of NR for unbounded synapses would be to change the threshold T_i of the individual neuron by an amount dependent on the changes in the field. This is mathematically equivalent to the changes in c_i but should rely on different biophysical mechanisms.) This choice of c_i maintains the average neuronal input field near its baseline value, H_i^e, as can easily be seen from the linear approximation, which is valid for small
changes in the field: for ⟨h_i^e⟩ close to H_i^e, equation 2.9 gives c_i ≈ 1 + τκ(1 - ⟨h_i^e(t)⟩/H_i^e), so the average field relaxes toward its baseline at rate τκ. The tanh function limits the effects of sudden large changes in the field, thus increasing the stability of the resulting network dynamics. In numerical simulations we use κ = 10 and τ = 0.01. (This choice of τκ = 0.1 cannot compensate the degradation in a single step. Nonetheless, applying many degradation and regulation steps with ε ≪ 1, the input field will stabilize around an overall deficiency of 10ε, which is still very small. We have found that it is advisable not to compensate fully at every step; this leads to better convergence of the algorithm for the whole network.)

We have studied (Horn, Levy, & Ruppin, 1996) a similar mechanism for the extreme case of synaptic deletion in the context of a model of Alzheimer's disease. Clearly deletion leads eventually to a breakdown of the memory system; the compensation by c_i just postpones the demise of the system. Here we are interested in finding out whether the memory system can continue to function forever if small degradation steps are used. For this purpose we find that we have to introduce a finite variation span for the synaptic weights. As the synapses J_ij undergo a series of degradation and maintenance steps, their values are allowed to change in the interval [B_-, B_+]. If the dynamics lead to J_ij < B_-, the synapse is declared dead, and J_ij is set to 0. If the dynamics lead to J_ij > B_+, it is reset to B_+, representing a limit on the strength a synapse may attain in real biological networks.

In every simulation experiment described below, a sequence of synaptic degradation and maintenance steps is executed. Each such step (one time unit, or epoch, in the results reported) is composed of the following substeps, sketched in code after this list:

1. Synaptic degradation is performed by decrementing J_ij following equation 2.7.

2. The average input field of each neuron is measured by presenting random inputs to the network and letting it flow into its attractors.

3. After averaging over many inputs, the new c_i's are calculated via equation 2.9, and the synaptic weights are modified accordingly. (The algorithm works well also in an online mode, adjusting the c_i's after the presentation of every single input.)

4. The network's current performance level is measured by equation 2.6, before another degradation step is applied.
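A sketch of one such epoch, continuing the network sketch above; the number of random inputs used for field averaging and the settling time are illustrative choices.

```python
eps, sigma_eps = 0.005, 0.005        # degradation statistics of equation 2.7
kappa, tau = 10.0, 0.01              # NR rate constants of equation 2.9
B_minus, B_plus = 0.0, np.inf        # synaptic variation span [B-, B+]

def mean_excitatory_field(J, n_inputs=100, settle=20):
    """Substep 2: average excitatory field over attractors of random inputs."""
    h_sum = np.zeros(N)
    for _ in range(n_inputs):
        V = (rng.random(N) < p).astype(float)       # random activation
        for _ in range(settle):
            V = retrieval_step(V)                   # flow into an attractor
        h_sum += J @ V                              # excitatory field, eq. 2.4
    return h_sum / n_inputs

H = mean_excitatory_field(J)          # baseline fields H_i^e, measured at t = 0

def epoch(J):
    """One degradation + maintenance step (substeps 1-3)."""
    # Substep 1: multiplicative degradation, eq. 2.7, with
    # ln(1 - eps_ij) ~ Normal(-eps, sigma_eps^2).
    J = J * np.exp(rng.normal(-eps, sigma_eps, J.shape))
    # Substeps 2-3: measure fields, then rescale each neuron's inputs, eqs. 2.8-2.9.
    c = 1.0 + tau * np.tanh(kappa * (1.0 - mean_excitatory_field(J) / H))
    J = c[:, None] * J                # c_i scales all synapses onto neuron i
    # Enforce the variation span: death below B-, saturation at B+.
    J[J < B_minus] = 0.0
    return np.minimum(J, B_plus)
```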
3 Results

3.1 Maintenance and Normalization. By maintaining the mean of the neuron's local field, the NR method prevents the rapid memory loss that would otherwise occur due to synaptic decay. Thus, with a uniform degradation process, the network's performance will be maintained forever. However, a nonuniform degradation process will eventually lead to an imbalance of synaptic weights, resulting in a finite network lifetime t_c.
Figure 1: The collapse time t_c of network performance (logarithmic scale) as a function of synaptic degradation noise level σ_ε. Both experimental (small circles) and analytic (solid curve) results are shown. N = 1000, M = 50, p = 0.05, B_- = 0, and B_+ = ∞.
We start by determining the dependence of the network's lifetime on the level of nonuniformity of synaptic degradation. We first examine the baseline case, where the synapses are unbounded (B_- = 0, B_+ = ∞). Figure 1 displays our results. We compare simulations with analytic results calculated by a mean-field approach (Sompolinsky, 1986; Tsodyks, 1989; Herrmann, Hertz, & Prügel-Bennett, 1995; see the Appendix). As the noise level of synaptic turnover increases, the network's lifetime rapidly decreases.

Translating this result to the biological realm in a precise quantitative manner is currently impossible, since data about biological synaptic turnover rates are as yet scarce and inconclusive. Several studies suggest that synapses undergo complete turnover in a period of several weeks (Goelet et al., 1986; Purves & Voyvodic, 1987; Wolff et al., 1995). If we think of the degradation and maintenance cycle as occurring a few times in 24 hours, this implies that ε is of order 10^-2. (Note that the degradation and maintenance process is assumed to proceed in small steps in our mechanism. In principle, there exists an alternative, in which the synapse undergoes major changes over only a small fraction of its, e.g., monthly, life cycle. This seems to be the case for perforated synapses; Jones, Itarat, & Calverley, 1991.)
Taking σ_ε to be roughly the same implies that the critical lifetime will be of order 10^4 epochs, or about 100 months. But if σ_ε is larger, the system will lose its homeostasis much sooner. We conclude therefore that the NR mechanism may be insufficient to account for lifelong memory maintenance if synapses are unbounded.

Before we turn to the study of the maintenance potential of NR when synapses are bounded, let us describe the normalization property of our method—its ability to counteract the formation of pathologic attractors. The latter are strongly embedded patterns that dominate all other memory patterns. Such attractors may be generated when biologically motivated activity-dependent learning algorithms are used, due to the inherent positive feedback existing in systems employing double dynamics of neuronal and synaptic updating (Dong & Hopfield, 1992; Hasselmo, 1993; Ruppin et al., 1996). Suppose that at some point in time, such pathologic attractors are formed, and the system finds itself with a synaptic efficacy matrix

$$J_{ij}(t) = \frac{1}{Np}\sum_{\mu=1}^{M} g_{\mu}\,\eta_i^{\mu}\eta_j^{\mu}, \qquad (3.1)$$
where some of the memories are encoded with weights g_μ larger than 1. We find that if at this point the NR mechanism is applied, allowing the system to evolve through degradation and maintenance cycles, such attractors are trimmed down, as demonstrated in Figure 2. We display here the basins of attraction of our model, as measured by a retrieval process that is initiated by random inputs. Whereas at the beginning the strong memories dominate the scene, their weights are gradually reduced by the maintenance method until an almost homogeneous embedding is achieved.

Neuronal regulation works well also when it is combined with ongoing learning of new, unfamiliar memory patterns. This is demonstrated in Figure 3. Here, every few epochs the network acquires another memory in an activity-dependent manner: a new memory is presented to the network via the external input I_i (see equation 2.3), and the synaptic efficacies of co-active neurons are allowed to change through

$$\Delta J_{ij} = \frac{\Delta g}{Np}\,V_i V_j.$$

This learning process is then repeated for several epochs for the same memory pattern until some total learning weight g is achieved; a sketch of this storage step follows below. At first each new memory dominates the scene, but after a few epochs its basin of attraction is reduced (see the progression of newly acquired memories on the diagonal at the lower part of Figure 3). Eventually a full and homogeneous memory system is obtained.

In principle, one can load onto this system as many memories as the capacity of the given architecture would allow.
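A sketch of this storage step, continuing the code above. The increment per epoch is our choice, set so that five epochs give the total learning weight g = 1.4 used in Figure 3; here the pattern is simply clamped, whereas in the model it is presented via the external input I_i.

```python
def store_pattern(J, eta_new, delta_g=0.28, epochs=5):
    """Activity-dependent storage: Delta J_ij = (Delta g / Np) V_i V_j for
    co-active neurons; total learning weight g = epochs * delta_g."""
    for _ in range(epochs):
        V = eta_new                                  # pattern clamped via I_i
        J = J + delta_g * np.outer(V, V) / (N * p)
        np.fill_diagonal(J, 0.0)
    return J
```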
[Figure 2a: basin of attraction (%) versus t, with curves for g_1 = 4, g_2 = 3, g_3 = 2, and g_4-g_50 = 1. Figure 2b: shares of memory space, including the null attractor, at the beginning and end of the simulation.]
Figure 2: (a) Size of basins of attraction as measured by the percentage of retrievals of specific memories. Fifty memories are stored, of which 3 have strengths of g = 4, 3, and 2, and all the rest have g = 1. The network parameters are as in Figure 1, with ε = 0.005 and σ_ε = 0.005. (b) Shares of memory space (relative sizes of basins of attraction) at the beginning (upper figure) and the end (lower figure) of the simulation. Random inputs lead to either encoded memories or the null attractor (gray shading), in which all activity stops.
In practice this depends on the value that we assume for H_i^e, on the learning strength g, and on the time spans used in the learning and degradation protocol.
[Figure 3: top panel, retrievals of the null attractor versus t (epochs 0-350); bottom panel, basins of attraction of memories 1-50 (Memory Number versus t).]
Figure 3: Alternating synaptic learning and maintenance. In a system of N = 1000 neurons holding 30 memories, we store 20 additional memories, each with total learning weight of g = 1.4 (other parameters are as in Figure 2). Every 15 epochs, a new pattern is stored. It is presented to the network and engraved in the synaptic matrix in an activity-dependent manner for 5 epochs, followed by 10 epochs of regular synaptic degradation and maintenance. The top figure shows how the null attractor gradually vanishes. The lower figure portrays the basins of attraction of the different memories (larger basins are darker) at subsequent epochs. As evident, homogeneous memory retrieval is maintained throughout the simulation.
Clearly, in this system the H_i^e are given (or innate) parameters of the neurons that ideally should correspond to the excitatory fields expected from the fully loaded network. If, for example, they correspond to a network with M memories of strength g = 1, we will have no problem loading M memories dynamically, provided the learning strength is of order unity. This does not preclude the possibility of loading more memories if M is smaller than α_c N, the ultimate capacity of the model. However, employing learning rates that are too fast or a learning strength that is too strong will result in rapid degradation of the network's retrieval performance, coupled with the emergence of mixed, spurious attractors. Rapid learning of strong memories can lead to loss of some of the previously stored memories, but not in any specific time order as in a palimpsest system.
Figure 4: The effect of synaptic bounds. The small circles denote the performance of the network without synaptic bounds, B_+ = ∞. The + symbols denote the performance of the network with B_+ = 8/Np (eight times the size of a synapse that stores one memory at t = 0), while the * symbols correspond to the case of B_+ = 3/Np. The other parameters of the simulation were N = 500, M = 25, p = 0.075, ε = 0.005, and σ_ε = 0.2.
3.2 Long-Term Maintenance. The normalization property and the ability to learn new patterns are retained when bounded synapses are employed. The difference is that now, for appropriate synaptic upper bounds, the network may successfully maintain its stored memories forever, even in the face of ongoing, continuous synaptic turnover, as demonstrated in Figure 4. The simple intuitive explanation is that by letting the degradation-maintenance process continue for a long time, the synapses undergo a random walk process with bounds. If the synaptic bound is sufficiently low, the number of large synapses retained by the NR mechanism will be higher than the minimal number of synapses required to maintain memory performance. This is the case for B_+ = 3/Np in the simulation presented in Figure 4. (This bound corresponds to the amount needed to encode three memories in the original synaptic weights, whose average value at t = 0 was 0.14/Np.) By maintaining the neurons' average postsynaptic potentials, the NR mechanism preserves the number of large synapses practically forever,
even though the identity of these synapses may change during the network's lifetime. The existence of synaptic upper bounds prevents the formation ("runaway") of synapses with very large values. The formation of the latter would have deleterious effects on the network's performance since, together with the concomitant action of the NR mechanism, they may reduce the number of large synapses below the threshold of memory capacity.

The possibility that the network can achieve stability—that it will continue to exhibit high retrieval performance forever—is enhanced when a "viability" bound (B_- > 0) is incorporated. In this case, synapses whose values decrease below B_- die, and their values are set to zero. This selective synaptic death process helps preserve the network's performance because synapses with large initial values (i.e., synapses that code several memories) have greater chances of surviving than synapses with small initial values. (The intuition for retaining synapses with large initial values is clear, since these synapses encode a large number of memories and hence are more significant than synapses with small initial values. This intuitive notion, supported by the work of Sompolinsky, 1986, on clipped synapses, has recently been proved formally by Chechick, Ruppin, and Meilijson, 1997.) This synaptic selection process is depicted in Figure 5a, which demonstrates that a significantly greater fraction of large synapses than small ones is retained through the action of the NR algorithm as time evolves. These results were obtained by studying numerically the evolution of a single neuron whose synapses undergo a series of degradation and NR steps, assuming that the NR algorithm maintains a fixed total sum of all synaptic weights; a sketch of this procedure is given after the Figure 5 caption below. This approximation of the dynamics of a network undergoing synaptic degradation and NR enabled us to trace the resulting synaptic values for very long periods of time. Interestingly, the pattern of decrease in overall synaptic counts as time evolves is remarkably reminiscent of that observed experimentally in primates (Rakic, Bourgeois, & Eckenhoff, 1986; Rakic, Bourgeois, & Goldman-Rakic, 1993).

The level of the selection bias toward synapses with large initial values depends on the pattern of synaptic degradation employed. Figure 5b demonstrates that the selective bias is much larger if synaptic degradation is additive ($J_{ij}(t + \Delta t) \rightarrow J_{ij}(t) - \epsilon_{ij}$) instead of multiplicative, the assumption employed in our model. Biological synaptic degradation may well lie in between these two extreme degradation mechanisms.

4 Discussion

We have described a developmental, ongoing process of synaptic turnover including Hebbian changes, noisy degradation, and NR correction steps. Our maintenance process has a temporal scale determined by the variance of synaptic degradation, as shown in Figure 1. For short times, t < t_c, NR compensates for the loss of synaptic efficacy.
[Figure 5: fraction of surviving synapses versus t (logarithmic scale), panels (a) and (b), with separate curves for small synapses, large synapses, and all synapses.]
Figure 5: The fraction of remaining synapses in a neuron that undergoes a series of synaptic degradation and NR steps. (a) With multiplicative synaptic degradation; ε = 0.01, σ_ε = 0.1. (b) With additive synaptic degradation; ε = 0.001, σ_ε = 0.001. The simulated neuron has 10^4 synapses, whose initial values follow the typical distribution of synaptic values of a neuron in a network of N = 500 neurons storing 25 memories with p = 0.4. The bounds are B_+ = 10/Np and B_- = 0.5/Np. The small synapses traced here store a single memory pattern, while the large synapses store seven patterns each.
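The single-neuron procedure behind Figure 5 can be sketched as follows. The approximation of NR by holding the total synaptic sum fixed is the one described in the text; the order of bounding and rescaling within an epoch is our modeling choice.

```python
import numpy as np

rng = np.random.default_rng(1)

def single_neuron_walk(w, eps, sigma_eps, B_lo, B_hi, epochs, multiplicative=True):
    """Degradation + NR random walk for one neuron's synapse vector w."""
    target = w.sum()
    for _ in range(epochs):
        if multiplicative:
            w = w * np.exp(rng.normal(-eps, sigma_eps, w.shape))   # as in eq. 2.7
        else:
            w = w - rng.normal(eps, sigma_eps, w.shape)            # additive variant
        w[w < B_lo] = 0.0                     # death below the viability bound B-
        w = np.minimum(w, B_hi)               # saturation at the upper bound B+
        alive = w > 0
        if alive.any():
            w[alive] *= target / w[alive].sum()   # NR approximation: fixed total sum
    return w

# Usage with the Figure 5 parameters (N = 500, M = 25, p = 0.4); a synapse
# storing k memories starts at the Hebbian value k/(Np).
N, M, p = 500, 25, 0.4
k = rng.binomial(M, p * p, size=10_000)           # memories stored per synapse
w = single_neuron_walk(k / (N * p), eps=0.01, sigma_eps=0.1,
                       B_lo=0.5 / (N * p), B_hi=10 / (N * p), epochs=10_000)
print("surviving fraction:", (w > 0).mean())
```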
It also helps to normalize memory retrieval, by equalizing the basins of attraction of the stored memories and preventing the formation of pathologic attractors. For long times, t > t_c, a network with unbounded synapses cannot maintain its memory. However, NR can maintain memory forever in networks with appropriately bounded synapses. During the NR process, some synapses die, while others approach the upper synaptic bound and remain in its vicinity, realizing long-term memory maintenance. Memory maintenance may therefore be achieved even though the synapses are not maintained at their original values.

The NR mechanism described in this article provides a biological realization of synaptic clipping, bearing similarity to a process described previously (Sompolinsky, 1986) in the context of a Hopfield model. In the latter, the synaptic memory matrix is clipped so that all synaptic weights whose absolute value lies below some threshold vanish, while the values of all others are set to plus or minus the threshold value. This process (Sompolinsky, 1986) causes a surprisingly small decrease in the capacity of the associative memory network. In our model, a subset of the surviving synapses approaches the upper bound. The choice of these strong synapses is stochastic and time varying, but synapses with large initial values have much larger chances to survive than initially weak synapses. That is, the action of the NR mechanism gradually transforms the network from having continuous synapses to quasi-binary ones, in a computationally efficient manner.

From a biological point of view, analog networks may be a transitional, developmental stage of associative memories as their synapses saturate and become quasi-binary. For a fixed number of synapses per neuron, this process is computationally advantageous versus Willshaw-like networks that are based on binary synapses to begin with, since it leads to a more efficient synaptic matrix where only synapses representing several memories are retained. A straightforward prediction of the NR model is that synaptic efficacies observed in the brain should become narrowly distributed during growth and maturation. It would be interesting to know if this is indeed what lies behind the observed pattern of synaptic density reduction at maturity. Clearly this question lies outside the scope of our existing experimental capabilities.

Recent findings support the notion that biological synaptic efficacy is indeed bounded in a rather limited range. This idea has been incorporated in the Bienenstock-Cooper-Munro (BCM) theory (Bear & Cooper, 1987) of long-term potentiation (LTP) and long-term depression (LTD). A recent review (Abraham & Bear, 1996) has coined the term metaplasticity, meaning the plasticity of synaptic plasticity. It shows that prior synaptic activity can inhibit the induction of subsequent LTP (and facilitate LTD) in a synapse-specific manner.

Our mechanism relies on activation of the memory system by random inputs, thus testing all basins of attraction without resorting to activation by the memories themselves. As such, it is reminiscent of previous
tions (Crick & Mitchison, 1983; Hopfield, Feinstein, & Palmer, 1983) that utilize random activity to unlearn spurious attractors in the network. Such attractors are rare in the Tsodyks model and therefore were irrelevant in our study. Notice, though, that our NR mechanism does weaken the memories that are frequently retrieved through random activation, thus leading to the normalization exemplified in Figure 2. Random activation of cortical memory systems may be triggered by ponto geniculate occipital waves (Hobson & McCarley, 1977) during rapid-eye-movement sleep. It is, however, still unclear whether this is indeed the appropriate and the only period in which synaptic maintenance occurs. In any case, it seems preferable to have a clear separation between the processes of memory consolidation and memory maintenance since they require activation of different (and complementary) mechanisms. NR can be viewed as a particular realization of dynamic stabilization, a term that describes the idea that during sleep there exist dynamic processes that maintain synaptic efficacies. Kavanau (1994, 1997) has presented an extensive review of the literature on this subject, including many experimental findings that bear on the possible roles of different stages of sleep and theoretical suggestions as to how these may be beneficial to synaptic maintenance. Finally, it should be noted that recent findings indicate that signaling molecules involved in NR are altered in Alzheimer’s disease (Saitoh et al., 1991, Masliah & Terry, 1993; Masliah, 1995). This points to the important clinical implications of studying this mechanism further.
Acknowledgments

We thank Larry Abbott, John Hopfield, John Lisman, Isaac Meilijson, James Reggia, and Marius Usher for very helpful discussions and comments.
Appendix: Memory Maintenance—A Mean-Field Analysis

The goal of the following analysis is to calculate the collapse time t_c of the network as a function of the level of inhomogeneity of the synaptic degradation process, σ_ε (see Figure 1). To this end, we express the latter factor in the framework of coupled mean-field macroscopic equations that describe the network's dynamics. We solve these equations numerically to find the network's collapse time.

To find the effects of synaptic degradation and NR, we replace the multiplicative degradation noise by an equivalent additive synaptic noise, following Sompolinsky (1986), and assume that the maintenance algorithm
perfectly preserves the mean of the neuron's local field. Thus,

$$J_{ij}(t) = \frac{1}{Np}\sum_{\mu=1}^{M}\eta_i^{\mu}\eta_j^{\mu} + \Delta_{ij}(t), \qquad (A.1)$$

where Δ_ij(t) has a gaussian distribution with zero mean and Δ²(t)/N variance, with

$$\Delta^{2}(t) = \sigma^{2}(t)\,\alpha\left(1 + p^{2}N\alpha\right), \qquad (A.2)$$

and α = M/N. σ(t) represents the cumulative noise introduced by synaptic degradation,

$$\sigma(t) = \sigma_{\epsilon}\sqrt{\frac{e^{t\sigma_{\epsilon}^{2}} - 1}{e^{\sigma_{\epsilon}^{2}} - 1}}. \qquad (A.3)$$
The inhibition strength is taken to be γ = Mp², the external input I_i is assumed to be off, and the threshold T is of the order of unity. In a similar fashion to Herrmann et al. (1995), we write the local field of neuron i for p ≪ 1,

$$h_i(t') = \sum_{j} J_{ij}(t)V_j(t') - Mp^{2}Q(t') \cong \sum_{\nu}^{s}\left(\eta_i^{\nu} - p\right)m^{\nu}(t') + \phi_i(t'), \qquad (A.4)$$

where the summation is over the s memories that have macroscopic overlaps and φ_i(t') is crosstalk noise,

$$\phi_i(t') = \frac{1}{Np}\sum_{j}\left[\sum_{\mu>s}^{M}\left(\eta_i^{\mu}\eta_j^{\mu} - p^{2}\right) + pN\Delta_{ij}(t)\right]V_j(t'). \qquad (A.5)$$

In the limit of large N, φ_i(t') is normally distributed with zero mean and variance

$$p\alpha Q(t')\left[1 + p^{2}NQ(t')\right] + p\Delta^{2}(t)Q(t').$$

The stochastic sigmoid function in equation 2.2 can be replaced by a deterministic threshold function with additive noise,

$$V_i(t' + \Delta t') = \Theta\left[h_i(t') + \zeta_i(t') - T\right], \qquad (A.6)$$

where Θ is the step function and ζ_i(t') is a gaussian noise term. Substituting the expression for the local field, equation A.4, into equation A.6, we get

$$V_i(t' + \Delta t') = \Theta\left[\sum_{\nu}^{s}\left(\eta_i^{\nu} - p\right)m^{\nu}(t') - T + \phi_i(t') + \zeta_i(t')\right]. \qquad (A.7)$$
To calculate the evolution of the overlaps, this expression is substituted in equation 2.6, leading to

$$m^{\nu}(t' + \Delta t') = \left\langle \frac{\eta_i^{\nu} - p}{p(1-p)}\;\Phi\!\left(\frac{T - \sum_{\nu}^{s}\left(\eta_i^{\nu} - p\right)m^{\nu}(t')}{\sqrt{p\alpha Q(t')\left[1 + p^{2}NQ(t')\right] + p\Delta^{2}(t)Q(t') + \zeta^{2}}}\right)\right\rangle_{\eta^{\nu}}, \qquad (A.8)$$

where

$$\Phi(x) = \int_{x}^{\infty}\frac{dz}{\sqrt{2\pi}}\exp\left(-\frac{z^{2}}{2}\right) = \frac{1}{2}\left(1 - \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right). \qquad (A.9)$$

Similarly, the total network activity is given by

$$Q(t' + \Delta t') = \frac{1}{Np}\sum_{j}^{N}V_j(t' + \Delta t') = \left\langle \frac{1}{p}\;\Phi\!\left(\frac{T - \sum_{\nu}^{s}\left(\eta_i^{\nu} - p\right)m^{\nu}(t')}{\sqrt{p\alpha Q(t')\left[1 + p^{2}NQ(t')\right] + p\Delta^{2}(t)Q(t') + \zeta^{2}}}\right)\right\rangle_{\eta^{\nu}}. \qquad (A.10)$$
The resulting fixed-point equations m^ν(t') = m and Q(t') = Q for a memory pattern with macroscopic overlap are

$$m = \Phi\!\left(\frac{T - m}{\sqrt{p\alpha Q\left[1 + p^{2}NQ\right] + p\Delta^{2}(t)Q + \zeta^{2}}}\right) - \Phi\!\left(\frac{T + pm}{\sqrt{p\alpha Q\left[1 + p^{2}NQ\right] + p\Delta^{2}(t)Q + \zeta^{2}}}\right) \qquad (A.11)$$

and

$$Q = \Phi\!\left(\frac{T - m}{\sqrt{p\alpha Q\left[1 + p^{2}NQ\right] + p\Delta^{2}(t)Q + \zeta^{2}}}\right) + \frac{1}{p}\,\Phi\!\left(\frac{T + pm}{\sqrt{p\alpha Q\left[1 + p^{2}NQ\right] + p\Delta^{2}(t)Q + \zeta^{2}}}\right). \qquad (A.12)$$
These equations were solved numerically, starting from t = 0 and increasing t gradually, to find the transition time t = t_c at which the retrieval solution (m close to 1) breaks down.
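A sketch of this numerical procedure, using equation A.2 as reconstructed above; the threshold T, the noise term ζ², and the breakdown criterion m < 0.8 are illustrative choices of ours.

```python
import numpy as np
from scipy.special import erf

def Phi(x):
    """Gaussian tail integral, equation A.9."""
    return 0.5 * (1.0 - erf(x / np.sqrt(2.0)))

def collapse_time(sigma_eps, N=1000, M=50, p=0.05, T=0.7, zeta2=0.0, m_min=0.8):
    alpha = M / N
    for t in np.unique(np.logspace(0, 6, 400).astype(int)):   # coarse scan over t
        # Cumulative degradation noise, equations A.2-A.3.
        sigma2_t = sigma_eps**2 * np.expm1(t * sigma_eps**2) / np.expm1(sigma_eps**2)
        delta2 = sigma2_t * alpha * (1.0 + p**2 * N * alpha)
        # Iterate the fixed-point equations A.11-A.12 from the retrieval state.
        m, Q = 1.0, 1.0
        for _ in range(200):
            s = np.sqrt(p * alpha * Q * (1.0 + p**2 * N * Q) + p * delta2 * Q + zeta2)
            m, Q = (Phi((T - m) / s) - Phi((T + p * m) / s),
                    Phi((T - m) / s) + Phi((T + p * m) / s) / p)
        if m < m_min:                    # retrieval solution has broken down
            return t
    return None

print(collapse_time(sigma_eps=0.2))
```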
References

Abraham, W. C., & Bear, M. F. (1996). Metaplasticity: The plasticity of synaptic plasticity. Trends in Neurosciences, 19, 126–130.

Bailey, C. H., Montarolo, P., Chen, M., Kandel, E. R., & Schacher, S. (1992). Neuron, 9, 749–758.

Baudry, M., & Lynch, G. (1993). Long-term potentiation: Biochemical mechanisms. In M. Baudry & R. F. Thompson (Eds.), Synaptic plasticity (pp. 87–115). Cambridge, MA: MIT Press.

Bear, M. F., & Cooper, L. N. (1987). The physiological basis for the theory of synapse modification. Science, 237, 42–48.

Chechick, G., Meilijson, I., & Ruppin, E. (1997). Synaptic pruning in development: A novel account in neural terms. Sixth Annual Computational Neuroscience Meeting (CNS*97), Big Sky, Montana, July 1997.

Crick, F. (1984). Memory and molecular turnover. Nature, 312, 101.

Crick, F., & Mitchison, G. (1983). The function of dream sleep. Nature, 304, 111–114.

Dong, D. W., & Hopfield, J. J. (1992). Dynamic properties of neural networks with adapting synapses. Network, 3, 267–283.

Goelet, P., Castellucci, V. F., Schacher, S., & Kandel, E. R. (1986). The long and the short of long-term memory—a molecular framework. Nature, 322, 419–422.

Hasselmo, M. E. (1993). Acetylcholine and learning in a cortical associative memory. Neural Computation, 5, 32–44.

Herrmann, M., Hertz, J. A., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network, 6, 403–414.

Hobson, J. A., & McCarley, R. W. (1977). The brain as a dream state generator: An activation-synthesis hypothesis of the dream process. American Journal of Psychiatry, 134, 1335–1368.

Hopfield, J. J., Feinstein, D. I., & Palmer, R. G. (1983). "Unlearning" has a stabilizing effect in collective memories. Nature, 304, 158–159.

Horn, D., Levy, N., & Ruppin, E. (1996). Neuronal-based synaptic compensation: A computational study in Alzheimer's disease. Neural Computation, 8, 1227–1243.

Jones, D. G., Itarat, W., & Calverley, R. K. S. (1991). Perforated synapses and plasticity. Molecular Neurobiology, 5, 217–228.

Kavanau, J. L. (1994). Sleep and dynamic stabilization of neural circuitry: A review and synthesis. Behavioural Brain Research, 63, 111–126.

Kavanau, J. L. (1997). Memory, sleep and the evolution of mechanisms of synaptic efficacy maintenance. Neuroscience, 79, 7–44.

Lisman, J. (1994). The CAM kinase hypothesis for the storage of synaptic memory. Trends in Neurosciences, 17(10), 406–412.

Masliah, E. (1995). Mechanisms of synaptic dysfunction in Alzheimer's disease. Histology and Histopathology, 10, 509–519.

Masliah, E., & Terry, R. D. (1993). The role of synaptic pathology in the mechanisms of dementia in Alzheimer's disease. Clinical Neuroscience, 1, 192–198.
Purves, D., & Voyvodic, J. T. (1987). Imaging mammalian nerve cells and their connections over time in living animals. Trends Neurosci., 10, 398–404. Rakic, P., Bourgeois, P. P., & Eckenhoff, M. E. (1986). Concurrent overproduction of synapses in diverse regions of the primate cerebral cortex. Science, 232, 232–235. Rakic, P., Bourgeois, J. P., & Goldman-Rakic, P. J. (1993). Synaptic development of the cerebral cortex: Implications for learning, memory, and mental illness. Progress in Brain Research, 102, 227–243. Ruppin, E., Reggia, J., & Horn, D. (1996). A neural model of positive schizophrenic symptoms. Schizophrenia Bulletin, 22(1), 105–123. Saitoh, T., Masliah, E., Jin, L. W., Cole, G. M., Weiloch, T., & Shapiro, P. (1991). Protein kinases and phosphorylation in neurological disorders and cell death. Lab. Invest., 64, 596–616. Sompolinsky, H. (1986). The theory of neural networks: The Hebb rule and beyond. In J. L. van Hemmen & I. Morgenstern (Eds.), Heidelberg colloquium on glassy dynamics (pp. 485–527). New York: Springer-Verlag. Tsodyks, M. V. (1989). Associative memory in nerual networks with the Hebbian learning rule. Modern Physics Letters B, 3(7), 555–560. van Ooyen, A. (1994). Activity-dependent neural network development. Network, 5, 401–423. Wolff, J. R., Laskawi, R., Spatz, W. B., & Missler, M. (1995). Structural dynamics of synapses and synaptic components. Behavioural Brain Research, 66, 13–20.
Received October 15, 1996; accepted April 9, 1997.
NOTE
Communicated by Hans-Ulrich Bauer
Ordering of Self-Organizing Maps in Multidimensional Cases

Guang-Bin Huang, Haroon A. Babri
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

Hua-Tian Li
Department of Computer Science and Engineering, Northeastern University, People's Republic of China
It has been proved that in one-dimensional cases, the weights of Kohonen's self-organizing maps (SOM) will become ordered with probability 1; once the weights are ordered, they cannot become disordered in future training. It is difficult to analyze Kohonen's SOMs in multidimensional cases; however, it has been conjectured that similar results seem to be obtainable in multidimensional cases. In this note, we show that in multidimensional cases, even though the weights are ordered at some time, it is possible that they become disordered in the future.

1 Introduction

Biologically motivated self-organizing maps (SOM) have been the focus of much interesting neural network research. In what follows we focus on Kohonen's SOMs (Kohonen, 1989, 1995). Consider an SOM with n output neurons where the ith neuron has initial weight $w_i = \{w_{i1}, \ldots, w_{im}\}^T$. The feature map formation of the SOM follows an iterative procedure. At time t, a pattern $x = \{x_1, \ldots, x_m\}^T$ enters the network. Neuron c (the winner), whose weight w_c is metrically nearest to the pattern x, is selected,

$$\|w_c - x\| = \min_{s \in \{1,\ldots,n\}} \|w_s - x\|, \qquad (1.1)$$

and the weights of all neurons are then changed according to the updating rule,

$$w_s(t+1) = \begin{cases} w_s(t) + \eta\,\Lambda(c,s)\,(x - w_s) & \text{for } s \in N_c(t), \\ w_s(t) & \text{otherwise,} \end{cases} \qquad (1.2)$$
where η = η(t) is the learning rate (0 ≤ η ≤ 1). The function Λ(c, s) is
called the neighborhood function, which decreases with increasing distance between s and c. (The neighborhood function can be simply represented by Λ(c, s) in this note, but it should be noted that Λ(c, s) = Λ(r_c, r_s), where the vector r_i represents the ith neuron's coordinate, i.e., its position, in cortical space.)

Two important problems arise concerning the ordering of weights during the learning process of an SOM: (1) Can the weights of an SOM become ordered through self-organizing? (2) Once the weights of an SOM are ordered, do they always remain ordered in future training? For the one-dimensional case, the answers to the two problems are positive (Kohonen, 1995; Cottrell & Fort, 1987; Erwin, Obermayer, & Schulten, 1992). However, extending these results to the multidimensional case has been difficult at best. Kohonen (1995) has conjectured that in multidimensional cases, similar results seem to be obtainable. Budinich (1995) explains intuitively why the proofs of convergence to ordered states in the one-dimensional case do not extend to multidimensional cases. In this note, we further prove that in the multidimensional case, even if the weights of an SOM are ordered at some time, it cannot be guaranteed that they will always remain ordered.

2 Ordering in Multidimensional Cases

In multidimensional cases, there exist no simple relevant position relations among vector points, and how to define neighborhood relationships is still a problem for SOMs. In order to visualize them, the neighboring weights can be connected (mostly in a rectangular or hexagonal array) using straight line segments (Kohonen, 1989; Ritter & Schulten, 1986; Kangas, Kohonen, & Laaksonen, 1990). Normally we can have the following definition.

Definition 1. A simplex of a feature map is a closed field formed by the fewest straight line segments connecting neighboring weights of an SOM. The weights of an SOM are called ordered if all the simplexes formed by these weights are nonintersecting.

For the two-dimensional case, ordering of weights appears when no straight line segments connecting neighboring weights intersect each other. In fact, in order to prove our conclusions on Kohonen's conjecture, we need only consider the following reasonable case, which can occur in applications of Kohonen's SOMs.

Case 1. At time t, the weights of an SOM are ordered. A new pattern x enters the SOM and is within a simplex M, but the weight w_{c_1}(t), which is nearest to the pattern x, is outside M, and at least one neuron c_{i_0}, which is outside M, is not in N_{c_1}(t).
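For concreteness, a minimal sketch of one update, equations 1.1 and 1.2, on a two-dimensional lattice. The Gaussian neighborhood and its cutoff radius are our assumptions; the note only requires Λ(c, s) to decrease with lattice distance.

```python
import numpy as np

rng = np.random.default_rng(0)
side, m = 8, 2                          # 8 x 8 output lattice, 2-D input patterns
W = rng.random((side * side, m))        # weight vectors w_s
coords = np.array([(i, j) for i in range(side) for j in range(side)], dtype=float)

def som_step(W, x, eta=0.1, radius=2.0):
    # Winner c, equation 1.1: weight metrically nearest to x.
    c = np.argmin(np.linalg.norm(W - x, axis=1))
    # Neighborhood function Lambda(c, s), Gaussian in lattice distance.
    d = np.linalg.norm(coords - coords[c], axis=1)
    lam = np.exp(-(d / radius) ** 2)
    lam[d > 2 * radius] = 0.0           # neurons outside N_c(t) keep their weights
    # Update rule, equation 1.2.
    return W + eta * lam[:, None] * (x - W)

for _ in range(2000):
    W = som_step(W, rng.random(m))
```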
Suppose at time t there is a feature map as shown in Figure 1a, where the weights have been ordered. At time t + 1, when the new pattern x enters the SOM, the weight w_{c_1}(t) and its neighbors w_{c_2}(t), w_{c_4}(t), and so on, will move toward the pattern x according to the learning rule (see equation 1.2). After the weight adjustments, the simplex M formed by w_{c_5}(t), w_{c_8}(t), ..., and w_{c_10}(t) will change into a new simplex M' formed by w_{c_5}(t+1), w_{c_8}(t+1), ..., and w_{c_10}(t+1). (In Figure 1c some of the weights forming M are shown unchanged at this time.) Obviously x is also within M', and w_{c_1}(t) is outside M'. The line segment between w_{c_1}(t) and x will intersect one side of M'. Suppose it intersects the side w_{c_5}(t+1)w_{c_10}(t+1) and the cross point is p. As stated in Gasson (1983), if the point p divides the line segment w_{c_1}(t)x in the ratio λ : (1 − λ),

$$\lambda = \frac{\overline{w_{c_1}(t)\,p}}{\overline{w_{c_1}(t)\,x}},$$

then the point p can be denoted as p = w_{c_1}(t) + λ(x − w_{c_1}(t)). Given the weights w_{c_5}(t), w_{c_10}(t), and w_{c_1}(t) and the pattern x, we can get the solution for λ,

$$\lambda = \mathrm{Row}\left(\left[\big(x - w_{c_1}(t)\big)\;\;\big(w_{c_{10}}(t+1) - w_{c_5}(t+1)\big)\right]^{-1}\left[w_{c_{10}}(t+1) - w_{c_1}(t)\right],\; 1\right), \qquad (2.1)$$

where Row(A, i) represents the ith element of vector A.
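Equation 2.1 expresses that p lies both on the segment from w_{c_1}(t) to x and on the side from w_{c_10}(t+1) to w_{c_5}(t+1), i.e., it solves a 2 x 2 linear system. A sketch with our naming:

```python
import numpy as np

def intersection_lambda(x, w_c1, w_c5_new, w_c10_new):
    """Solve w_c1 + lam*(x - w_c1) = w_c10_new + mu*(w_c5_new - w_c10_new)
    for (lam, mu); equation 2.1 is the first component of the solution."""
    A = np.column_stack([x - w_c1, w_c10_new - w_c5_new])
    lam, _ = np.linalg.solve(A, w_c10_new - w_c1)
    return lam
```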
Since

$$\lambda = \frac{\overline{w_{c_1}(t)\,p}}{\overline{w_{c_1}(t)\,x}},$$

for any small positive value ε > 0, e.g., ε = η(t)Λ(0), it is possible that λ < ε in some applications. (For example, if x is input into the SOM repeatedly, the weight of c_1 can move toward x according to the learning rule, equation 1.2, leaving an arbitrarily small distance between w_{c_1}(t) and x.) Because w_{c_1}(t+1) = w_{c_1}(t) + η(t)Λ(0)(x − w_{c_1}(t)), if η(t)Λ(0) > λ, the weight w_{c_1}(t+1) lies within M'. According to Case 1, since c_{i_0}, which is outside M, is not in N_{c_1}(t), we have w_{c_{i_0}}(t+1) = w_{c_{i_0}}(t), and thus w_{c_{i_0}}(t+1) is outside M'. It is obvious that there exists a sequence of neurons c_{i_0}, c_{i_1}, c_{i_2}, ..., c_{i_k} = c_1 such that c_{i_j} is one of the neighboring neurons of c_{i_{j+1}} (j = 0, ..., k − 1). Because w_{c_{i_0}}(t+1) is outside M' and w_{c_{i_k}}(t+1) is within it, there exists l such that w_{c_{i_l}}(t+1) is outside M' but w_{c_{i_{l+1}}}(t+1) lies within M'; thus the straight line segment connecting the neighboring weights w_{c_{i_l}}(t+1) and w_{c_{i_{l+1}}}(t+1) will intersect one side of M', and the weights will have become disordered.

Case 1 can also appear in multidimensional cases with any type of lattice, even when the input pattern and the relative winner lie within the same simplex. Thus, we have the following theorem on the ordering of Kohonen's SOMs.
Figure 1: In a two-dimensional case, after a new pattern is input into an SOM, the weights that have been ordered may become disordered, where not only the winner but also its neighbors adjust their weights. Bold lines indicate the new segments between the winner and some of its neighbors after adjustments. (a) The weights are ordered at time t. (b) The region affected by the input pattern. (c) The weights that have been ordered become disordered at training time t + 1.
Theorem 1. In multidimensional cases, for any type of lattice and any small learning rate, even though the weights of an SOM have been ordered, it is possible that there exists a sequence of input patterns and their relative occurrence frequencies such that these weights become disordered at some future time.

References

Budinich, M. (1995). On the ordering conditions for self-organizing maps. Neural Computation, 7, 284–289.
´ Cottrell, M., & Fort, J.-C. (1987). Etude d’un processus d’auto-organisation. Annales de l’Institut Henri Poincar´e, 23(1), 1–20. Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55. Gasson, P. C. (1983). Computational geometry. In Geometry of spatial forms (pp. 431–505). New York: Ellis Horwood Ltd. Kangas, J., Kohonen, T., & Laaksonen, J. (1990). Variants of self-organizing maps. IEEE Trans. on Neural Networks, 1(1), 93–99. Kohonen, T. (1989). Self-organizing feature maps. In Self-organization and associative memory (pp. 119–157). Berlin: Springer-Verlag. Kohonen, T. (1995). The basic SOM. In Self-organization maps (pp. 77–130). Berlin: Springer-Verlag. Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen’s selforganizing sensory mapping. Biological Cybernetics, 54, 99–106. Received December 16, 1996; accepted May 21, 1997.
LETTERS
Communicated by Laurence Abbott
Predicting the Distribution of Synaptic Strengths and Cell Firing Correlations in a Self-Organizing, Sequence Prediction Model

Asohan Amarasingham
Echols Scholars Program, College of Arts and Sciences, and Department of Neurological Surgery, University of Virginia, Charlottesville, VA 22908, U.S.A.
William B. Levy Department of Neurological Surgery and Department of Psychology, University of Virginia, Charlottesville, VA 22908, U.S.A.
This article investigates the synaptic weight distribution of a self-supervised, sparse, and randomly connected recurrent network inspired by hippocampal region CA3. This network solves nontrivial sequence prediction problems by creating, on a neuron-by-neuron basis, special patterns of cell firing called local context units. These specialized patterns of cell firing—possibly an analog of hippocampal place cells—allow accurate prediction of the statistical distribution of synaptic weights, and this distribution is not at all gaussian. Aside from the majority of synapses that are, at least functionally, lost due to synaptic depression, the distribution is approximately uniform. Unexpectedly, this result is relatively independent of the input environment, and the uniform distribution of synaptic weights can be approximately parameterized based solely on the average activity level. Next, the results are generalized to other cell firing types (frequency codes and stochastic firing) and place cell–like firing distributions. Finally, we note that our predictions concerning the synaptic strength distribution can be extended to the distribution of correlated cell firings. Recently published neurophysiological results are consistent with this extension.

1 Introduction

The hippocampus is arguably involved with sequence prediction and the learning of context (Blum & Abbott, 1996; Levy, 1989; Hirsh, 1974; Nadel & Willner, 1980; Kesner & Hardy, 1983; Gray, 1982). Thus we have been studying self-organizing models of context-dependent sequence prediction using a simplified anatomy and physiology inspired by the biology of the hippocampus, particularly region CA3.

Such networks can learn to perform psychologically inspired paradigms believed to belong in the province of hippocampal functions (for a summary,
see Levy, 1996). These paradigms include context-dependent problems such as disambiguation, in which two different input sequences share a common subsequence (Minai, Barrows, & Levy, 1994); goal finding without search (Levy, Wu, & Baxter, 1995; Levy, 1996); certain nonlinear problems such as transverse patterning (Levy, Wu, & Tyrcha, 1996; Wu, Tyrcha, & Levy, 1996b); as well as simpler problems such as transitive inference (Levy & Wu, 1997).

The coding, that is, the cell firing patterns, that our network develops to represent sequential information is interesting. Special firing patterns identify specific, temporally contiguous subsequences within the overall sequence being learned. That is, the neurons in this model, particularly the (feedback) neurons that receive no input from the external layer sequence, learn to fire exclusively within specific temporal bounds of the sequence; we refer to these neurons as local context units. These local context units are hypothesized to be an analog of hippocampal place cells (Wu, Baxter, & Levy, 1996a) because place cells fire briefly as a rat walks through a place field (O'Keefe & Nadel, 1978). Moreover, these firing patterns fit the generic description of synfire chains suggested by Abeles (1991) and others, with temporally adjacent local context units serving as the synfire links in Abeles's definition.

Compared to topographically organized models of sensory cortices, our hippocampal model has random recurrent connections, so that it is largely structureless. However, with learning (i.e., local, online, associative synaptic modification), some structure may appear in the distribution of synaptic weights, and this structure might well reflect the structure of the input environment. Because the modification of synaptic strengths is arguably the fundamental basis of learning, understanding how the statistical structure of the input environment is impressed onto the synaptic weight matrix is a critical preparatory step for understanding how a network's dynamics are altered as a result of interaction with the environment. Here we pursue these ideas and look for such structure by developing the interrelationship between local context unit lifetime and the steady-state statistics of the synaptic weights. The analysis leads to three unanticipated results:

1. The distribution of nonzero synaptic strengths does not converge to a gaussian distribution but rather is essentially uniform.

2. A large majority of synapses is driven to zero strength.

3. The fraction of synapses driven to zero (and therefore the uniform distribution also) can be approximately predicted by activity levels alone. This means that the probability distribution of synaptic weights can be well approximated without specific knowledge of the input environment so long as local context codes form, or, equivalently, a
noise-free input environment does not leave its impression on the weight distribution beyond its effect on activity levels alone.

2 Model and Methods

2.1 Computational Architecture. Our CA3 model has been described several times (Levy & Wu, 1996; Wu et al., 1996a). The networks consist of an input layer (corresponding to layer II of the entorhinal cortex combined with the dentate gyrus) and a sparsely connected feedback layer (CA3-like). To simplify things, single axons, x_j, from the input layer always fire a CA3 neuron. The recurrent layer consists of 1024 binary {0,1} primary neurons with identical firing thresholds, θ. The neurons are interconnected via a Bernoulli process: each neuron j has a probability (c = 0.10) of receiving a modifiable excitatory connection from each neuron i in the recurrent layer. The presence of such a connection is indicated by the binary {0,1} variable c_ij. Inhibition is mediated by a single interneuron that receives input from all the primary neurons in the recurrent layer; this interneuron then provides an identical shunting conductance, proportional to its input, to all primary neurons. At time t, taking w_ij(t) as the excitatory weight from neuron i to neuron j, K_I as the fixed inhibitory weight from the input layer, and K_R as the fixed weight for feedback inhibition, the excitation y_j of neuron j is

$$y_j = \frac{\sum_{i=1}^{N} c_{ij}\,w_{ij}\,z_i(t-1)}{\sum_{i=1}^{N} c_{ij}\,w_{ij}\,z_i(t-1) + K_R\sum_{i=1}^{N} z_i(t-1) + K_I\sum_{i=1}^{N} x_i(t)}, \qquad (2.1)$$

where N is the number of neurons (1024) in the model. The output of CA3 neuron j is z_j(t) = 1 if y_j(t) ≥ θ or x_j(t) = 1, and 0 otherwise. Note that an active external input, x_j, always produces a firing, while neurons with no active external input are not forced to the zero state and can be fired through feedback connections.

2.2 The Learning Rule. A Hebbian-type postsynaptic associative modification rule is used in our simulations and analysis (Levy & Steward, 1979, 1983; Levy, 1982). For input i and output j, this postsynaptic rule is given by

$$w_{ij}(t) = w_{ij}(t-1) + \epsilon\,z_j(t)\left[z_i(t-1) - w_{ij}(t-1)\right], \qquad (2.2)$$

where w_ij is the weight of the synapse connecting neuron i to neuron j. Here ε = 0.01. A similar rule is used in other models (McClelland & Goddard, 1996; Treves & Rolls, 1994; Grossberg, 1982).
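A minimal sketch of equations 2.1 and 2.2; θ, K_I, K_R, and the initial weight value are placeholders of ours, since the excerpt does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
N, c, eps = 1024, 0.10, 0.01
theta, K_I, K_R = 0.05, 0.05, 0.05       # placeholder values

C = (rng.random((N, N)) < c).astype(float)     # Bernoulli connectivity c_ij
np.fill_diagonal(C, 0.0)
W = 0.5 * np.ones((N, N))                      # initial weights (placeholder)

def fire(z_prev, x):
    """Shunting excitation y_j (equation 2.1); external input x forces firing."""
    num = (C * W).T @ z_prev                   # sum_i c_ij w_ij z_i(t-1), per j
    den = num + K_R * z_prev.sum() + K_I * x.sum()
    y = np.divide(num, den, out=np.zeros(N), where=den > 0)
    return ((y >= theta) | (x > 0)).astype(float)

def learn(W, z_prev, z):
    """Postsynaptic associative rule, equation 2.2 (applied to all w_ij;
    only connected pairs, c_ij = 1, affect the dynamics)."""
    return W + eps * z[None, :] * (z_prev[:, None] - W)
```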
2.3 Input Sequences and Learning. Our approach has been to combine our minimal model of CA3 with a working assumption that the hippocampus receives sequential sensory information—in analogy to the sequential sensory and motor input received by a rat exploring an environment. A critical assumption here has been that this external input is sparse (Levy, 1989; Treves & Rolls, 1992, 1994); many more CA3 neurons receive input from CA3 feedback than from external sources. That is, a large number of the neurons in our model do not receive any external input and fire solely as a result of feedback input.

The nature of the temporal dependencies in the external input remains an unresolved issue. For example, it seems likely that sensory stimuli do not disappear immediately but are maintained over several time steps. Nevertheless, in order to ensure accuracy on this issue, we study a wide variety of input sequences. Interestingly, varying this aspect of the input environment turns out to have minimal impact on our results.

Nine different types of input sequences were used in the simulations; a generator is sketched below. Each external input pattern contained eight on-bits out of the 1024 neurons. For eight of the sequences, there was a constant shift of k bit(s) per unit of time (k = 1, 2, 3, 4, 5, 6, 7, or 8) from one pattern to the next, producing a successive overlap length of 8 − k for the input patterns. In the ninth input sequence, the overlap length of external patterns varied from one time step to the next over the range 0 to 7, so that the expected overlap is 3.5; this input sequence was included to demonstrate that these results do not depend on the shift rate being constant. Figure 1A illustrates such an input sequence.

The results of the simulations described here are for networks running near their sequence-length memory capacity. Sequence-length capacity (C) was reached by running simulations of different sequence lengths via a titration-like procedure. As noted previously (Levy & Wu, 1996), the capacity is simply related to the average activity, a, and the average local context lifetime, E[ℓ], as

$$C = \frac{E[\ell]}{a}, \qquad (2.3)$$

where E[ℓ] is the duration of firing (in time steps) for a local context unit, averaged over all local context units in the network.

Training consisted of presenting the input sequence to the network 350 times for the random overlap sequence and 300 times for all the other sequences. Testing consisted of randomizing network activity, giving the first pattern of the learned sequence, and then allowing the network to run on its own. The network's states are decoded by comparing its states at the end of training to the states during testing (Levy et al., 1995; Wu et al., 1996a; Levy & Wu, 1996). A sequence is learned if the network produces a minimum of 75% ordered recall of the recoded sequence during testing. We enforced robustness by also requiring that this definition be satisfied for at least four out of five randomly constructed networks with identical parameters. The reported data are averages of five successful simulations.
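A sketch of the constant-shift input generator (function and variable names are ours; the wraparound rule at the end of the neuron array is an assumption):

```python
import numpy as np

def shift_sequence(length, k, width=8, N=1024):
    """External input: width on-bits sliding k positions per time step,
    so successive patterns overlap in width - k bits."""
    X = np.zeros((length, N))
    for t in range(length):
        start = (t * k) % (N - width)
        X[t, start:start + width] = 1.0
    return X

# Example: with k = 2, patterns at t and t+1 share 8 - 2 = 6 on-bits.
X = shift_sequence(length=120, k=2)
```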
Figure 1: Typical input (A) and firing patterns (B) from a simulation of 1024 neurons before learning (only neurons 1–170 are shown). The input type is a random overlap input sequence. Large dots indicate firing (i.e., zi (t) = 1); small dots indicate no firing (zi (t) = 0). (A) The external input code for a random overlap input sequence described in section 2. In this example, the step-to-step input overlap uniformly varies between 0 and 7. (B) Network activities when the external input sequence is presented, before learning. Notice the contrast between the firing patterns before learning, here, and the firing patterns after learning, in Figure 2A.
For the calculations used later, a small correction is made for the unused neurons that do not fire in the network’s final coding. The presence of unused neurons in a steady-state coding modifies the synaptic weight prediction. We discarded these unused neurons (and therefore their synapses). Discarding unused neurons seems biologically valid (for references, see Voydovic, 1996).
3 Results

The results show that the network's steady-state synaptic weight distribution depends essentially on the distribution of local context lifetimes and the length of the input sequence. We quantify this dependence by making certain basic assumptions about the statistical properties of the model's context neurons and then confirming the suitability of these assumptions with simulations.
3.1 Properties of the Coding Scheme: Local Context Units. Figure 2A illustrates the recoding of the external input sequence (see Figure 1A) that the network created. Critical to the network's problem-solving abilities are the neurons that recognize a subsequence of the full sequence. We call these neurons local context units, and we hypothesize that they are analogous to hippocampal place cells (Levy et al., 1995). Specifically, neuron i is a local context unit if and only if there exist points in time, α_i (context unit start point) and β_i (context unit end point), such that z_i(t) = 1 if α_i ≤ t ≤ β_i, and z_i(t) = 0 otherwise. Consequently, we characterize the temporal length, or lifetime ℓ_i, of a local context unit i as β_i − α_i + 1, the number of time steps at which neuron i fires.

We know that local context units are crucial because we have found architectural and environmental conditions that preclude context unit formation (including high noise levels, orthogonal input sequences, and modifications in the parameters determining activity). These conditions destroy problem-solving ability just as they destroy the local context units (Wu et al., 1996a) in the context-dependent problems mentioned in section 1. Because the network does not perform well under such conditions, and also because context units are analogous to hippocampal place cells, which we know to exist, we have chosen to focus on situations in which context units do form.

All of the theoretical assumptions are summarized in Table 1. For the purposes of the theory developed here, we will assume that local context units are uniformly distributed over the sequence length; that is, the distributions of the α_i's and β_i's are not overly concentrated in one region of the sequence as opposed to another. Finally, we assume that the network's coding converges (stochastically) if the activity converges (stochastically); that is, the neural firing patterns eventually reach a steady state.
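The context unit definition, together with the unused and multiple-firing categories reported later in Table 3, translates directly into a test on a binary raster. Below is a small sketch under an assumed neurons-by-time array convention; the function name and toy raster are ours.

```python
import numpy as np

def context_unit_stats(raster):
    """Classify each neuron of a binary raster (neurons x time) per section 3.1.

    A neuron is a local context unit iff its firing times form one contiguous
    block [alpha_i, beta_i]; its lifetime is l_i = beta_i - alpha_i + 1.
    Returns (lifetimes dict, unused list, multiple-firing list).
    """
    lifetimes, unused, multiple = {}, [], []
    for i, row in enumerate(raster):
        times = np.flatnonzero(row)
        if times.size == 0:
            unused.append(i)                 # never fires in the final coding
        elif times[-1] - times[0] + 1 == times.size:
            lifetimes[i] = times.size        # contiguous block: a context unit
        else:
            multiple.append(i)               # fires at noncontiguous times
    return lifetimes, unused, multiple

# Toy raster: neuron 0 is a context unit (lifetime 3), neuron 1 is unused,
# neuron 2 fires noncontiguously.
raster = np.array([[0, 1, 1, 1, 0],
                   [0, 0, 0, 0, 0],
                   [1, 0, 1, 0, 0]])
print(context_unit_stats(raster))   # ({0: 3}, [1], [2])
```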
Figure 2: Typical firing patterns from a simulation of 1024 neurons after learning (only neurons 1–170 are shown). This is a continuation of the same simulation as in Figure 1. (A) Network activities when the external input sequence is presented, after learning. Notice how different the firing patterns are from Figure 1B, before learning. (B) Network activities during sequence completion testing; only the first pattern of the external input sequence is given as a probe. Notice the similarity of the firing patterns to those in Figure 2A.
Table 1: Assumptions and Approximations Used to Develop the Theoretical Arguments.

Assumptions
1. All neurons fire as local context units. (Section 3.3.1)
2. Convergence of the synaptic modification rule. (Section 3.3.2)

Approximations
1. Local context units are independent of connectivity. (Section 3.3.3)
2. Local context unit starting points are uniformly distributed over the sequence length. (Section 3.3.4)
3. End effects are negligible. (Section 3.3.5)
3.2 Relating Local Context Units and Synaptic Weights.

3.2.1 Weight modification and neuronal firing. As a consequence of the Hebbian-type associative synaptic modification rule (equation 2.2), synaptic weights reflect the temporal correlation between neuronal firing patterns. The more synchronous the firing activity between two neurons sharing a synapse, the greater the strength of that synapse. (This synchrony is actually displaced one time step due to the one-time-step delay in the learning rule. However, for simplicity, we will use the term synchrony forthwith, while actually meaning synchrony displaced one time step.) Because we are assuming that neural firing patterns reach a steady state, we can consider z_i(t−1) and z_j(t) as stationary, though not necessarily independent, random variables influencing a given synaptic weight w_ij via the associative modification rule (see equation 2.2). Since z(t) is {0,1} valued, modification occurs only when the postsynaptic neuron is active, and thus the synaptic weight becomes a running average of the conditional probability P(z_i(t−1) = 1 | z_j(t) = 1). Summing the weight modification rule recursively for this conditional case and averaging the sum as t → ∞ gives us the convergence value of the synaptic weight:

w_{ij} = \frac{P(z_j(t) = 1, z_i(t-1) = 1)}{P(z_j(t) = 1)}.  (3.1)
Therefore, assuming a steady state is reached, the steady-state synaptic strength is equivalent to the ratio of the number of time steps when the pre- (i) and postsynaptic (j) neurons are synchronously active to the number of time steps when the postsynaptic neuron is active. Cast in terms of probability, this gives:

P(w_{ij} = m) = P\left( \frac{P(z_j(t) = 1, z_i(t-1) = 1)}{P(z_j(t) = 1)} = m \right).  (3.2)
The essence of our idea is to calculate the two probabilities in equation 3.2, P(z_j(t) = 1, z_i(t−1) = 1) and P(z_j(t) = 1), in terms of the temporal specifications (α, β) of the context units i and j and the basic assumptions outlined in the previous section. Generally the intuition is this: consider a presynaptic context unit with fixed start point and end point, (α_i, β_i). For each possible position (α_j, β_j) of the postsynaptic context unit, the probabilities P(z_j(t) = 1, z_i(t−1) = 1) and P(z_j(t) = 1) are determined, and so, therefore, is their ratio (see equation 3.2). It follows that

P(w_{ij} = m) = \sum_{\alpha_i, \beta_i, \alpha_j, \beta_j} P(w_{ij} = m \mid \alpha_i, \beta_i, \alpha_j, \beta_j)\, P(\alpha_i, \beta_i, \alpha_j, \beta_j).

The computational difficulties such a calculation involves are eased by making several simplifying assumptions. In the end, we are able to produce a closed-form solution for the conditional probability P(w_{ij} = m | ℓ_i, ℓ_j) and then a numerical procedure for generating P(w_{ij}). These ideas are made more explicit in this and the following section.

The temporal correlation between pre- and postsynaptic firing patterns amounts to the temporal overlap between pre- and postsynaptic context units. This follows from the definition of a local context unit: z_i(t) = 1 if and only if α_i ≤ t ≤ β_i. Thus the event {z_j(t) = 1} ∩ {z_i(t−1) = 1} is equivalent to the event {α_i ≤ t − 1 ≤ β_i} ∩ {α_j ≤ t ≤ β_j}, or, equivalently, {max{α_i + 1, α_j} ≤ t ≤ min{β_i + 1, β_j}}. Counting the time steps in this interval, we define

\mathrm{overlap}_{ij} = \max\{\min\{\beta_i + 2, \beta_j + 1\} - \max\{\alpha_i + 1, \alpha_j\}, 0\},  (3.3)
so that overlap_ij is 0 if the difference is negative. Thus overlap_ij is the number of time steps t that satisfy the event {z_j(t) = 1} ∩ {z_i(t−1) = 1}. Setting S as the sequence length (measured in time steps), it follows that

P(z_j(t) = 1, z_i(t-1) = 1) = \frac{\mathrm{overlap}_{ij}}{S}.  (3.4)
The number of time steps satisfying {z_j(t) = 1} follows from the definition of a context unit j. Recalling the definition ℓ_j = β_j − α_j + 1,

P(z_j(t) = 1) = \frac{\ell_j}{S}.  (3.5)
Finally, combining equations 3.4 and 3.5 via 3.2, we have

P(w_{ij} = m) = P\left( \frac{\mathrm{overlap}_{ij}}{\ell_j} = m \right).  (3.6)
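Equations 3.3 through 3.6 reduce to a few lines of integer arithmetic on the (α, β) pairs. A sketch with hypothetical unit boundaries (the function names and example values are ours):

```python
def overlap(alpha_i, beta_i, alpha_j, beta_j):
    """Equation 3.3: number of time steps t with z_j(t) = z_i(t-1) = 1."""
    return max(min(beta_i + 2, beta_j + 1) - max(alpha_i + 1, alpha_j), 0)

def steady_state_weight(alpha_i, beta_i, alpha_j, beta_j):
    """Equation 3.6: w_ij = overlap_ij / l_j, with l_j = beta_j - alpha_j + 1."""
    return overlap(alpha_i, beta_i, alpha_j, beta_j) / (beta_j - alpha_j + 1)

# Presynaptic unit fires over [3, 6]; postsynaptic over [5, 9] (lifetime 5).
# z_i(t-1) = 1 for t in [4, 7] and z_j(t) = 1 for t in [5, 9]: overlap is {5, 6, 7}.
print(overlap(3, 6, 5, 9))              # -> 3
print(steady_state_weight(3, 6, 5, 9))  # -> 0.6
```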
3.2.2 Predicting the synaptic weight distribution from a context lifetime distribution. The next step is to formulate the conditional probabilities
P(overlap_ij | ℓ_i, ℓ_j). This is done by rewriting the definition of overlap_ij (see equation 3.3) in terms of ℓ,

\mathrm{overlap}_{ij} = \max\{\min\{\alpha_i + \ell_i + 1, \alpha_j + \ell_j\} - \max\{\alpha_i + 1, \alpha_j\}, 0\},  (3.7)
and partitioning the context lifetime probability space into three cases: ℓ_i = ℓ_j, ℓ_i > ℓ_j, and ℓ_i < ℓ_j. Within these partitions, the conditional probability P(w_ij | ℓ_i, ℓ_j) can be simplified by making two assumptions: α is uniformly distributed, and α_i and α_j are independent for all connected pairs i, j. In addition, to avoid considering start-up and end effects on the weight distribution, we adopt the approximation that the context units are not close enough to the beginning or end of the sequence to affect the overlap probabilities (Approximation 3; see Table 1). The calculation is made explicit for Case I (ℓ_i = ℓ_j) and is summarized for ℓ_i ≠ ℓ_j in the Appendix. The results are:
P\left( w_{ij} = \frac{x}{\ell_j} \,\Big|\, \ell_i, \ell_j \right) =
\begin{cases}
\dfrac{S - 2\ell_j - \ell_i + 2}{S - \ell_j + 1}, & x = 0 \\[1ex]
\dfrac{2}{S - \ell_j + 1}, & 0 < x < \min\{\ell_i, \ell_j\} \\[1ex]
\dfrac{|\ell_i - \ell_j| + 1}{S - \ell_j + 1}, & x = \min\{\ell_i, \ell_j\}
\end{cases}  (3.8)
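Equation 3.8 is easy to tabulate exactly. Below is a sketch using rational arithmetic so the three cases can be checked to sum to one; the function name and parameter values are ours, and the formula is only meaningful when the units sit well away from the sequence boundaries, as Approximation 3 requires.

```python
from fractions import Fraction

def conditional_weight_dist(l_i, l_j, S):
    """Equation 3.8: P(w_ij = x / l_j | l_i, l_j) for x = 0 .. min(l_i, l_j)."""
    denom = S - l_j + 1
    m = min(l_i, l_j)
    dist = {0: Fraction(S - 2 * l_j - l_i + 2, denom),
            m: Fraction(abs(l_i - l_j) + 1, denom)}
    for x in range(1, m):
        dist[x] = Fraction(2, denom)
    return dist

dist = conditional_weight_dist(l_i=3, l_j=4, S=40)
print(dist)                 # overlap values x = 0..3 and their probabilities
print(sum(dist.values()))   # -> 1
```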
The partitioned conditional probabilities described here completely define the conditional weight distribution, P(w_ij | ℓ_i, ℓ_j). Combining this conditional distribution with the empirical context lifetime distribution, P(ℓ), derived from network simulations, yields the distribution P(w_ij = x). Using

P(w_{ij} = x) = \sum_{y} \sum_{z} P(w = x \mid \ell_i = y, \ell_j = z)\, P(\ell_i = y, \ell_j = z),  (3.9)
we then apply an additional assumption, that presynaptic and postsynaptic context lifetimes are independent:

P(w_{ij} = x) = \sum_{y} \sum_{z} P(w = x \mid \ell_i = y, \ell_j = z)\, P(\ell = y)\, P(\ell = z).  (3.10)
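Mixing equation 3.8 over an empirical lifetime histogram, as equation 3.10 prescribes, is then a double loop. A sketch that assumes the conditional_weight_dist function from the previous snippet and a hypothetical two-point lifetime distribution:

```python
from collections import defaultdict
from fractions import Fraction

def weight_distribution(lifetime_dist, S):
    """Equation 3.10: mix P(w | l_i, l_j) over independent lifetimes."""
    out = defaultdict(Fraction)
    for l_i, p_i in lifetime_dist.items():
        for l_j, p_j in lifetime_dist.items():
            for x, p in conditional_weight_dist(l_i, l_j, S).items():
                out[Fraction(x, l_j)] += p_i * p_j * p   # weight value x / l_j
    return dict(out)

# Hypothetical empirical lifetime distribution P(l).
p_l = {2: Fraction(1, 2), 4: Fraction(1, 2)}
pw = weight_distribution(p_l, S=40)
print(pw[Fraction(0)])   # predicted mass at zero weight
```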
In this way we have the complete distribution of weights based on knowledge of the distribution of local context unit lifetimes, P(ℓ).

3.2.3 Simplifications and comparison of theory to simulations. If predictions like those generated here have any hope of being tested in a biological experiment, we must relax the requirement for knowledge of the distribution
P(ℓ). A simpler theoretical perspective arises by assuming that all local context units are of the same length, E[ℓ]. It then follows from equations 3.8 and 3.10 that

P(w_{ij} = 0) = \frac{S - 3E[\ell] + 2}{S - E[\ell] + 1}  (3.11a)

P(w_{ij} = x/E[\ell]) = \frac{2}{S - E[\ell] + 1}, \quad \text{for } 0 < x < E[\ell]  (3.11b)

P(w_{ij} = 1) = \frac{1}{S - E[\ell] + 1}  (3.11c)
The qualitative conclusions we might draw from equations 3.11a–c are the same as those for equation 3.8. The zero-weight proportion (see equation 3.11a) dominates in the biologically relevant case of low activity, because P(w_ij = 0) → 1 as E[ℓ]/S → 0, and E[ℓ]/S is a measure of neuronal activity. For w_ij > 0, the distribution is uniform, with the exception of a disproportion at w_ij = 1. Therefore, to further simplify the weight distribution, we calculate P(w_ij = 0) from equation 3.11a and then simply project a uniform distribution over the remaining cases, w_ij > 0.

Figure 3 compares these predicted distributions to the average distribution obtained from empirical results for two input sequences. The prediction of P(w_ij = 0) is remarkably good in these histograms and across all nine input types. The uniform distribution of the simplified theory also does very well, except at P(w_ij = 1), where the prediction is not accurate: the largest bin is underestimated by about 50%. The error here, in a situation in which local context units are perfectly correlated, stems from the assumption that they are uncorrelated. Fortunately, this assumption seems to cause very little inaccuracy elsewhere. The impression of accuracy conveyed by these histogram comparisons holds for all nine input sequences (see Table 2, which quantifies the error in these comparisons by measuring histogram bin differences between theory and simulation).

3.2.4 Zero-Weight Approximations from Average Activity. For a network running at sequential memory capacity, equations 3.11a and 2.3 imply a relationship that is approximately

P(w_{ij} = 0) \sim 1 - 2a.  (3.12)
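The agreement discussed below is quick to check numerically. A sketch using the overlap-four row of Table 3 for its inputs:

```python
a, mean_l, S = 0.061, 3.60, 57    # activity, E[l], sequence length (Table 3)

exact = (S - 3 * mean_l + 2) / (S - mean_l + 1)   # equation 3.11a
approx = 1 - 2 * a                                # equation 3.12
print(exact, approx)   # about 0.886 vs 0.878: within roughly 1%, as reported
```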
The crudeness of this approximation depends on activity and capacity, with accuracy increasing both as the network approaches capacity (as S ↑ C) and as activity decreases. In the case of the overlap four input sequence, the 1 − 2a approximation was within 1% of the prediction generated by equation 3.11a and within 3% of the empirical value for all five simulations.
Figure 3: Synaptic weight histograms comparing simulations and theoretical predictions. The predicted weight distributions were derived by using equation 3.12 to predict P(wij = 0) and projecting a uniform distribution across the rest of the weight space. The weight space was divided into 15 bins: [0,1/15), [1/15,2/15), [2/15,3/15), . . . , [13/15,14/15), [14/15,1]. In both A and B, the abscissa is broken and the scale changes. (A) Empirical and theoretical weight distributions after learning of the random overlap input sequence described in Figure 1A. Distributions are averages over five runs. (B) Empirical and theoretical weight distributions after learning of an overlap four external input sequence. Distributions are averages over five runs.
Table 2: Prediction Errors, Organized in Terms of the Overlap of the External Input Sequence.

           Predicted from P(ℓ), Equation 3.10    Predicted from E[ℓ], Equation 3.11a
Overlap    Error        Var(error)               Error        Var(error)
0          1.32e-2      6.22e-4                  1.36e-2      5.32e-4
1          1.40e-2      6.49e-4                  1.34e-2      6.50e-4
2          1.30e-2      4.89e-4                  1.16e-2      5.67e-4
3          1.02e-2      1.84e-4                  7.3e-3       1.96e-4
4          7.3e-3       6.09e-5                  3.7e-3       4.42e-5
5          5.5e-3       5.49e-5                  3.7e-3       4.38e-5
6          5.1e-3       6.22e-5                  4.4e-3       4.91e-5
7          5.8e-3       8.04e-5                  4.4e-3       6.67e-5
Random     1.34e-2      8.30e-5                  8.5e-3       7.16e-5

Note: The weight space, [0,1], was divided into 15 bins, as in Figure 3. In each simulation, the error associated with bin k was measured as the absolute value of the difference between the predicted fraction P(w_ij ∈ bin_k) and the observed fraction (1/N_c²) Σ I(w_ij ∈ bin_k), where N_c² is the number of synapses in the network. The reported error is the average value of this error measure over all bins in five simulations of each input sequence. Var(error) is the variance of the error measure over the same set.
For sequences of length less than capacity (S < C), this approximation can be corrected for the total fraction of used neurons, u, via the substitution S = uE[ℓ]/a. This gives

P(w_{ij} = 0) \sim 1 - \frac{2a}{u} - \frac{2a^2}{u^2 - au},

which can be approximated as

P(w_{ij} = 0) \sim 1 - \frac{2a}{u},  (3.13)

for small values of the ratio a/u.

3.3 Validity of the Theoretical Assumptions. The central result of this article concerns the distribution of synaptic weights and is best demonstrated by the empirical results. The theory was developed in order to examine the validity of this result within the broader context of sequence-predicting neural models. A list of the assumptions used for its development is provided in Table 1.
Table 3: Simulation Statistics.

Overlap    Sequence Length    Activity^a    E[ℓ]    Unused Neurons^b    Multiple Firing^c
0          20                 0.051         1.01    144                 155
1          24                 0.048         1.21    153                 105
2          33                 0.049         1.59    112                 121
3          47                 0.064         3.07    64                  45
4          57                 0.061         3.60    59                  15
5          90                 0.055         5.11    45                  15
6          110                0.050         5.85    63                  6
7          160                0.054         9.49    113                 15
Random     22                 0.138         2.71    77                  15

Note: The data points represent averages over five simulations per input sequence.
a. Defined as (1/NS) Σ_{i,t} z_i(t), when the external input sequence is presented, after learning. E[ℓ] is defined per the text.
b. A count of the neurons that do not fire in the final coding.
c. A count of the neurons that do not satisfy the local context unit definition (i.e., fire at different, noncontiguous time points in the sequence) in the final coding.

3.3.1 Assumption 1: Local context unit formation. Perhaps the most critical assumption is that coding neurons behave as local context units; they all fire
exclusively within specific boundaries within the sequence. The empirical results presented here certainly justify such an assumption (see Table 3, particularly the statistics concerning multiple firing neurons), but we believe that the formation of local context units is important in a broader sense, particularly with regard to hippocampal function. Experimental support for local context units comes from recordings of place cells in the hippocampus of rats that have learned a spatial environment (O’Keefe & Nadel, 1978). Place cells fire when a rat traverses specific spatial locations within the learned environment; that is, these cells fire more or less exclusively within specific spatial boundaries. Moreover, because navigation over space is necessarily continuous with respect to time, it follows that the sensory information provided during navigation is itself sequential, and thus these neurons are firing exclusively within specific boundaries in a sequence. This firing pattern is precisely the definition of a local context unit and suggests an important relationship between the sequence-learning model presented here and experimental neuroscience. A second justification for the relevance of local context units comes from their functional importance: local context units form associative bridges over time. If the computational challenge in sequence learning is to develop appropriate synaptic interactions between neurons representing arbitrary sequences of external stimuli, then a significant biological concern is that such connections may not exist in the first place. This issue is avoided in sequence-learning models, such as those of Griniasty, Tsodyks, and Amit
(1993) and Amit, Brunel, and Tsodyks (1994), in which each neuron is assumed to be connected to every other neuron. In the biologically relevant context of sparse connectivity, however, connections between neurons representing successive external patterns may not exist. By allowing recurrent neurons (which are not originally associated with any external stimulus) to play a role in encoding the sequence, this CA3 model demonstrates how a more reasonable connectivity can form the required temporal-to-spatial association, that is, by forming local context units. The formation of local context units is also necessary for the model to solve several interesting (i.e., context-dependent) sequence-prediction problems above and beyond simple sequential recall (as discussed in section 1; for a review, see Levy, 1996).
3.3.2 Assumption 2: Convergence of the learning rule. The other important assumption is that the associative weight converges. This implies that synaptic weights take on a scaled correlation of presynaptic and postsynaptic firing, E[z_j(t) z_i(t−1)]/E[z_j(t)]. In the case of the binary neurons used here, this correlation reduces to the conditional probability that the presynaptic neuron fires given that the postsynaptic neuron fires. In a recurrent network, we can provide no guarantees; however, the success of the theory in predicting the empirical weight distributions is perhaps fundamentally a vindication that synaptic weights can take on these conditional probabilities. Whether convergence is achieved is an important issue, because several models (Griniasty et al., 1993; Amit et al., 1994; Bienenstock, 1995) have implicitly assumed this sort of convergence by directly loading correlations into the synaptic weights. Thus the success of this theory demonstrates a biologically plausible learning rule that learns these correlations. Finally, because place cells are stable (Thompson & Best, 1989), the biological analog of convergence, we have reason to believe that convergence is a reasonable biological assumption.
3.3.3 Approximation 1: Local context units are independent of connectivity. This assumption is quite clearly an approximation, since, in an absolute sense, it is not true. In reality, the fact that two neurons are connected does influence the relative positions (starting points) of their local context units; it increases the likelihood that they are temporal neighbors. Because local context units are determined by two parameters, start point and lifetime, we express this approximation mathematically by the statements P(α_i, α_j | c_ij = 1) = P(α_i, α_j) and P(ℓ_i, ℓ_j | c_ij = 1) = P(ℓ_i, ℓ_j). (When we generalize, in section 3.4, to firing patterns that are parameterized by center μ and variance σ², rather than by starting point and lifetime, this assumption becomes P(μ_i, μ_j | c_ij = 1) = P(μ_i, μ_j) and P(σ_i², σ_j² | c_ij = 1) = P(σ_i², σ_j²).)
3.3.4 Approximation 2: Local context unit starting points are uniformly distributed over the full sequence. This is a fairly simple approximation, which we expect to be essentially true for the constant-shift input sequences. However, the theoretical results hold quite well even when local context units are not uniformly distributed, as indicated by the results in the case of the random overlap input sequence. Thus, this does not appear to be a critical assumption, and, in fact, it would not be difficult to extend the theory to handle any alternative distribution, so long as the distribution is specified.

3.3.5 Approximation 3: End effects are negligible. This approximation arose out of the need to calculate the α_i, α_j overlap distribution given only that c_ij = 1 (the two neurons are connected) and using Approximations 1 and 2 only. This calculation takes a simpler form when the negligible effects imposed by the skewing of the distribution at the boundaries of the sequence are ignored, so we assumed that the presynaptic local context unit was not at a boundary of the sequence, specifically, that ℓ_i < α_i < S − ℓ_i when we calculated the overlap distribution. The situation is analogous for the continuous-valued firing patterns discussed in section 3.4. There we place the center of a prototypical neuron's firing function at the center of the sequence and then calculate the probability distribution of weights for the synapses that input to this prototypical neuron. We then assume that all neurons have a similar distribution; that is, we assume that end effects are negligible.

3.4 Alternative Cell Firing Models. It is natural to wonder whether the results described so far are particular to the assumptions made about cell firing, specifically, the assumptions of discrete time and deterministic, binary output. In this section, we briefly illustrate the extension of our analysis to other cell firing distributions and demonstrate that this variation does not affect the essential result that the synaptic weight distribution is not gaussian or even centrally peaked.

3.4.1 Continuous firing. The natural continuous-valued analog to discrete local context unit firing patterns is a uniform firing function of the form f_i(t) = 1 if α_i ≤ t ≤ β_i, and 0 otherwise, with α_i, β_i, and t real valued. Alternatively, a continuous-valued local context neuron could exhibit peak firing at a specific time point (or "center") in the sequence, with a more gradual decay in firing as the network state moves away from this center. To study the difference between these two firing models (uniform and decaying) in the continuous case, we can specify the firing function by using clipped polynomial expressions of the form

f_i(t) =
\begin{cases}
1 - \left( \dfrac{t - \mu_i}{\sigma_i} \right)^{2k}, & \mu_i - \sigma_i \le t \le \mu_i + \sigma_i \\[1ex]
0, & \text{otherwise.}
\end{cases}  (3.14)
While equation 3.14 is less familiar in the current context than more traditional gaussian firing models, it gives a more satisfying comparison between uniform and decaying firing models. Rather than using starting points and ending points to parameterize cell firing, each cell i is now characterized by a center, μ_i, and a half-width scaling term, σ_i. Cells characterized by equation 3.14 reach peak firing rates at μ_i; firing decays over the length σ_i as the network moves away from the center, and the general shape of this decay is determined by (positive, integer-valued) k. In the limit as k approaches ∞, equation 3.14 becomes a uniform firing function with starting point μ_i − σ_i and ending point μ_i + σ_i. Thus, we can easily contrast uniform versus decaying firing models by comparing the synaptic weight distribution for low and high values of k. Figures 4A and 4B plot this firing function for various values of k and σ.

Picking up where we began in section 3.2.1, we can develop the synaptic weight distribution for the case of these continuous-valued firing functions using the same assumptions we used above (as enumerated in Table 1). Again, convergence assumptions (Assumption 2 in Table 1) imply that the synaptic weight w_ij takes on the correlation E[f_j(t) f_i(t−1)]/E[f_j(t)] (for a more precise justification in this continuous case, see Geman, 1979). For the sake of brevity, we assume that σ_i = σ is the same for each neuron (this is essentially the simplification technique of section 3.2.3, where we assume that all context units have the same lifetime). As a result,

1/E[f_j(t)] = \frac{2k + 1}{4k\sigma}

is a constant, which we call r. With σ fixed, then, we are interested in how the correlation r E[f_j(t) f_i(t−1)] varies as a function of the context unit centers, μ_i and μ_j. By applying Approximations 1 and 2, we assume that context unit centers are uniformly distributed across the sequence and note that, by virtue of independence, E[f_j(t) f_i(t−1)] = E[f_j(t) f_i(t)]. We denote the distance between place cell centers μ_i and μ_j as d_ij = |μ_i − μ_j|. By again ignoring end effects (Approximation 3), we obtain the synaptic weight as a function of the distance between place cell centers (the weight-distance function):

w_{ij} = r \int_{d_{ij} - \sigma}^{\sigma} f(t)\, f(t - d_{ij})\, dt
       = r \int_{d_{ij} - \sigma}^{\sigma} \left( 1 - \left( \frac{t}{\sigma} \right)^{2k} \right) \left( 1 - \left( \frac{t - d_{ij}}{\sigma} \right)^{2k} \right) dt,  (3.15)
in the case that d_ij < 2σ; otherwise, w_ij = 0. The advantage of using polynomials rather than gaussian functions becomes clear here, because these integrals can be evaluated in closed form. Equation 3.15 can be rewritten as a closed function of σ, d_ij, and k with the binomial theorem, though the resulting
series is a bit cumbersome for general k. Nevertheless, it is simple to show that the series generated by this expansion reduces to σ − (1/2)d_ij in the limit as k ↑ ∞. That is, as the firing function approaches a uniform distribution, the synaptic weight between two cells becomes a linear function of the distance between their centers. On the other hand, for low values of k, the synaptic weight between two neurons is not a linear function of distance but is instead a decaying, polynomial function of distance (e.g., for k = 1, σ = 1, a fifth-degree polynomial: w_{ij} = -d_{ij}^5/40 + d_{ij}^3/2 - d_{ij}^2 + 4/5).
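The k = 1 closed form can be reproduced with a computer algebra system. A sketch assuming SymPy is available; it carries out the integral of equation 3.15 symbolically:

```python
import sympy as sp

t, d = sp.symbols('t d')
k, sigma = 1, 1                          # simplest case treated in the text
r = sp.Rational(2 * k + 1, 4 * k) / sigma

# Equation 3.15 for k = 1, sigma = 1: integrate over the support overlap.
f = (1 - (t / sigma) ** (2 * k)) * (1 - ((t - d) / sigma) ** (2 * k))
w = sp.expand(r * sp.integrate(f, (t, d - sigma, sigma)))
print(w)   # expect -d**5/40 + d**3/2 - d**2 + 4/5
```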
Figure 4C plots the synaptic weight as a function of distance for σ = 1 and various values of k. The difference between these two classes of synaptic weight-distance curves (linear versus polynomial) is important because the synaptic weight distribution depends essentially on the rate of change of the synaptic weight function with respect to the distance between place cell centers. Before we take this last step, however, we point out that the linear version of this function (which arises from the uniform firing model) is essentially the result we have quantified for our McCulloch-Pitts hippocampal model, albeit in continuous form (this can be seen from equations 3.4 and 3.5, noting that overlap_ij is the discrete equivalent of K − d_ij, with K some constant). This similarity leads to the reassuring conclusion that the shape of the synaptic weight distribution is consistent whether the neurons are continuous or binary.

Once the synaptic weight has been expressed as a function of the distance between place cell centers (the weight-distance function), all that remains is to apply the assumption that the distance between place cell centers is uniformly distributed throughout the network (equivalent to Approximation 2 in a one-dimensional world). This specifies the synaptic weight distribution. Intuitively, since this distance is uniformly distributed and the weight-distance function is monotonically decreasing, the density of the weight distribution will be relatively higher at those weights where the weight-distance function is changing slowly. Thus, it makes sense that a linear weight-distance function (with a constant rate of change) would give rise to a uniform weight distribution.
Figure 4: Weight analysis for the polynomial firing rate function, equation 3.14, f_i(t) = 1 − ((t − μ_i)/σ_i)^{2k}. Here μ_i is 0. (A) f_i(t) for various values of k. The parameter σ is held constant at 1. Notice how f_i(t) approaches the uniform firing function as k increases. (B) f_i(t) for various values of σ; k is held constant at 1. (C) The synaptic weight as a function of the distance between place centers for two cells (the weight-distance function for two neurons), as in equation 3.15. The weight-distance function is plotted for several values of k, with σ = 1. Notice how the function becomes linear as k increases. (D–F) Nonzero synaptic weight histograms were generated numerically for k of 1, 4, and 8. A uniform distribution of the distance between place cell centers (as would be the case if place cells were uniformly distributed in a one-dimensional world), and no end effects, were assumed. A set of distances evenly spaced (between 0 and 2) was transformed by the appropriate weight-distance function, and the histograms were constructed from this transformed set. Notice how the nonzero synaptic weight distribution becomes increasingly uniform as k increases.
Similarly, for the type of weight-distance function that arises for a gradually decaying firing function such as k = 1 (and also for a gaussian firing function; see below), the weight distribution will be relatively higher for high and low weights than for intermediate weights, giving rise to a U-shaped type of distribution.

More formally, we partition the zero and nonzero synaptic weight distributions. Generalizing, suppose we have a weight-distance function, g, such that w_ij = g(d_ij), which is strictly decreasing (hence invertible) for d_ij ≤ 2σ and zero valued for d_ij > 2σ. Distances are distributed uniformly over half the sequence length (density 2/S). In the first case, we have P(w_ij = 0) = P(d_ij > 2σ) = 1 − 4σ/S, where S is again the sequence length. For the nonzero weights,

P(0 < w_{ij} \le t) = P(0 < g(d_{ij}) \le t) = P(g^{-1}(t) \le d_{ij} < g^{-1}(0))
= \int_{g^{-1}(t)}^{g^{-1}(0)} \frac{2}{S}\, dx = -\frac{2}{S} \int_{0}^{t} \frac{dg^{-1}(x)}{dx}\, dx.  (3.16)
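Equation 3.16 also suggests the numerical recipe described in the Figure 4 caption: push evenly spaced distances through the weight-distance function and histogram the result. A sketch using the k = 1, σ = 1 quintic from above; the bin count and grid size are arbitrary choices of ours.

```python
import numpy as np

def weight_histogram(g, d_max, bins=15, n=100_000):
    """Numerically realize the nonzero weight density implied by equation 3.16:
    transform a uniform grid of distances through the weight-distance function g."""
    d = np.linspace(0, d_max, n)
    w = g(d)
    hist, edges = np.histogram(w[w > 0], bins=bins, range=(0, 1))
    return hist / hist.sum(), edges

# k = 1, sigma = 1 weight-distance function from the text (zero beyond d = 2).
def g_poly(d):
    return np.where(d < 2, -d**5 / 40 + d**3 / 2 - d**2 + 0.8, 0.0)

freq, _ = weight_histogram(g_poly, d_max=2)
print(freq)   # mildly U shaped: mass piles up at low and high weights
```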
The density function of the nonzero synaptic weight distribution is thus

-\frac{2}{S} \frac{dg^{-1}(x)}{dx},

proportional to the first derivative of the inverse of the weight-distance function. To determine the synaptic weight distribution for other cell firing models, all that is needed is an expression of the synaptic weight as a function of the distance between place cell centers, such as is computed in equation 3.15. Provided this function is decreasing (the only imaginable case in pure place identification), the density of the synaptic weight distribution will be proportional to the first derivative of the inverse of this function; geometrically, this means that the density will be relatively higher at those places where the weight-distance function is changing relatively slowly, so it should be possible to draw qualitative conclusions from the weight-distance function itself.

Even for the simplest case, k = 1, σ = 1, the weight-distance function arising from equation 3.14 is a fifth-degree polynomial (above), which we did not invert. However, the weight-distance function (see Figure 4C) itself indicates that its rate of change is smaller for low and high distances (by implication, high and low weights) than for intermediate distances (weights). This result suggests a U-shaped distribution for the nonzero synaptic weights, as is confirmed by numerical calculations (see Figure 4D). The difference between trough and peak, however, is not large, so the uniform distribution remains a good approximation. As k increases (i.e., as the firing function becomes more uniform), however, the weight-distance function becomes more linear (approaching σ − (1/2)d_ij) and gives rise to a uniform nonzero distribution. This can be demonstrated analytically, in the
limit as k ↑ ∞. Figure 4 illustrates these results from numerical calculations for various values of k.

In summary, a uniform place-identification firing model gives rise to a uniform nonzero synaptic weight distribution in the case of continuous neurons (and continuous time), as well as in the discrete case of the McCulloch-Pitts model. If a decaying firing model is used, the synaptic distribution rises at the high and low ends of the weight spectrum, becoming U shaped; however, even in this case, the uniform distribution remains a good approximation. The result for the zero weights remains essentially the same.

3.4.2 Gaussian place cell firing. Extending the previous analysis to the case of gaussian place cell firing is straightforward. Suppose now the firing function for cell i is a gaussian function,

f_i(t) = \exp\left( -\frac{(t - \mu_i)^2}{2\sigma^2} \right),  (3.17)
with center μ_i. For the sake of simplicity, we again assume that the variance σ² is the same for all cells. We develop the weight-distance function in the same way, with d_ij again the distance between place cell centers:

w_{ij} = \frac{\int f_i(x)\, f_j(x)\, dx}{\int f_j(x)\, dx} = \frac{\int f(x)\, f(x - d_{ij})\, dx}{\int f(x)\, dx}.  (3.18)
The numerator of equation 3.18 reduces with equation 3.17:

\int f(x)\, f(x - d_{ij})\, dx = \int \exp\left[ \frac{1}{2\sigma^2} \left( -x^2 - (x - d_{ij})^2 \right) \right] dx
= \exp\left( -\frac{d_{ij}^2}{4\sigma^2} \right) \int \exp\left[ -\frac{(x - \frac{1}{2} d_{ij})^2}{\sigma^2} \right] dx,  (3.19)

which gives

w_{ij} = \frac{ \exp\left( -\frac{d_{ij}^2}{4\sigma^2} \right) \int \exp\left[ -\frac{(x - \frac{1}{2} d_{ij})^2}{\sigma^2} \right] dx }{ \int \exp\left[ -\frac{x^2}{2\sigma^2} \right] dx } \approx \frac{\sqrt{2}}{2} \exp\left( -\frac{d_{ij}^2}{4\sigma^2} \right).  (3.20)
The natural course would be to take these integrals over the sequence length. But if instead we approximate with integrals extending to infinity, the error functions disappear and the weight distribution becomes analytically tractable (this approximation follows from Approximation 3; see section 3.3.5), because the ratio of the two integrals then becomes √2/2. The weight-distance curve is thus itself a gaussian function.
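Under this infinite-integral approximation, the numerical half of the Figure 5 comparison amounts to a single transform of uniformly spaced distances through the gaussian weight-distance curve of equation 3.20. A sketch with an illustrative S = 10σ; the grid and bin sizes are arbitrary choices of ours.

```python
import numpy as np

sigma, S = 1.0, 10.0                       # S = 10 * sigma, one Figure 5 case
d = np.linspace(0, S / 2, 200_000)         # distances uniform over half of S
w = (np.sqrt(2) / 2) * np.exp(-d**2 / (4 * sigma**2))   # equation 3.20

hist, edges = np.histogram(w, bins=20)
print(hist / hist.sum())   # U shaped: large mass near 0 and near sqrt(2)/2
```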
Following equation 3.16, we can invert and differentiate equation 3.20 to give a weight density function

f_W(x) = \frac{2\sigma}{S\, x \sqrt{ -\ln(\sqrt{2}\, x) }}, \qquad \frac{\sqrt{2}}{2} \exp\left( -\frac{S^2}{16\sigma^2} \right) < x < \frac{\sqrt{2}}{2},  (3.21)
which is (as we would by now expect) U shaped (see Figure 5).

3.4.3 Stochastic binary neurons. Returning to the case of discrete time and binary firing, what if the firing functions are stochastic rather than deterministic? If we have firing functions of the form P(z_i(t) = 1) = f_i(t), then the situation is the same as that which develops out of equation 3.2. The analysis is effectively the same as for the continuous cases, though the calculations now involve finite sums and a discrete space of possibilities. The case of uniform firing (i.e., local context units) is covered by equation 3.8 and gives rise, via approximation, to a uniform weight distribution (see equations 3.11a–c). For alternative stochastic firing distributions (such as mean peaked, decaying), we could approximate the firing function with a continuous function and use the techniques of the previous two sections to derive a continuous weight distribution that approximates the discrete one. Thus, we would again expect a U-shaped weight distribution for a mean-peaked firing function such as a discretized gaussian.

3.4.4 A two-dimensional world. The assumption that the distance between place cell centers is uniformly distributed follows from the assumption that place cells are uniformly distributed (e.g., Approximation 2) only in a one-dimensional world. In a two-dimensional world, the number of place fields farther from a central locus is greater than the number closer to it. Briefly, suppose that place cells are uniformly distributed on a planar circle with radius R, and consider a particular place cell at the center of the circle. We want to specify the distribution of the distances between the place cell at the center of the circle and the other place cells in the circle (we use the center place cell here and assume that this distribution would be the same for all other place cells; this is strictly true only if there are no end effects). Referring to the distance variable as d, we have P(d ≤ t) = πt²/πR² = t²/R², which implies a density function f_d(t) = 2t/R². The density of the distribution actually increases in direct proportion to the distance.

This modification alters the derivation of weight density given by equation 3.16. Instead,

P(0 < w_{ij} \le t) = P(0 < g(d_{ij}) \le t) = P(g^{-1}(t) \le d_{ij} < g^{-1}(0))
= \int_{g^{-1}(t)}^{g^{-1}(0)} \frac{2x}{R^2}\, dx = -\frac{2}{R^2} \int_{0}^{t} g^{-1}(u)\, \frac{dg^{-1}(u)}{du}\, du,  (3.22)
Figure 5: Weight analysis for the gaussian firing rate function, equation 3.17. The numerical weight distributions were computed by constructing a set of distances evenly spaced between 0 and S/2, where S was the sequence length. This set was transformed by the weight-distance function generated by the middle equation in equation 3.20; the integrals were computed numerically. The numerical histograms were constructed from this transformed set. The analytical histograms were constructed by integrating the weight density function specified by equation 3.21 over the bin widths. Comparisons were generated separately for sequence length (S) values of 5σ , 10σ , and 20σ . Note that the scale on the abscissa differs for each histogram.
using the substitution u = g(x), where g is a strictly decreasing weight-distance function (again, if g is not strictly decreasing everywhere, the weight distribution would need to be put together from partitions). The weight density function in this two-dimensional case is thus

f_w(x) = -\frac{2}{R^2}\, g^{-1}(x)\, \frac{dg^{-1}(x)}{dx},

which differs from the density in the one-dimensional case, equation 3.16, by the factor g^{-1}(x). This difference reflects the fact that more place cells are at farther distances than at closer ones if place cells are uniformly distributed in a two-dimensional world.

Extending this analysis to the cognitive mapping problem requires specifying the firing function of a place cell in two dimensions, deriving the weight-distance function via the integration techniques described in the previous two sections, and then using equation 3.22. Due to the great diversity of cell firing functions and weight-distance functions that have been measured or hypothesized for hippocampal place cells (Blum & Abbott, 1996; O'Keefe & Burgess, 1996; Burgess & O'Keefe, 1996; Touretzky, Redish, & Wan, 1993; Tsodyks & Sejnowski, 1995; Muller, Kubie, & Saypoff, 1991; Muller & Stead, 1996), an explicit analysis of the various cases falls outside the scope of this article; still, something can be said in general. The two-dimensional effect can be simply, though crudely, exemplified by a linear weight-distance function (a firing function uniform on a circle gives rise to a weight-distance function that is very nearly linear). Assume the weight-distance function g(d_ij) = M − d_ij if d_ij ≤ M, and 0 otherwise. Then the nonzero synaptic weight density given by equation 3.22, f_w(x) = (2/R²)(M − x), is a linear, decreasing density, in contrast to the uniform distribution that arises in a one-dimensional world.

3.4.5 Summary. The purpose of this section has been to demonstrate that the computational techniques used to quantify the synaptic weight distribution in our hippocampal model generalize, using analogous assumptions and approximations, to cell firing models other than those of deterministic, binary, discrete local context units. While it is useful to understand how these computations generalize, it bears reiterating that this does not imply that the model generalizes. Whether a model can be constructed that evolves a coding similar to any of the cases we have discussed and is well characterized by the assumptions and approximations we use (particularly Approximation 1, that coding is independent of connectivity) is an issue completely divorced from these generalizations. For all the cases analyzed in detail, the essential result of a nongaussian synaptic weight distribution remains. While these distributions were not always uniform, the uniform distribution is a good approximation.

4 Discussion

The purpose of this analysis was to understand how the relationship between the input environment and the distribution of synaptic weights is
affected by two important phenomena: Hebbian synaptic modification and the emergence of the local context unit–based coding scheme. We find that this relationship imposes a distinctive constraint on the distribution of the synaptic weights under the sufficient condition that context units form. The actual distributions are, to a remarkably large extent, independent of the details of input firing patterns. Using equations 3.11a–c, these distributions can instead be predicted as a function of the average context unit lifetime (which depends on more general characteristics of the input environment) with minimal inaccuracy. Even more simply, by using the relationship between average context unit lifetime, E[ℓ], capacity, and activity described by equation 2.3, it is possible to approximate these distributions with even more rudimentary knowledge. That is, if local context codes form and if average activity is known or preset, the distribution is largely independent of the environment, according to equation 3.12.

These surprising results contradict two assumptions underlying our original intuitions concerning the dynamical properties of this network: (1) the expectation of a typical gaussian-type weight distribution arising from a first-approximation assumption of independent neuronal firing, and (2) the general idea that Hebbian weight modification imposes the statistical characteristics of the environment on the synapses.

Our intuitive expectation of a gaussian weight distribution builds on a supposition of independent neuronal firing. As a simplified example, suppose that each neuron fires independently Sa times, so that the average network activity is a. Again, convergence assumptions imply that w_ij will approach P(z_i(t−1) = 1 | z_j(t) = 1). If k_ij is the number of time steps of synchronous firing (like overlap_ij), it follows that

P(k_{ij} = m) = \frac{ \binom{Sa}{m} \binom{S - Sa}{Sa - m} }{ \binom{S}{Sa} };  (4.1)
that is, k_ij has a hypergeometric distribution. But since w_ij = k_ij/Sa, this implies that w_ij is also hypergeometrically distributed, with mean a and variance (1 − a)²/(S − 1). As S → ∞, the central limit theorem implies that the standardized distribution becomes gaussian, not uniform as in our results. Moreover, the distribution converges to an impulse function at the mean, a, as S → ∞ as well, as implied by the variance term. Thus P(w_ij = 0) converges to 0, not to the approximately 1 − 2a of our results. The weight distribution in this formerly intuitive case is therefore quite different from the distribution shown to exist when local context units characterize cell firing. The difference between the two coding schemes that produce these very different distributions stems from a difference in the input environments and the effect of synaptic modification. The case of independent activity patterns is characteristic of a structureless, informationless environment, while local context unit–type activity arises (in the model) from predictable correlations in the sequential repetition of the input.
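The independent-firing null model is easy to sample directly, which makes the contrast with the context unit case vivid. A sketch with illustrative parameters; only the hypergeometric law of equation 4.1 is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
S, a, trials = 2000, 0.05, 100_000   # illustrative sequence length and activity
Sa = int(S * a)                      # each neuron fires Sa times, independently

# Coincidences k_ij between two independent Sa-of-S firing patterns follow the
# hypergeometric law of equation 4.1; w_ij = k_ij / Sa.
k = rng.hypergeometric(ngood=Sa, nbad=S - Sa, nsample=Sa, size=trials)
w = k / Sa

print(w.mean(), w.var())   # mean near a = 0.05; variance near (1-a)^2/(S-1)
print((w == 0).mean())     # nearly zero, unlike the roughly 1 - 2a of context units
```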
Understanding the basis for our earlier, false intuition leads us directly to the issue of biological realism. The world, and the brain, are some hybrid of the two extremes: the one of our initial intuition and the one explained previously. Although the relative contribution of each of these two input types is an open question, the challenge to the information processor will always be to extract whatever signal there is from amid however much noise there is. So it would make sense that the relationship between the synaptic weight structure and coding scheme of a more natural sequence-learning and predicting network, such as is found in the mammalian brain, lies somewhere between the extremes of purely deterministic, uniform, local context unit–based codings and purely noisy, gaussian, independent firing patterns. Thus, we suggest that the next step in the synaptic weight theory presented here requires relaxing some, but not all, of its deterministic rigor.

The fact that the input environments simulated here are less noisy than in more realistic, biological cases makes it all the more surprising that the synaptic weight distribution turns out to be independent of the input environment, on the condition that context units form. As a result, we suspect that some of the more qualitative results of the deterministic case analyzed here are meaningful. Among these results, the large number of synapses that are driven to zero weight is hypothesized to be a general characteristic of these networks. Consequently, the theoretical link between zero weights and low activity developed as equation 3.12, P(w_ij = 0) ∼ 1 − 2a, could provide a unification of two well-known and complementary characteristics of forebrain cortical systems: low connectivity and low activity.

Generalizing the uniform distribution of nonzero synaptic weights is a more complicated question, and this complication formed the basis for the extensions presented in section 3.4. There, we demonstrate how the results extend to the cases of stochastic and continuous firing neurons, using only assumptions and approximations equivalent to those already introduced (see Table 1). The essential result of a nongaussian weight distribution still obtains, but with an additional qualification: the resulting nonzero weight distribution may be either uniform or U shaped, depending on the "place cell" firing function assumed. In both cases, deriving the weight distribution from the firing function, for both analytical and numerical analysis, depends on the firing function in the same straightforward way. Moreover, in the cases we studied, the uniform distribution of weights remained a good approximation even for the distributions predicted to be U shaped.

This leads directly to the question of whether the assumptions outlined in Table 1 are biologically reasonable. Section 3.3 discusses the issues surrounding this question. As that discussion indicates, our position is that all of these approximations are computationally reasonable, with the exception of Approximation 1 (see section 3.3.3), which states, in effect, that a
neuron's role in the coding is independent of its connectivity. The success of this particular approximation in the computational simulations is actually a bit surprising. In this context, an important result of our model is that a self-organizing recurrent model of the CA3 region can develop local context units for which neuronal correlations (i.e., the relationship between two connected neurons, which depends on the respective α's and β's) can be very effectively approximated by an assumption of independence from the variable of connectivity. This result is important because context unit place cell codings are not imposed on our model but rather develop out of a self-organizing process. Therefore, it is surprising that the model's principal source of randomness, the connectivity matrix, can be assumed to have no influence on this process when the weight distribution is computed.

There are several interesting cases, in addition, in which this coding-connectivity independence assumption is necessarily valid and in which we would expect our extended results to hold. First are the models that implicitly presuppose independence between coding and connectivity, by virtue of either full connectivity (e.g., Blum & Abbott, 1996) or externally originated place cells imposed, independently, on a randomly connected network (e.g., Muller & Stead, 1996).

Second, and perhaps of some interest to neurophysiologists, the distributions of weights could just as well have been the distributions of pairwise cell firing, E[z_i z_j]/E[z_j] (and this is true regardless of whether the coding-connectivity independence assumption is valid). The motivation to measure these correlations has begun to be shared by experimentalists, and the distributions published in the recent experimental paper of Hampson, Byrd, Konstantopoulos, Bunn, and Deadwyler (1996; see, e.g., their Figures 3C and 3D) are very similar to the sorts of distributions described here.¹

Finally, we note that the description of synfire chains as models of cortical function suggested by Abeles (1991), and similar lines of thought arising from other researchers, has recently inspired a great deal of research on sequence-producing recurrent neural networks. The firing patterns in these cortically inspired networks are characterized by subpopulation neuronal groups (synfire links) activating in sequential fashion (Abeles, 1991; Bienenstock, 1995; Hertz & Prügel-Bennett, 1996; Herrmann, Hertz, & Prügel-Bennett, 1995; Griniasty et al., 1993; Amit et al., 1994; Brunel, 1996; Chover, 1996). Thus we believe our results are not limited to hippocampal models. Rather, they may be applied to a variety of models currently being studied in the computational literature.
1. Hampson et al. refer to these as cross-correlations. They measure the distribution of E[z_i z_j] rather than the E[z_i z_j]/E[z_j] distribution we calculate. The difference between the distributions, a factor of 1/E[z_j], is essentially one of scale (our measure is scaled by the average activity of the postsynaptic neuron), not of shape.
Appendix

The purpose of this appendix is to make explicit the calculations that produce P(w_ij = x | ℓ_i, ℓ_j). To do this calculation, the (ℓ_i, ℓ_j) space was partitioned into three cases: ℓ_i = ℓ_j, ℓ_i > ℓ_j, and ℓ_i < ℓ_j. Within these partitions, we subpartition the cases as a function of starting point and lifetime, in order to simplify the min and max functions in the overlap_ij definition; this is made easier if we consider the relationship separately depending on which context unit (the pre- or the post-) starts firing first. Finally, these subpartitions are reduced to joint probabilities and then unified to remove the subpartitions. Among the three cases, the conditional weight distribution is structurally very similar, and so we are able to unify them into a single function. This is the result presented as equation 3.8.

A.1 Case I: ℓ_i = ℓ_j = ℓ. In this case equation 3.7 reduces to

\mathrm{overlap}_{ij} = \max\{\min\{\alpha_i + \ell + 1, \alpha_j + \ell\} - \max\{\alpha_i + 1, \alpha_j\}, 0\},  (A.1)
in which case we can further partition the possibilities into Case Ia, when the presynaptic context unit starts firing before the postsynaptic one, and vice versa in Case Ib.

A.2 Case Ia: α_j ≥ α_i + 1. Then max{α_i + 1, α_j} = α_j and min{α_i + ℓ + 1, α_j + ℓ} = α_i + ℓ + 1. This gives overlap_ij = max{α_i + ℓ + 1 − α_j, 0} and allows us to conclude:

P(\mathrm{overlap}_{ij} = 0, \alpha_j \ge \alpha_i + 1) = P(\alpha_i + \ell + 1 - \alpha_j \le 0, \alpha_j \ge \alpha_i + 1)  (A.2a)

P(\mathrm{overlap}_{ij} = \ell, \alpha_j \ge \alpha_i + 1) = P(\alpha_j = \alpha_i + 1, \alpha_j \ge \alpha_i + 1)  (A.2b)

P(\mathrm{overlap}_{ij} = x, \alpha_j \ge \alpha_i + 1) = P(\alpha_i + \ell + 1 - \alpha_j = x, \alpha_j \ge \alpha_i + 1), \quad 0 < x < \ell  (A.2c)
for the cases of no overlap, complete overlap, and partial overlap, respectively.

A.3 Case Ib: α_j < α_i + 1. Then max{α_i + 1, α_j} = α_i + 1 and min{α_i + ℓ + 1, α_j + ℓ} = α_j + ℓ. So overlap_ij = max{α_j + ℓ − α_i − 1, 0}, and, once again, we have

P(\mathrm{overlap}_{ij} = 0, \alpha_j < \alpha_i + 1) = P(\alpha_j + \ell - \alpha_i - 1 \le 0, \alpha_j < \alpha_i + 1)  (A.3a)

P(\mathrm{overlap}_{ij} = \ell, \alpha_j < \alpha_i + 1) = P(\alpha_j = \alpha_i + 1, \alpha_j < \alpha_i + 1) = 0  (A.3b)

P(\mathrm{overlap}_{ij} = x, \alpha_j < \alpha_i + 1) = P(\alpha_j + \ell - \alpha_i - 1 = x, \alpha_j < \alpha_i + 1), \quad 0 < x < \ell  (A.3c)
for the cases of no overlap, complete overlap, and partial overlap, respectively.

Since Cases Ia and Ib partition the event space, we can generate the overlap_ij distribution by combining equations A.2a–c and A.3a–c. The calculation of P(overlap_ij = 0) is presented as an example:

P(\mathrm{overlap}_{ij} = 0) = P(\alpha_j \le \alpha_i - \ell + 1, \alpha_j < \alpha_i + 1) + P(\alpha_j \ge \alpha_i + \ell + 1, \alpha_j \ge \alpha_i + 1)
= P(\alpha_j \le \alpha_i - \ell + 1) + P(\alpha_j \ge \alpha_i + \ell + 1)
= 1 - P(\alpha_i - \ell + 1 < \alpha_j < \alpha_i + \ell + 1).  (A.4)
Equation A.4, as well as its analog for the other cases, can be computed by assuming independence of α_i and α_j and invoking the assumption of a uniform distribution of α's (so that, explicitly, P(α_i = t) = 1/(S − ℓ_i + 1) if 1 ≤ t ≤ S − ℓ_i + 1):

P(\mathrm{overlap}_{ij} = 0) = 1 - P(\alpha_i - \ell + 1 < \alpha_j < \alpha_i + \ell + 1)

[which we break down by again partitioning into conditional relationships]

= 1 - \sum_{x=1}^{S-\ell+1} \sum_{y=1}^{S-\ell+1} P(\alpha_i - \ell + 1 < \alpha_j < \alpha_i + \ell + 1 \mid \alpha_i = x, \alpha_j = y)\, P(\alpha_i = x, \alpha_j = y)

[Since x and y are constants (in time-step units), P(α_i − ℓ + 1 < α_j < α_i + ℓ + 1 | α_i = x, α_j = y) is an indicator function: it equals 1 if and only if x − ℓ + 1 < y < x + ℓ + 1, provided that P(α_i = x) > 0 and P(α_j = y) > 0; otherwise, it takes the value 0. We rewrite using the notation I(A) for the indicator function of the event A and apply the independence assumption for α_i and α_j.]

= 1 - \sum_{x=1}^{S-\ell+1} \sum_{y=1}^{S-\ell+1} I(x - \ell + 1 < y < x + \ell + 1)\, P(\alpha_i = x)\, P(\alpha_j = y)

= 1 - \sum_{x=1}^{S-\ell+1} \sum_{y=1}^{S-\ell+1} I(x - \ell + 1 < y < x + \ell + 1)\, \frac{1}{S - \ell + 1} \cdot \frac{1}{S - \ell + 1}

[The calculation of Σ_y I(x − ℓ + 1 < y < x + ℓ + 1) here is complicated by the end-effect situation. In the ideal case, the event {x − ℓ + 1 < y < x + ℓ + 1} would be measurable without respect to the particular value that x takes, as would appear from the event itself. However, the probability distribution of α_j = y is additionally constrained by the condition 1 ≤ α_j ≤ S − ℓ + 1, and this complicates the calculation. We avoided complicating the theory by ignoring this possibility, thus assuming ℓ < x < S − ℓ, or, effectively, that the presynaptic context unit is not near the beginning or the end of the sequence. In light of this assumption, and noting that x is a constant within the summation, Σ_y I(x − ℓ + 1 < y < x + ℓ + 1) amounts to a count of the number of values of y satisfying x − ℓ + 1 < y < x + ℓ + 1 for any constant x, and this is just 2ℓ − 1.]

= 1 - \left( \frac{1}{S - \ell + 1} \right)^2 \sum_{x=1}^{S-\ell+1} (2\ell - 1)

= 1 - \frac{2\ell - 1}{S - \ell + 1}

= \frac{S - 3\ell + 2}{S - \ell + 1}.  (A.5)
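Equation A.5 can be checked by brute-force enumeration over all start-point pairs. Note that the exact count retains the end effects that Approximation 3 discards, so the two numbers agree only up to a boundary term that shrinks as S grows. A sketch:

```python
from fractions import Fraction
from itertools import product

def p_zero_overlap(l, S):
    """Exact P(overlap_ij = 0) for l_i = l_j = l, with alpha_i and alpha_j
    independent and uniform on 1 .. S - l + 1 (end effects included)."""
    starts = range(1, S - l + 2)
    zero = total = 0
    for a_i, a_j in product(starts, starts):
        total += 1
        ov = max(min(a_i + l + 1, a_j + l) - max(a_i + 1, a_j), 0)  # eq. 3.7
        zero += (ov == 0)
    return Fraction(zero, total)

l, S = 3, 40
print(p_zero_overlap(l, S))                  # exact: 1261/1444, about 0.873
print(Fraction(S - 3 * l + 2, S - l + 1))    # equation A.5: 33/38, about 0.868
```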
The procedure employed to calculate P(overlapij = 0 | `) above applies generically for the remaining cases 0 < overlapij ≤ `. Given this conditional probability, it is a simple matter to compute P(wij | `) with equation 3.6. A.4 Case II: `i > `j . The procedure for the remaining cases is generically similar to Case I, so here we only point out the differences and show the results of the calculations. The most significant characteristic distinguishing Cases II and III from Case I is that the information contained in `i and `j cannot be simplified to a single variable; a conditional probability can only be generated for both `i and `j . This affects the indices of summation in the analogous calculation producing equation A.5 as well as the defining terms of the end-effect error. However, in the former case, the indices are cancelled out by the (αi , αj ) independence assumption, and, in the latter, the end error can be assumed away as before. The overlapij can be calculated from (αi , βi ) and (αj , βj ) as above, and the same computations applied:
P(w_ij = x/ℓ_j | ℓ_i, ℓ_j) =
    (S − 2ℓ_j − ℓ_i + 2)/(S − ℓ_j + 1),   x = 0
    2/(S − ℓ_j + 1),                      0 < x < ℓ_j
    (ℓ_i − ℓ_j + 1)/(S − ℓ_j + 1),        x = ℓ_j
(A.6)
A.5 Case III: ℓ_i < ℓ_j. The same reasoning applies if ℓ_i < ℓ_j. This is guaranteed by the symmetry of the ℓ_i < ℓ_j and ℓ_i > ℓ_j cases if, as we have assumed, the two variables are independent. In this case, however, since 0 < overlap_ij ≤ ℓ_i, it follows that 0 < w_ij ≤ ℓ_i/ℓ_j < 1, and thus P(w_ij = 1) = 0 necessarily:
P(w_ij = x/ℓ_j | ℓ_i, ℓ_j) =
    (S − 2ℓ_j − ℓ_i + 2)/(S − ℓ_j + 1),   x = 0
    2/(S − ℓ_j + 1),                      0 < x < ℓ_i
    (ℓ_j − ℓ_i + 1)/(S − ℓ_j + 1),        x = ℓ_i
(A.7)
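As a quick consistency check (again our own sketch, not the authors' code; `case_II_masses` is a hypothetical helper), the piecewise masses of equations A.6 and A.7 can be verified to sum to one in exact arithmetic. Case III follows by swapping the roles of ℓ_i and ℓ_j in the boundary mass and running the middle range up to ℓ_i.

```python
# Sanity check: the Case II distribution (equation A.6) is normalized.
from fractions import Fraction

def case_II_masses(S, l_i, l_j):
    """Probability masses of P(w_ij = x / l_j | l_i, l_j), for x = 0..l_j,
    under Case II (l_i > l_j)."""
    d = Fraction(1, S - l_j + 1)
    masses = [Fraction(S - 2 * l_j - l_i + 2) * d]   # x = 0
    masses += [2 * d] * (l_j - 1)                    # 0 < x < l_j
    masses += [Fraction(l_i - l_j + 1) * d]          # x = l_j
    return masses

S, l_i, l_j = 120, 9, 6
assert sum(case_II_masses(S, l_i, l_j)) == 1         # exact normalization
```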
Acknowledgments

This work was supported by NIH MH48161 and MH00622, EPRI RP803008 and Pittsburgh Supercomputing Center Grant BNS950001P to WBL, and by the Department of Neurosurgery, John A. Jane, chairman. We thank Xiangbao Wu for his patient advice and assistance with the simulations and Nancy L. Desmond for her assistance with the manuscript.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J., Brunel, N., & Tsodyks, M. V. (1994). Correlations of cortical Hebbian reverberations: Theory versus experiment. Journal of Neuroscience, 14(11), 6435–6445.
Bienenstock, E. (1995). A model of neocortex. Network, 6(2), 179–224.
Blum, K. I., & Abbott, L. F. (1996). A model of spatial map formation in the hippocampus of the rat. Neural Computation, 8(1), 85–93.
Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation, 8(8), 1677–1710.
Burgess, N., & O'Keefe, J. (1996). Neuronal computations underlying the firing of place cells and their role in navigation. Hippocampus, 6(6), 749–762.
Chover, J. (1996). Sequential recall. Unpublished manuscript.
Geman, S. (1979). Some averaging and stability results for random differential equations. SIAM J. Appl. Math., 36(1), 86–105.
Gray, J. A. (1982). The neuropsychology of anxiety: An enquiry into the functions of the septo-hippocampal system. New York: Oxford University Press.
Griniasty, M., Tsodyks, M. V., & Amit, D. J. (1993). Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Computation, 5(1), 1–17.
Grossberg, S. (1982). Studies of mind and brain: Neural principles of learning, perception, development, cognition, and motor control. Boston: D. Reidel.
Hampson, R. E., Byrd, D. R., Konstantopoulos, J. K., Bunn, T., & Deadwyler, S. A. (1996). Hippocampal place fields: Relationship between degree of field overlap and cross-correlations within ensembles of hippocampal neurons. Hippocampus, 6, 281–293.
Herrmann, M., Hertz, J., & Prügel-Bennett, A. (1995). Analysis of synfire chains. Network, 6, 403–414.
Hertz, J., & Prügel-Bennett, A. (1996). Learning short synfire chains by self-organization. Network, 7(2), 357–363.
Hirsh, R. (1974). The hippocampus and contextual retrieval from memory. Behav. Biol., 12, 421–444.
Kesner, R. P., & Hardy, J. D. (1983). Long-term memory for contextual attributes: Dissociation of amygdala and hippocampus. Behav. Brain Res., 8, 139–149.
Levy, W. B. (1982). Associative encoding at synapses. In Proc. Fourth Ann. Conf. Cogn. Sci. Soc. (pp. 135–136).
Levy, W. B. (1989). A computational approach to hippocampal function. In R. D. Hawkins & G. H. Bower (Eds.), Computational modeling of learning in simple neural systems (pp. 243–305). Orlando, FL: Academic Press.
Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6(6), 579–590.
Levy, W. B., & Steward, O. (1979). Synapses as associative memory elements in the hippocampal formation. Brain Res., 175, 233–245.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8(4), 791–797.
Levy, W. B., & Wu, X. B. (1996). The relationship of local context codes to sequence length memory capacity. Network, 7(2), 371–382.
Levy, W. B., & Wu, X. B. (1997). A simple, biologically motivated neural network solves the transitive inference problem. In Proceedings of the 1997 International Conference on Neural Networks, Vol. 1 (pp. 368–371).
Levy, W. B., Wu, X. B., & Baxter, R. A. (1995). Unification of hippocampal function via computational/encoding considerations. In Proceedings of the Third Workshop: From Biology to High-Energy Physics. Intl. J. Neural Sys., Vol. 6 (Suppl.) (pp. 71–80).
Levy, W. B., Wu, X. B., & Tyrcha, J. M. (1996). Solving the transverse patterning problem by learning context present: A special role for input codes. In INNS World Congress on Neural Networks (pp. 1305–1309). Hillsdale, NJ: Erlbaum.
McClelland, J., & Goddard, N. (1996). Considerations arising from a complementary learning systems perspective on hippocampus and neocortex. Hippocampus, 6(6), 654–665.
Minai, A. A., Barrows, G. L., & Levy, W. B. (1994). Disambiguation of pattern sequences with recurrent networks. In INNS World Congress on Neural Networks (Vol. 4, pp. 176–181). Hillsdale, NJ: Erlbaum.
Muller, R. U., Kubie, J. L., & Saypoff, R. (1991). The hippocampus as a cognitive graph (abridged version). Hippocampus, 1(3), 243–246.
Muller, R. U., & Stead, M. (1996). Hippocampal place cells connected by Hebbian synapses can solve spatial problems. Hippocampus, 6(6), 709–719.
Nadel, L., & Willner, J. (1980). Context and conditioning: A place for space. Physiol. Psych., 8(2), 218–228.
O'Keefe, J., & Burgess, N. (1996). Geometric determinants of the place fields of hippocampal neurons. Nature, 381, 425–428.
O'Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Clarendon Press.
Thompson, L. T., & Best, P. J. (1989). Place cells and silent cells in the hippocampus of freely-behaving rats. J. Neurosci., 9(7), 2382–2390.
Touretzky, D. S., Redish, A. D., & Wan, H. S. (1993). Neural representation of space using sinusoidal arrays. Neural Computation, 5(6), 869–884.
Treves, A., & Rolls, E. T. (1992). Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus, 2, 189–199.
Treves, A., & Rolls, E. T. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4(3), 374–391.
Tsodyks, M., & Sejnowski, T. (1995). Associative memory and hippocampal place cells. In Proceedings of the Third Workshop: From Biology to High-Energy Physics. Intl. J. Neural Sys., Vol. 6 (Suppl.) (pp. 81–86).
Voyvodic, J. T. (1996). Cell death in cortical development: How much? Why? So what? Neuron, 16, 693–696.
Wu, X. B., Baxter, R. A., & Levy, W. B. (1996a). Context codes and the effect of noisy learning on a simplified hippocampal CA3 model. Biol. Cybern., 74, 159–165.
Wu, X. B., Tyrcha, J. M., & Levy, W. B. (1996b). A neural network solution to the transverse patterning problem depends on the repetition of the input code. Manuscript submitted for publication.

Received August 22, 1996; accepted April 9, 1997.
Communicated by Guenther Palm and Ramamohan Paturi
State-Dependent Weights for Neural Associative Memories Ravi Kothari Rohit Lotlikar Marc Cahay Artificial Neural Systems Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, Cincinnati, OH 45221-0030, U.S.A.
In this article we study the effect of dynamically modifying the weight matrix on the performance of a neural associative memory. The dynamic modification is implemented by adding, at each step, the outer product of the current state, scaled by a suitable constant η, to the correlation weight matrix. For single-shot synchronous dynamics, we analytically obtain the optimal value of η. Although knowledge of the noise percentage is required for calculating the optimal value of η, a fairly good choice of η can be made even when the amount of noise is not known. Experimental results are provided in support of the analysis. The efficacy of the proposed modification is also experimentally verified for the case of asynchronous updating with transient length > 1.

1 Introduction

Neural networks characterized by information storage and recall are termed associative memories. Such networks, by virtue of their ability to recall information from partial or imprecise information, have been found useful in a number of applications, including error correction, image restoration, and pattern recognition. In the simplest form, an associative memory can be realized using a fully interconnected recurrent network; if the state of such an associative memory at time t is denoted by the column vector x(t) = [x_1(t), x_2(t), . . . , x_N(t)]^T, its dynamics are given by:

u_i(t + 1) = Σ_{j=1}^{N} w_ij x_j(t) − θ_i
x_i(t + 1) = f[u_i(t + 1)],
(1.1)
where w_ij is the connection strength from neuron j to neuron i, θ_i is the threshold for neuron i (often taken as 0), and f(·) is a nonlinear activation function. Proper choice of the weights results in an associative memory where an imprecise pattern vector will settle to a fixed point. Correct recall
is said to have occurred when the fixed point is the desired prototype vector. When the fixed point does not correspond to the desired prototype vector, convergence to a spurious point is said to have occurred. A popular prescription for the weights is based on the Hebbian learning rule (see, e.g., Amari, 1972; Kohonen, 1972; Nakano, 1972; Hopfield, 1982). Assuming there are M prototype patterns {ξ^(µ) : µ = 1, 2, . . . , M}, ξ^(µ) ∈ {−1, +1}^N to be stored, the weight matrix based on the Hebbian learning rule is given by

w_ij = Σ_{µ=1}^{M} ξ_i^(µ) ξ_j^(µ),
(1.2)

with the activation function defined as:

f[u_i(t)] = +1 if u_i(t) ≥ 0;  −1 if u_i(t) < 0.
(1.3)
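A minimal sketch of equations 1.1 through 1.3 may help fix ideas (this is our own illustration, not the authors' code; `step` and all parameter values are hypothetical): Hebbian outer-product weights, zero thresholds, and one synchronous update with the sign activation.

```python
# Hebbian associative memory: weights from eq. 1.2, dynamics from eqs. 1.1/1.3.
import numpy as np

rng = np.random.default_rng(0)
N, M = 400, 30
xi = rng.choice([-1, 1], size=(M, N))   # prototypes xi^(mu)
w = xi.T @ xi                           # w_ij = sum_mu xi_i^(mu) xi_j^(mu)

def step(x, w):
    """One synchronous update (eq. 1.1) with the sign activation (eq. 1.3)."""
    u = w @ x                           # thresholds theta_i taken as 0
    return np.where(u >= 0, 1, -1)

# recall from a 10%-noisy version of prototype 0
x0 = xi[0].copy()
flip = rng.choice(N, size=N // 10, replace=False)
x0[flip] *= -1
print(np.mean(step(x0, w) != xi[0]))    # fraction of bits still in error
```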
Thus, once a stimulus (say, a noisy version of a stored prototype ξ^(ν)) is applied, the network acts as an unforced system and evolves according to equation 1.1, perhaps reaching ξ^(ν) at equilibrium. The capacity of such networks has been analyzed in detail. Specifically, it has been shown that if M < N(1 − 2ρ)²/(4 ln N), then all the stored prototypes are recoverable from a distance ρN with a probability approaching 1 (McEliece, Posner, Rodemich, & Venkatesh, 1987; Amari & Maginu, 1988). The goal of an effective associative memory paradigm is then to increase ρ and M simultaneously for a given N. The concept of multiple encoding has been suggested to improve the capacity of the associative memory (Wang, Cruz, & Mulligan, 1990a). The idea is to increase the contribution of unstable prototypes to the weight matrix; assuming the prototype ξ^(ν) is not a fixed point, formulate the weights as:

w_ij = Σ_{µ=1}^{M} ξ_i^(µ) ξ_j^(µ) + (q − 1) ξ_i^(ν) ξ_j^(ν),
(1.4)
where q is a constant, whose value was experimentally determined at first and later analytically derived (Wang, Cruz, & Mulligan, 1990b). Quite obviously, in making ξ^(ν) a fixed point, one might make another prototype unstable. An alternative method, called dummy augmentation, has also been proposed, which ensures that all prototypes are stable (Wang et al., 1990a). The idea is to augment the prototype vectors by appending additional bits such that the augmented prototype vectors become orthogonal and hence are accurately recalled. However, in light of the fact that the asymptotic estimate for the number of spurious fixed points is approximately 1.0505 · 2^(0.2874N), one has an exponential growth in the
number of spurious fixed points with increasing N (McEliece et al., 1987; Kamp & Hasler, 1990). Consequently, in the presence of noise, one is likely to converge to these spurious fixed points. A more generalized form of equation 1.2 has also been investigated (Peretto, 1988). Here the weight matrix is defined as

w_ij = Σ_{µ=1}^{M} (ξ_i^(µ) ξ_j^(µ) + a ξ_i^(µ) + b ξ_j^(µ) + c),
(1.5)
where a, b, and c are constants. For nonzero values of a, b, c, however, the weight matrix is asymmetrical and admits other attractors, unlike symmetric networks, which admit only fixed points and cycles of length 2 (Kamp & Hasler, 1990). The concept of an exponential correlation associative memory (ECAM) has also been proposed (Chiueh & Goodman, 1991). The evolution of the network is then defined as

x(t + 1) = f[ Σ_µ a^(ξ^(µ)T x(t)) ξ^(µ) ],
(1.6)
where a is a constant and T denotes the transpose. While an exponential growth in capacity is reported, this scheme requires all the patterns to be continuously and individually available, unlike neural associative memories, which represent the patterns in a weight matrix. Once the contribution of patterns to the weight matrix is determined, they are no longer required. In addition, implementation of the ECAM requires an exponential amount of hardware (Chiueh & Goodman, 1991). More recently, the use of a nonmonotonic activation function f, toward increasing the capacity, has been proposed (Morita, 1993). Yet more recent work showed the capacity of such associative memories to be O(N/√(ln N)) (Yanai & Amari, 1996).

In this article, we study the behavior of a matrix associative memory that uses state information to change the weight matrix dynamically. We find that such a memory allows for a much larger capacity and increased tolerance to noise. Analytical and experimental evaluations of the proposed paradigm are presented to support the claim.

2 State-Dependent Weight Matrix

Our motivation for dynamically modifying the weight matrix is based on the observation that one expects an unknown stimulus to converge to ξ^(ν) only if the stimulus was closer to ξ^(ν) than to any other stored prototype. In other words, the stimulus itself carries significant information as to the
desirable attractor. To take advantage of this observation, we define a new weight matrix J as

J(t) = w + w̃(t),
(2.1)
where w is the standard correlation weight matrix with components as determined from equation 1.2 and is computed once and stored, and w̃(t) is a time-varying component of the weight matrix defined as

w̃(t) = η x(t) x(t)^T,
(2.2)
where η is a constant. Intuitively, since the stimulus is closer to the desired prototype, the addition of the outer product of the state should reinforce the weight matrix to favor the desired prototype or fixed points close to the desired prototype. A similar approach¹ for extremely diluted networks (i.e., a significant number of interconnections have been randomly removed) has been previously studied (Engel, Bouten, Komoda, & Serneels, 1990) based on concepts of the statistical mechanics of spin glasses. The authors use the fact that an exact solution for the dynamics is available for extremely diluted networks, since different x_i(t) cannot depend on the same x_i(0) (Derrida, Gardner, & Zippelius, 1987). However, an extension to fully connected models or the determination of the optimal value of η in terms of M, N, and the amount of noise is not computed in this prior study.

In what follows, we analytically determine the optimal value of η to use for correct recall. Not surprisingly, the optimal value of η is dependent on the amount of noise (which is not known a priori) that is present in the stimulus. However, as we show, a fairly good estimate of η can also be obtained based on the known parameters (M and N). Our analysis relies on separating the response of a network to a stimulus into signal and noise components and requiring the signal to exceed the noise in order to bound η. To begin the analysis, consider a stimulus that is a noisy version of a stored prototype, applied to the network at time 0, that is, x(0) = ξ^(ν) + δ, where δ is an N × 1 noise vector and δ_i ∈ {−2, 0, 2}. Using equations 1.1, 2.1, and 2.2, we compute u(1) as:

u(1) = J x(0) = [w + η x(0) x(0)^T] x(0) = w x(0) + η N x(0).
(2.3)

For correct single-shot recall of the ith bit, we must have

0 < u_i(1) ξ_i^(ν).
(2.4)

¹ We thank an anonymous reviewer for bringing this work to our attention.
u_i(1) can be obtained from equation 2.3, and using x(0) = ξ^(ν) + δ, we obtain:

u_i(1) = Σ_j w_ij x_j(0) + ηN x_i(0)
       = Σ_j (Σ_{µ≠ν} ξ_i^(µ) ξ_j^(µ) + ξ_i^(ν) ξ_j^(ν)) (ξ_j^(ν) + δ_j) + ηN (ξ_i^(ν) + δ_i)
       = Σ_j Σ_{µ≠ν} ξ_i^(µ) ξ_j^(µ) (ξ_j^(ν) + δ_j) + N ξ_i^(ν) + Σ_j ξ_i^(ν) ξ_j^(ν) δ_j + ηN (ξ_i^(ν) + δ_i).
(2.5)

Substituting u_i(1) from equation 2.5 into 2.4 and collecting terms, we obtain

0 < [ξ_i^(ν) Σ_j Σ_{µ≠ν} ξ_i^(µ) ξ_j^(µ) (ξ_j^(ν) + δ_j)] + [N + Σ_j ξ_j^(ν) δ_j + ηN + ηN ξ_i^(ν) δ_i],
(2.6)

0 < C_i + Θ_i,
(2.7)
where C_i is the term within the first set of square brackets in equation 2.6, and Θ_i comprises all the remaining terms of the equation. The term C_i is the sum of (M − 1)N independent variables and consequently, by the central limit theorem, has a normal distribution. For the inequality of equation 2.7 to hold, C_i > −Θ_i, and consequently one may obtain the probability of the ith bit being in error, P_e, as

P_e = (1/√(2πσ²)) ∫_{−∞}^{−Θ_i} e^{−t²/2σ²} dt,
(2.8)
where σ² = (M − 1)N ≈ MN is the variance of C_i. To characterize the terms in Θ_i properly, we define ρ as the fraction of bits in the stimulus that are noisy; that is, the stimulus is formed by reversing ρN bits of a prototype pattern. Table 1 summarizes each term in Θ_i. The optimum value of η can now be computed. From Table 1, it is evident that the value of Θ_i depends on the value of δ_i. There are two possibilities: (1) δ_i = +2 or −2, in which case Θ_i is equal to N(1 − 2ρ − η), and (2) δ_i = 0, in which case Θ_i is equal to N(1 − 2ρ + η). These two outcomes have probabilities of ρ and 1 − ρ, respectively. The probability of error, P_e, is given
Table 1: Characteristics of the Terms in Θ_i.

Term   Term of Θ_i          Value                                 Probability
1      N                    N                                     1
2      Σ_j ξ_j^(ν) δ_j      −2ρN                                  1
3      ηN                   ηN                                    1
4      ηN ξ_i^(ν) δ_i       0 if δ_i = 0; −2ηN if δ_i ≠ 0         1 − ρ; ρ
by

P_e = (1/√(2πσ²)) [(1 − ρ) ∫_{−∞}^{−φ_1} e^{−t²/2σ²} dt + ρ ∫_{−∞}^{−φ_2} e^{−t²/2σ²} dt],
(2.9)
where φ_1 ≡ N(1 − 2ρ + η) and φ_2 ≡ N(1 − 2ρ − η). The minimum value of P_e relative to η occurs when dP_e/dη = 0, that is,

dP_e/dη = 0 = (1/√(2πσ²)) [−(1 − ρ) e^{−φ_1²/2σ²} + ρ e^{−φ_2²/2σ²}].
(2.10)
This leads to

(1 − ρ) e^{−φ_1²/2σ²} = ρ e^{−φ_2²/2σ²},
(2.11)
which on simplification leads to the optimal value of η as

η = (σ²/(2N²(1 − 2ρ))) ln((1 − ρ)/ρ) = (M/N) · (1/(2(1 − 2ρ))) ln((1 − ρ)/ρ).
(2.12)
Thus, as long as the dynamic weight adaptation observes the bound on η given by equation 2.12, correct recall will occur. Not surprisingly, equation 2.12 predicts a lower value of η with increasing ρ (noise). Intuitively, one might also have anticipated that if the stimulus is very noisy, one should decrease the contribution of the state to the weight matrix. In addition, for a fixed N and ρ, the value of η is directly proportional to M. Consequently, state-dependent contributions become more desirable when one would like to store more patterns in an associative memory. In the next section, we experimentally obtain the effect of incorporating state-dependent contributions to the weight matrix.
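For readers who want to exercise equations 2.9 and 2.12 numerically, the following sketch (ours, with hypothetical helper names) computes the optimal η and the predicted single-shot bit-error probability, writing the gaussian integral as Φ(z) = (1/2)(1 + erf(z/√2)).

```python
# Numerical companion to equations 2.9 and 2.12.
import math

def eta_opt(M, N, rho):
    """Optimal eta of equation 2.12 (requires the noise fraction rho)."""
    return (M / N) * math.log((1 - rho) / rho) / (2 * (1 - 2 * rho))

def p_error(M, N, rho, eta):
    """Bit-error probability of equation 2.9, with sigma^2 = (M - 1) * N."""
    sigma = math.sqrt((M - 1) * N)
    phi1 = N * (1 - 2 * rho + eta)
    phi2 = N * (1 - 2 * rho - eta)
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (1 - rho) * Phi(-phi1 / sigma) + rho * Phi(-phi2 / sigma)

M, N, rho = 100, 400, 0.1
print(eta_opt(M, N, rho))           # ~0.34
print(p_error(M, N, rho, 0.0))      # traditional memory, eta = 0
print(p_error(M, N, rho, M / N))    # rule-of-thumb eta = M/N (see section 3)
```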
3 Simulations

3.1 Simulations: Synchronous Single-Shot Recall. The first simulation we performed was based on single-shot (one iteration) synchronous (all the neurons are free to change their state simultaneously) dynamics. Consequently, this simulation was in agreement with the conditions under which equation 2.12 was obtained. We varied M, ρ, and η to investigate not only the effect of state-dependent contributions to the weight matrix but also to validate our analysis. We kept N fixed at 400.

Figure 1 shows three plots corresponding to M = 60, 80, and 100, obtained with single-shot synchronous dynamics. Each plot shows the average number of prototype bits (expressed as a percentage) in error for varying values of η and ρ. All the prototypes were randomly generated with P[ξ_i^(µ) = +1] = P[ξ_i^(µ) = −1] = 1/2. Note that M = 60 corresponds to saturation, the maximum number of patterns that could be stored in a traditional associative memory at very low noise conditions. Further, η = 0 corresponds to the traditional associative memory model. For all the cases, it is clear that state-dependent contributions to the weight matrix allow for higher capacity and increased tolerance to noise. For example, when M = 100, one recovers the prototype fairly closely for ρ = 0.1. This corresponds to M = 0.25N, which is much higher than the saturation point of the traditional associative memory of M = 0.15N.

A more subtle point may be observed from Figure 1. For the case M = 100 and ρ = 0.2, approximately 25% of the bits are in error in the best case. This is in fact higher than the amount of noise originally present in the stimulus (ρ = 0.2). However, in the absence of state-dependent contributions to the weight matrix, approximately 35% of the bits are in error (corresponding to η = 0). In fact, with a reasonable choice of η, the performance is always better with the proposed modification than without it. Finally, we observe that, as predicted by equation 2.12, for a given M one finds a decrease in the optimal value of η with increasing ρ. Additionally, for a fixed ρ, the optimum value of η is proportional to M, in accordance with equation 2.12. The concordance between the theoretical optimum value of η and the experimentally obtained optimum value of η is summarized in Table 2.

We also note that determining the theoretically optimal value of η from equation 2.12 requires knowledge of ρ, which is not available a priori. However, from equation 2.12, one may observe that η is always greater than M/N. A fairly good choice when ρ is not known is η = M/N. This choice in fact brings us fairly close to the best performance achievable. To illustrate, obtaining η using η = M/N for the above simulation provides η = 0.15, 0.20, and 0.25 for M = 60, 80, and 100, respectively. Observe from Figure 1 that these values of η provide near-optimal performance.
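A rough reconstruction of this experiment (our own sketch, not the authors' code; all names are hypothetical) is given below; it uses the identity Jx(0) = wx(0) + ηNx(0) from equation 2.3, so the state-dependent matrix never needs to be formed explicitly.

```python
# Single-shot synchronous recall with a state-dependent weight matrix.
import numpy as np

def trial(N, M, rho, eta, rng):
    xi = rng.choice([-1, 1], size=(M, N))        # random prototypes
    w = xi.T @ xi                                # correlation matrix (eq. 1.2)
    x = xi[0].copy()
    flip = rng.choice(N, size=int(rho * N), replace=False)
    x[flip] *= -1                                # reverse rho*N bits
    u = w @ x + eta * N * x                      # J x(0), via eq. 2.3
    x_new = np.where(u >= 0, 1, -1)              # one synchronous update
    return np.mean(x_new != xi[0]) * 100         # percent bits in error

rng = np.random.default_rng(1)
N, M, rho = 400, 100, 0.1
for eta in (0.0, M / N, 0.34):                   # traditional, rule of thumb, ~optimal
    errs = [trial(N, M, rho, eta, rng) for _ in range(20)]
    print(eta, np.mean(errs))
```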
Communicated by Mikhail Tsodyks
The Role of the Hippocampus in Solving the Morris Water Maze A. David Redish David S. Touretzky Computer Science Department and Center for the Neural Basis of Cognition, Carnegie Mellon University, Pittsburgh, PA 15213-3891, U.S.A.
We suggest that the hippocampus plays two roles that allow rodents to solve the hidden-platform water maze: self-localization and route replay. When an animal explores an environment such as the water maze, the combination of place fields and correlational (Hebbian) long-term potentiation produces a weight matrix in the CA3 recurrent collaterals such that cells with overlapping place fields are more strongly interconnected than cells with nonoverlapping fields. When combined with global inhibition, this forms an attractor with coherent representations of position as stable states. When biased by local view information, this allows the animal to determine its position relative to the goal when it returns to the environment. We call this self-localization. When an animal traces specific routes within an environment, the weights in the CA3 recurrent collaterals become asymmetric. We show that this stores these routes in the recurrent collaterals. When primed with noise in the absence of sensory input, a coherent representation of position still forms in the CA3 population, but then that representation drifts, retracing a route. We show that these two mechanisms can coexist and form a basis for memory consolidation, explaining the anterograde and limited retrograde amnesia seen following hippocampal lesions.

1 Amnesia Following Hippocampal Lesions

Hippocampal lesions in humans produce devastating impairments in declarative memory (memories of specific items, events, or episodes) (Scoville & Milner, 1957; Squire & Zola-Morgan, 1988; Cohen & Eichenbaum, 1993; Zola-Morgan & Squire, 1993). Although these patients perform immediate recall tasks normally, they are strongly impaired at times greater than a few minutes. In addition to the anterograde amnesia, these amnesias extend backward in time to shortly before the occurrence of the lesion, but they leave early memories intact (Squire & Zola-Morgan, 1988). Similar results have been seen in nonhuman primates with hippocampal lesions (Squire & Zola-Morgan, 1988; Zola-Morgan & Squire, 1993).
The theory proposed to explain these data is that the hippocampus serves as a temporary store for memory (Marr, 1969; Buzsáki, 1989; Cohen & Eichenbaum, 1993; McClelland, McNaughton, & O'Reilly, 1995). However, no models of hippocampal function in specific memory tasks exist; all published models of declarative memory demonstrate storage and retrieval of arbitrary binary vectors (Marr, 1971; Rolls, 1989; Gluck & Myers, 1993; Alvarez & Squire, 1994; Hasselmo & Schnell, 1994; O'Reilly & McClelland, 1994; Levy, 1996; Rolls, 1996). Although these theories can address general principles involved in memory, they cannot address the role of the hippocampus in specific tasks. This makes it difficult to compare their results to real experiments or to generate testable predictions from them.

Anterograde and limited retrograde amnesias after hippocampal lesion are also seen in rats tested in the Morris water maze (Sutherland & Hoesing, 1993). The Morris water maze consists of a submerged platform placed somewhere within a pool of water made opaque with milk or chalk (Morris, Garrud, Rawlins, & O'Keefe, 1982). When placed in this pool, rats try to find a way out; they initially swim randomly until they find the platform and climb out. Normal rats quickly learn the location of the platform: if the platform is removed, the rats search at the place where the platform had been (Morris et al., 1982). Rats with hippocampal lesions cannot learn this task (Morris et al., 1982; Morris, Schenk, Tweedie, & Jarrard, 1990; McDonald & White, 1994). If the rats are trained on the task first and then given a hippocampal lesion 1 week later, they show profound deficits; however, the same lesion 12 weeks after training produces much smaller deficits (Sutherland & Hoesing, 1993). Here, then, is a specific amnesia result that can be modeled in detail.

2 Modeling Memory in the Morris Water Maze

We suggest that rats trained on the Morris water maze use two different mechanisms to find the hidden platform, one locale based and one route based. The two mechanisms can be subdivided into five steps, the first two of which are locale based and the last three route based. These five steps occur in order (we address possible ways to sidestep this in section 4.2):

1. Exploration. The animal familiarizes itself with the environment.

2. Self-localization. Upon reentry into the environment, the animal must determine its location relative to the platform. From this information, it can determine the direction it needs to swim in order to reach the platform.

3. Route learning. When the animal travels along a specific path, routes are stored in the recurrent connections of hippocampal area CA3.

4. Replay of routes during sleep. During sleep, the recent routes are replayed because, when primed with noise, the hippocampal formation settles
to a stable representation of a location, which then drifts along the routes stored in the CA3 recurrent connections. McNaughton, Skaggs, and Wilson (Wilson & McNaughton, 1994; Skaggs & McNaughton, 1996) have reported data supporting this hypothesis: simultaneous extracellular recordings from hippocampal pyramidal cells have shown that cells tend to fire in the same sequence during slow-wave sleep as they did during recent experience in an environment. We discuss this and its implications for the theory in section 2.4.

5. Consolidation. The "dreamed" routes are transferred to long-term storage by a slowly learning cortical network. Anterograde amnesia occurs because long-term memory requires a hippocampus in order to learn the routes. Retrograde amnesia occurs when long-term memory has not been completely trained at the time the hippocampus is lesioned. Once the routes have been stored in long-term memory, the animal can solve the task without a hippocampus.

Our previous work laid out a theory of the role of the hippocampus in navigation (Touretzky & Redish, 1996; Redish & Touretzky, 1997) (see Figure 1). The key components of the expanded theory presented here are as follows:

• Path integration occurs via a loop including the superficial layers of the entorhinal cortex (ECs), the subiculum (Sub), and the parasubiculum (PaS).¹

• Sensory cues (called the local view, but not solely visual) enter the hippocampal formation from high-level sensory association areas, referred to here as HLS.

• The path integration and local view representations are first combined in ECs, but any conflicts are resolved by the recurrent connections in CA3.

• On reentry into a familiar environment, competitive dynamics in the hippocampus allows the system to settle to a coherent place code even with ambiguous sensory cues.² This coherent code resets the path integrator so that multiple experiences of the same environment are compatible with each other.

• During sleep, recurrent connections within the hippocampus force a coherent code to form from noise, but due to asymmetric connection strengths produced during training, the represented location precesses
along the learned routes, effectively replaying routes traversed during the task.

• Slowly learning cortical networks (long term memory, or LTM) can be trained by these "dreams" of routes traveled so that eventually the animal can perform the task without a hippocampus.

¹ Path integration is the ability to return to a starting point, even after a long, circuitous path, using only idiothetic cues (Mittelstaedt & Mittelstaedt, 1980; Gallistel, 1990; Etienne, 1992).
² The place code is coherent if all neural activities are consistent with a representation of the same location in space.

Figure 1: Sensory information (local view) enters the system by the high-level sensory areas (HLS). Path integration (PI) occurs via a loop including the superficial layers of the entorhinal cortex (ECs), the subiculum (Sub), and the parasubiculum (PaS), updated by head direction information in the postsubiculum (PoS). Head direction (HD) is updated via a loop including the anterior thalamic nuclei (ATN). The hippocampus proper (CA3, CA1) serves to reset the path integrator (PI-reset) from local view information on reentry into a familiar environment. During sleep, routes are read out from hippocampus through the deep layers of entorhinal cortex (ECd) into long-term memory (LTM). The goal trajectory (Goal) is calculated in nucleus accumbens (NAcb) and requires information from the subiculum, transmitted via the fornix. Not all anatomical connections are shown, and some structures are omitted.

The theory requires the hippocampus to show two major modes of operation: a storage mode and a recall mode. The hippocampus does show two modes of activity (Vanderwolf & Leung, 1983; Buzsáki, 1989; Stewart & Fox, 1990; Vanderwolf, 1990), seemingly gated by the presence of acetylcholine (ACh) (Vanderwolf, 1990; Huerta & Lisman, 1993; Hasselmo, 1995) arriving from septal input via the fimbria (Stewart & Fox, 1990) and referred to by the very different electroencephalogram (EEG) traces recorded during the
two modes. During motion, in the presence of ACh, the hippocampal EEG shows a 7–12 Hz rhythm called theta; during rest and slow-wave sleep, in the absence of ACh, the hippocampal EEG shows irregular activity, called LIA (large-amplitude irregular activity), characterized by short-duration sharp waves (Vanderwolf, 1971).

We now proceed to detail how our theory accounts for each of the five steps discussed above, reviewing the data supporting the theory and simulation results demonstrating each point. Additional simulation details are given in appendix A.

2.1 Exploration: Learning the Cognitive Graph. We begin by showing that the combination of random exploration and correlational (Hebbian) learning produces a weight matrix in the CA3 recurrent connections that is appropriate for the competitive dynamics needed for self-localization. Following Muller, Kubie, and Saypoff (1991), we call this connection function the cognitive graph: the synaptic efficacies between place cells are inversely related to the distance between the centers of their place fields. As the animal wanders around the environment, cells with nearby place fields are more likely to be coactive than cells with well-separated fields. Combined with correlational long-term potentiation (LTP), in which the synaptic efficacy is increased when both cells are simultaneously active, after a session of wandering around an environment, the intrahippocampal connections will be inversely proportional to the distance between place field centers (Muller et al., 1991).

LTP has been shown extensively in the recurrent connections in CA3 and in the Schaffer collaterals connecting CA3 to CA1 (see Landfield & Deadwyler, 1988). Also supporting this theory are data showing the effect of ACh: while suppressing neuronal transmission in intrahippocampal synapses, ACh enhances LTP in them (Hasselmo & Schnell, 1994). We make the simplifying assumption that ACh shuts off the CA3 recurrent connections completely. Experiments in hippocampal slices show that it diminishes synaptic efficacy across the Schaffer collaterals by approximately 90 percent, while diminishing the efficacy of the perforant path (inputs from ECs) by only approximately 45 percent (Hasselmo & Schnell, 1994). ACh presumably is present during theta mode, while the animal is moving about the environment. Disruption of ACh has been found to shift the hippocampus out of theta, while cholinergic agonists shift the hippocampus into theta mode (Huerta & Lisman, 1993). LTP produced by hippocampal stimulation during theta or at intervals corresponding to the theta frequency is much stronger than similar stimulation during nontheta (LIA) (Larson, Wong, & Lynch, 1986; Larson & Lynch, 1989; Greenstein, Pavlides, & Winson, 1988; Pavlides, Greenstein, Grudman, & Winson, 1988; Huerta & Lisman, 1993).

Simulations. The network used to demonstrate the generation of the cognitive graph consisted of a limited version of the total model presented
in Figure 1. It included HLS, PI, HD, ECs, and CA3/1 components. Specific neuronal model details are given in appendix A. The HD component consisted of a 1D (circular) neural array. At every time step, the currently represented value was calculated by a directional mean, the represented value was updated by the angular velocity, and then a new (idealized) HD representation was generated. This allowed us to simulate the neural interactions between the head direction and other representations without the computational time required to simulate the actual head direction update mechanism. We have previously shown that a postsubiculum-anterior thalamic nuclei head direction model can track real angular velocity sequences accurately (Redish, Elga, & Touretzky, 1996). The PI simulation was similar but used a two-dimensional (2D) (toroidal) neural sheet.

We also simulated the HLS component as a bump on a 2D neural sheet (as in Samsonovich & McNaughton, in press) and assumed that at every point in the environment, the position of the animal was correctly represented by the population. We did this because it is not clear what aspects of the environment are included in the local view; any set of spatial information about landmarks sufficient to localize the animal to a point will do. For some experiments, such as those described in section 2.2, there was more than one peak in the local view representation. This allowed us to ask questions about the ability of the system to handle ambiguous inputs without having to build a complicated, speculative model of the visual system of the rodent. The ECs population was also simulated as a 2D neural sheet receiving input from the HLS and PI components. Because we do not separately simulate CA3 and CA1, we refer to the combined simulated population as CA3/1 but refer to CA3 and CA1 in our discussions of the theory proper. The CA3/1 population consisted of a 2D neural sheet connected to the PI representation by one-to-one connections. A sparse random connection pattern works just as well, but by using a one-to-one pattern, we know the center of the place field for each CA3 neuron.

According to our theory, every time an animal enters an environment, it self-localizes as best it can by a sharp wave (LIA mode). We do not measure EEG in our simulations, but the self-localization sequence begins with a high proportion of the CA3 cells active at low levels, and settles to a small number of highly active cells within approximately 100 ms (see section 2.2). We believe this corresponds to a single sharp wave (Buzsáki, 1989). We thus begin exploration by first placing the simulated animal at a random location and triggering a 100-ms sharp wave. Since this is a novel environment, there are no stored associations in the intrahippocampal connections, and the sharp wave serves only to reset the path integrator to a random point. This random location becomes the reference point for the environment, and the origin for the path integrator coordinate system. The animal then explores the environment by wandering randomly, during which LTP occurs in the intrahippocampal connections.
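The following toy sketch (our own illustration, with made-up parameters and a square arena standing in for the circular one) shows the sense in which random exploration plus Hebbian coactivity yields the cognitive graph: the learned weights correlate negatively with the distance between place-field centers, as in Figure 3.

```python
# Toy cognitive-graph learning: random walk + Hebbian coactivity.
import numpy as np

rng = np.random.default_rng(0)
n = 100                                      # place cells
centers = rng.uniform(-50, 50, size=(n, 2))  # place-field centers (cm)
W = np.zeros((n, n))
pos = np.zeros(2)

for _ in range(3000):                        # ~5 min at 100-ms steps
    pos = np.clip(pos + rng.normal(0, 3, size=2), -50, 50)  # random walk
    d = np.linalg.norm(centers - pos, axis=1)
    F = np.exp(-d ** 2 / (2 * 15.0 ** 2))    # gaussian place fields, sd 15 cm
    W += 0.001 * np.outer(F, F)              # Hebbian coactivity (LTP only)

# learned synaptic weight vs. distance between place-field centers
dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
iu = np.triu_indices(n, 1)
print(np.corrcoef(dist[iu], W[iu])[0, 1])    # strongly negative correlation
```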
Figure 2: Route traveled by simulated rodent while exploring a 100-cm diameter circular environment for 5 minutes. Dots indicate position sampled every 10 seconds. Gray area denotes arena.
Two effects must occur for the animal to have familiarized itself with the environment. First, a mapping from local views (in HLS) to path integrator coordinates (in ECs) must be learned. Second, because local views may be ambiguous, the cognitive graph must be learned in the recurrent connections of CA3. We show that the appropriate connection function appears within the first minutes of exploration.

Figure 2 shows the track of the simulated animal wandering over the environment during the 5 minutes of exploration. The animal has clearly covered the entire environment. Figure 3 shows a scatter plot of the learned synaptic weights as a function of the distance between each pair of units in the CA3/1 population. The synaptic efficacy between two cells is, on average, inversely related to the distance between the centers of their place fields. A similar plot of HLS-to-ECs synapse strengths would show that local view representations (in HLS) have been associated with path integrator coordinates (in ECs).

Although the connection function appears quickly in our simulations, we used very large learning rates in order to minimize computation time. We do not know whether more realistic learning rates would allow the function to be acquired so quickly. If they did, the random trajectories shown by
animals first placed in the water maze (with no knowledge of the goal location) would be sufficient to "explore" the environment.

Figure 3: Scatter plot of learned synaptic weights as a function of distance between pairs of units. Distance is in cm. Line indicates the mean.

2.2 Self-Localization and Place Field Stability. In order to navigate within a familiar environment, an animal must use a consistent representation of position from session to session. Although visual cues can serve to inform the animal of its initial position, if they are ambiguous, there must be a mechanism to settle on a consistent representation of location. We believe intrinsic competitive dynamics force the hippocampus to settle on a coherent code. These dynamics can serve as a disambiguation mechanism and can reproduce the search pattern that gerbils make when faced with ambiguous local cues (Touretzky & Redish, 1996).

We suggest that the competitive dynamics realized in the rodent proceeds as follows: subiculum, parasubiculum, hippocampus, and entorhinal cortex are initially noisy; sensory cues in HLS passed through ECs into the hippocampus proper bias the random firing rates with candidate locations. The recurrent connections in CA3 allow one of these candidate locations to win out, forming a coherent place code in hippocampus. The connections between CA1 and subiculum reset the path integrator to the correct representation of the animal's location in path integrator coordinates. This happens in the course of a single sharp wave during LIA. In our simulations, the place code in CA3/1 is coherent within 50 to 100 ms. Figure 4 shows the first 70 ms of a simulated sharp wave.
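The settling can be caricatured as follows (our own idealization, not the appendix-A model, and all parameters are illustrative): gaussian local excitation, an expansive nonlinearity, and divisive global inhibition drive a noisy sheet to a single coherent bump of activity.

```python
# Caricature of self-localization: from noise to one coherent bump.
import numpy as np

rng = np.random.default_rng(2)
side = 20
ij = np.stack(np.meshgrid(np.arange(side), np.arange(side)), -1).reshape(-1, 2)
d2 = ((ij[:, None, :] - ij[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / (2 * 2.0 ** 2))        # local excitation, sd = 2 cells

F = rng.uniform(0.0, 1.0, side * side)  # noisy initial firing rates
for _ in range(30):
    F = (W @ F) ** 2                    # excitation + expansive nonlinearity
    F /= F.sum()                        # divisive global inhibition

peak = ij[F.argmax()]
print(peak, (F > F.max() / 2).sum())    # one bump; only a handful of cells active
```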
Figure 4: Starting from random noise, a coherent place code forms in less than 50 ms. Plot shows firing rates of CA3/1 place cells. Cells are laid out in a 2D sheet with their locations in the sheet corresponding to the centers of their place fields in the environment. Intensity values have been interpolated for clarity. White indicates high firing rate, black low.
During a sharp wave, place cells do not show normal place fields; many cells are simultaneously active (many more than would normally be active during theta) (Buzsáki, 1989). Because ACh is not present, synaptic efficacy between CA3 cells is presumably at full strength, allowing the system to settle from an initially noisy state to a coherent representation of the animal's location. Once this representation is coherent, the path integrator (in subiculum, receiving strong connections from CA1) is driven by the now coherent representation of location in CA1 and is effectively reset. The animal can now navigate around the environment.

Simulations. The network used to demonstrate self-localization used a similar architecture to that set out in section 2.1. The 2D neural sheets were enlarged to 20 × 20, and the hippocampal simulation was more detailed. We simulated the CA3/1 population as two pools, one excitatory and one inhibitory (labeled CAE and CAI, respectively, in appendix A). The excitatory neurons were interconnected within and between pools by an idealization of the connection function learned in section 2.1 (a gaussian with a standard deviation of 20 cm). We had to use an idealization because our networks are small relative to those in the actual rodent brain. Inhibitory CA3/1 neurons were broadly connected to both the excitatory and inhibitory pools. Essentially, this connection structure corresponds to local excitation and global inhibition.

We measured the ability of this self-localization process to handle ambiguities in the local view by locking three bumps into the HLS representation. This simulates three "candidate positions" in the local view. This ambiguous local view representation is resolved in the CA3/1 representation into a coherent representation of position similar to that shown in Figure 4.

2.3 Route Learning. Given a representation of the animal's current location in the environment and a representation of the current goal, the animal should be able to calculate the direction to take to reach the goal. The nucleus accumbens receives information about current needs and desires from the amygdala (Davis, Rainnie, & Cassell, 1994) and information about current location via the fornix (Witter, Ostendorf, & Groenwegen, 1990) and is optimally situated to perform this function. This function of the nucleus accumbens was first suggested by Mogenson (1984), and a model showing its feasibility has been presented by Brown and Sharp (1995). (See Redish and Touretzky (1997) for a review.)

There are three neurophysiological effects that allow the hippocampus to store routes as the animal travels. First, the postsynaptic potential (PSP) has a nonzero time constant. As an animal travels from the place field of one neuron (say, a) to another (say, that of neuron b), neuron a continues to have an effect on the firing rate of neuron b, but when the animal was in place field a, neuron b did not have an effect on neuron a.

Second, imagine the animal at an instant along the route taken, passing
through a place field centered slightly off the route. This cell will have a firing rate somewhere between its maximum and minimum firing rates. Cells with place fields closer to the animal's position will have higher firing rates, and cells with place fields farther will have lower rates. This means that the output connection function from the neuron in question will be biased asymmetrically toward the path traveled.

Finally, as the animal moves through the place field, the timing of the spikes fired by that cell precesses with respect to the theta rhythm: cells behind the animal fire early in the cycle, while cells ahead of the animal fire late in the cycle (O'Keefe & Recce, 1993; Skaggs, 1995; Skaggs & McNaughton, 1996; Skaggs, McNaughton, Wilson, & Barnes, 1996). Thus the represented position sweeps across the actual position from back to front with each theta cycle. When combined with the biophysical time course of LTP, this phase precession will also favor connections pointing along routes to a goal (Blum & Abbott, 1996).

Simulations. The route-learning simulation consisted of the same network as used in section 2.2, with the addition of a new hippocampal mode. The simulation parameters as described in section 2.2 correspond to LIA mode, while the simulation parameters used for the route-learning simulation correspond to theta mode (see appendix A). We do not explicitly model the nucleus accumbens. Instead we compare the subicular representation at the goal and the current subicular representation, and then simulate travel in a straight line until the animal reaches either the goal or a wall. Figure 5 shows the paths traveled to reach the goal from the four cardinal points. These are the four routes that will be stored in the CA3/1 population.

We model the asymmetric nature of LTP by making the learning rule dependent on the synaptic drive of the presynaptic neuron and the firing rate of the postsynaptic neuron (see equation A.4). The synaptic drive S_i of neuron i is the effect of that neuron on all the neurons on which it synapses divided by the synaptic weight over each synapse (Pinto, Brumberg, Simons, & Ermentrout, 1996; see appendix A). It can be understood as a model of the postsynaptic potential or as a decaying memory of recent firing rates shown by neuron i, with a decay time constant of τ_i.

We do not model phase precession as an emergent result of a complex process; instead we assume that phase precession exists and show that, when combined with the asymmetric temporal nature of LTP, routes are stored in the recurrent connections of the hippocampus. In order to produce phase precession, we derive the preferred phase of each CA3 neuron using the approximation in Figure 6. We then define the firing rate of each neuron at time t as

F_i(t) = e^{−(θ̃_i(t) − θ(t))²/ρ²} · F̂_i(t),
(2.1)
where θ̃_i(t) is the preferred phase of neuron i, θ(t) is the current phase of the theta rhythm, ρ is a constant, and F̂_i(t) is the peak firing rate, attained at θ(t) = θ̃_i(t). We assume a theta rhythm with a frequency of 7 Hz, so θ(t) = (7 · 360° · t) mod 360°, with t in seconds. F̂_i(t) is determined by equation A.2 (see appendix A). This makes the representation of position sweep from behind the animal to in front of it with each theta cycle, as it does in the real animal (O'Keefe & Recce, 1993; Skaggs et al., 1996). We do not claim this as a model of how phase precession is actually generated in the rodent hippocampus, only that it produces a phase precession effect so that routes can be stored in the CA3 recurrent connections.

These effects combine to store routes in the recurrent connections of CA3. They produce a vector field pointing toward the path and then leading to the goal. Figure 7 shows the routes stored by an animal traversing the four paths in Figure 5.

Figure 5: Four routes to the goal location. In order to demonstrate the accuracy of the simulation, the direction to the goal was determined by comparing the representation in subiculum with the prior subicular representations of the goal location. Lines indicate trajectories taken by the simulated animal to reach the goal (indicated by small circle). An x has been drawn at the initial location of the animal in each position. These routes are stored in the CA3/1 structure via LTP. Gray area denotes arena.
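The readout used for Figure 7 can be sketched directly (our own reconstruction; `route_vector_field` is a hypothetical name): for each cell, the arrow runs from its place-field center to the center of mass of its outgoing weights. With the symmetric weights of section 2.1 these arrows are near zero; asymmetric weights from route traversal make them point along the stored paths.

```python
# Vector-field readout of stored routes from a CA3 weight matrix.
import numpy as np

def route_vector_field(W, centers):
    """Arrow per cell: from its place-field center toward the center of
    mass of its outgoing connection weights (row i of W)."""
    mass = W.sum(axis=1, keepdims=True)            # total outgoing weight
    com = (W @ centers) / np.maximum(mass, 1e-12)  # weighted mean of targets
    return np.where(mass > 0, com - centers, 0.0)  # zero if no output weight

# toy check: three cells on a line, each projecting forward to the next,
# so every nonzero arrow should point in the +x direction
centers = np.array([[0.0, 0.0], [10.0, 0.0], [20.0, 0.0]])
W = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
print(route_vector_field(W, centers))
```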
Figure 6: How we simulated phase precession. Let L(t) be a ray originating at the simulated rodent's current position (as represented by the pyramidal cells in CA3), pointing in the direction of its current heading (as represented by the cells in postsubiculum). Let P_i(t) be a vector from the represented position of the rodent to the center of the place field of place cell i, and D_i(t) be the projection of P_i(t) onto L(t). Then the preferred phase of neuron i, θ̃_i(t), is proportional to D_i(t): θ̃_i(t) = K · D_i(t), where K is a scale factor chosen to be small enough that the phase precession will not wrap around (K = 1.2 deg/cm in our simulations). Thus, cells with place fields behind the represented position (in CA3) fire earlier in the theta cycle, and cells ahead of the represented position fire later. We do not claim this as a model of how phase precession is actually generated in the rodent brain, only that it produces a phase precession effect so that routes can be stored in the CA3 recurrent connections.
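The caption's prescription is compact enough to state in a few lines of code (our own rendering, not the simulation source; the tuning width ρ of equation 2.1 is given an illustrative value here):

```python
# Phase precession per Figure 6 and equation 2.1.
import numpy as np

K = 1.2                  # deg/cm, as in the Figure 6 caption
rho = 60.0               # phase-tuning width of eq. 2.1 (deg); illustrative
f_theta = 7.0            # theta frequency (Hz), as in the text

def preferred_phase(pos, heading, center):
    """theta_tilde_i = K * D_i, wrapped to (-180, 180]; negative = early."""
    D = (center - pos) @ heading                   # projection onto heading (cm)
    return (K * D + 180.0) % 360.0 - 180.0

def firing_rate(t, pos, heading, center, peak_rate):
    """Equation 2.1: theta-modulated firing rate of one place cell."""
    theta = (360.0 * f_theta * t) % 360.0          # current theta phase (deg)
    dphi = (preferred_phase(pos, heading, center) - theta + 180.0) % 360.0 - 180.0
    return np.exp(-dphi ** 2 / rho ** 2) * peak_rate

pos, heading = np.array([0.0, 0.0]), np.array([1.0, 0.0])
print(preferred_phase(pos, heading, np.array([-10.0, 0.0])))  # -12 deg: fires early
print(preferred_phase(pos, heading, np.array([10.0, 0.0])))   # +12 deg: fires late
```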
2.4 Replay of Routes During Sleep. When there is sensory input into the hippocampus and the hippocampus is in LIA mode (i.e., the animal is awake and looking around but not moving), sensory cues enter the system via HLS and ECs, and those CA3 cells that are consistent with the current local view will be more active than those that are not. This biases CA3 to settle to a place code consistent with the local view and thus can serve as a self-localization procedure.

On the other hand, when there is no sensory input, this bias will be absent, but due to the recurrent connections in CA3, the hippocampus will still settle into a coherent activity pattern. Due to the asymmetric connections that were stored when the animal traversed the routes to the goal, the place
code will precess along one of these remembered routes. The bias provided by the sensory input should be enough to keep the system from precessing when awake, but in the absence of sensory input (during sleep), there is nothing to stop the precession.

Figure 7: Vector field of routes to a goal stored in the recurrent connections of the model CA3. For each cell j, we calculated the center of mass of the output connection weights, and plotted an arrow from the place field center toward the center of mass. Length of arrow corresponds to linearly scaled distance to center of mass of the output connection weights.

During sleep, when sharp waves occur without sensory input, we expect to see replay of routes. This is shown in Figure 8. Given an initial noisy state, a coherent code forms within half a second, and then over the next few seconds, it drifts along a remembered route. Data supporting a replay of recent experience in hippocampus during sleep were first seen by Pavlides and Winson (1989). They showed that cells with recently visited place fields were more active during REM sleep than other cells whose place fields had not been recently visited. Wilson and
McNaughton (1994) showed that during slow-wave sleep (SWS), cells that showed correlated firing during a session in an environment (because their place fields overlapped) also showed a stronger correlation during sleep immediately after the session.

Figure 8: Replay of routes during LIA without sensory input. A coherent code forms quickly and then slowly drifts to the goal over the subsequent few seconds.

Skaggs and McNaughton (1996) explicitly examined the temporal nature of replay during sharp waves in slow-wave sleep. They defined the temporal bias B_ij between two cells i and j to be the difference between the integrated cross-correlation for the 200 ms after each spike of cell j and the integrated cross-correlation for the 200 ms before each spike of cell j. Thus, if cell i generally fires after cell j rather than before, B_ij will be greater than 0.
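This bias statistic is easy to state operationally; the sketch below (our reading of the definition above, not the original analysis code) counts spikes of cell i in the 200 ms after versus before each spike of cell j.

```python
# Temporal bias B_ij from two spike trains (sorted arrays of times, seconds).
import numpy as np

def temporal_bias(spikes_i, spikes_j, window=0.2):
    """B_ij > 0 when cell i tends to fire after cell j."""
    after = before = 0
    for t in spikes_j:
        after += np.sum((spikes_i > t) & (spikes_i <= t + window))
        before += np.sum((spikes_i >= t - window) & (spikes_i < t))
    return after - before

# cell i trails cell j by ~50 ms, so B_ij should come out positive
sj = np.arange(0.0, 10.0, 0.5)
si = sj + 0.05
print(temporal_bias(si, sj))
```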
They report that the temporal bias during sleep after running on a linear track is strongly correlated with the temporal bias seen while the animal was running on the track. Although this does not conclusively demonstrate replay of routes during sleep, it provides strong circumstantial evidence for the hypothesis.

The original indication that REM sleep affected learning came from REM sleep deprivation studies (Hennevin, Hars, Maho, & Bloch, 1995). Because REM sleep is so distinctive, it is easy to wake a subject up whenever it goes into REM sleep. There is a window during which animals must be able to get REM sleep in order to learn a task such as the water maze (Smith, 1995). However, this REM deprivation window is usually measured in terms of a few hours, not a few weeks (Smith, 1995). Although we have presented the model as if the replaying of routes occurs during a single sharp wave, the time course of the replay presented in Figure 8 is more compatible with a settling process that occurs during REM sleep: a single sharp wave lasts for 100 ms, while a REM sleep episode can last for a few minutes (Vanderwolf, 1990).

During sleep, rats first enter slow-wave sleep. LIA occurs throughout the hippocampus, and irregular slow activity occurs throughout the cortex. With the onset of REM sleep, all of the animal's muscles go limp, and theta appears in hippocampus, while low-voltage fast activity (LVFA) occurs in the cortex. This bout of REM sleep lasts for about 2 minutes (Vanderwolf, 1990), plenty of time to retrace a route. One possibility is that during the final sharp wave of the LIA block, the system settles on a coherent location and then throughout the following bout of REM sleep, the system retraces the route.

Another possibility is suggested by data from Smith (1995) and Kudrimoti et al. (1996). Smith (1995) reviews data that there is a post-REM NMDA-window in which NMDA blockers such as MK-801 will also disrupt learning. For example, animals with REM deprivation windows of 5 to 8 hours posttraining had NMDA windows of 9 to 12 hours posttraining, and animals with REM deprivation windows of 1 to 4 hours posttraining had NMDA windows of 5 to 16 hours posttraining. This is particularly intriguing given recent data from Kudrimoti et al. (1996). They measured the increase in correlation of cells whose place fields overlapped during a recent exploration of an environment during three blocks of LIA in SWS. Blocks 1 and 2 occurred before the first bout of REM sleep, and block 3 occurred soon after it. They found that the correlation in block 1 was higher than the correlation in block 2, but that the correlation during block 3 was larger than block 2 and comparable to block 1. An intriguing possibility is that replay occurs during both SWS and REM. During SWS, information is written out to cortex and partially erased from the hippocampus, while during REM, the hippocampal information is strengthened so that it can be read out again during SWS. More work is clearly needed looking at the representation of location in hippocampus during sleep to explore
whether the replay occurs during slow-wave or REM sleep and what the corresponding roles of slow-wave and REM sleep are.

Figure 9: Believed position of the animal during 20 5-second dreams, sampled at 100-ms intervals. Each point indicates the position represented in the model CA3/1 at each sampled time.

Simulations. The route replay simulations used for Figures 8 and 9 were identical to the self-localization simulation of section 2.2, but the weight matrix used was the asymmetric one produced by LTP in section 2.3. In order to simulate "sleep," the HLS representation was set to zero.

2.5 Consolidation. If we hypothesize a cortical memory capable of storing routes that receives input from HLS and hippocampus and learns during slow-wave sleep, we can explain the anterograde and limited retrograde amnesia that follows hippocampal lesions applied before, soon after, or long after training in the Morris water maze (Sutherland & Hoesing, 1993). Anterograde amnesia occurs because long-term memory requires a hippocampus in order to learn the routes. Retrograde amnesia occurs when long-term memory has not been completely trained at the time the hippocampus is lesioned. The retrograde amnesia reaches back only a limited time because once the routes have been stored in long-term memory, the animal can solve the task without a hippocampus.

Our theory suggests that the hippocampal cognitive map, necessary for learning the task with a hidden platform, is not written out to long-term
memory (LTM), but rather that LTM only stores the needed routes. Once the routes are stored in LTM, the representation maintained in the hippocampus is no longer required to perform the task successfully. In effect, the system learns the direct mapping from local views (HLS) to directions (LTM). This hypothesis requires an output from hippocampus to cortex. There is such a pathway, leading from CA1 into the deep layers of entorhinal cortex (ECd), and then back into a variety of cortical areas (Amaral, 1993; Witter, 1993). Data from ischemic subjects show that damage restricted to CA1 is sufficient to produce the anterograde and limited retrograde amnesia discussed above (Squire & Zola-Morgan, 1988; Zola-Morgan & Squire, 1993; Wood, Mumby, Pinel, & Phillips, 1996). This hypothesis also requires that ECs and ECd be active at different times: as the intrahippocampal connections are learning (during theta), activity should not be transmitted through ECd, but as routes are replayed during sleep (during LIA), ECs should be silent. Consistent with this hypothesis are data showing that while ECs cells are phase locked to the theta rhythm, they are uncorrelated to LIA, and conversely, while ECd cells are uncorrelated to theta, they are correlated to the LIA EEG signal (Chrobak & Buzsáki, 1994). During the self-localization procedure, this theory requires the hippocampus to show LIA also, but ECs cells should fire at a constant rate and would still be uncorrelated to LIA.

Simulations. In order to demonstrate the viability of the consolidation proposal, we simulated LTM as a three-layer feedforward network (HLS, LTMH, and LTMO) and trained the weights (HLS −→ LTMH and LTMH −→ LTMO) by the standard backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986; Hertz, Krogh, & Palmer, 1991). The training regimen for the LTM network consisted of 40 dreams, each of which was a 5-second sequence of CA3/1 representations as the network was allowed to settle from noise and progress along the stored routes as described in the previous section. Each sequence was sampled at 100-ms intervals (see Figure 9). We used these representations to regenerate HLS representations and then generated directions of motion φ(t) by comparing each sample with its successor. We regenerated HLS(t) by taking the represented position in CA3 at time t and setting the HLS representation to the representation of that position. In the animal, we expect this regeneration process to occur through feedback connections between ECd and parietal cortex. Because CA3/1 is a 2D neural sheet, we can generate φ(t) by subtracting the center of mass of the CA3/1 representation at time t from the center of mass at time t + Δt. These input-output pairs ⟨HLS(t), φ(t)⟩ formed the training set for the LTM network. The three-layer LTM network had 400 inputs, corresponding to each of the HLS neurons, 40 hidden units (LTMH), and 2 outputs (LTMO) representing cos(φ) and sin(φ). We do not claim that cortical representations look anything like this or that cortical learning occurs by backpropagation;
we only want to demonstrate that enough information is contained in the dreams to train a cortical network on the desired input-output function.
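As an illustration, here is a minimal sketch (in Python; our construction, since the article gives no code) of how such input-output pairs could be assembled from a single dream sequence. The helper `hls_of_pos`, which regenerates an HLS pattern from a represented position, is hypothetical, and the center of mass is computed on a flat sheet, ignoring the periodic boundaries of the simulated toroidal grids.

```python
import numpy as np

def center_of_mass(act, coords):
    """Activity-weighted mean of the cells' place field centers.
    act: 1-D array of CA3/1 activities; coords: (n, 2) array of centers (cm)."""
    w = act / act.sum()
    return w @ coords

def dream_to_pairs(ca3_seq, coords, hls_of_pos):
    """Turn one dream (a sequence of CA3/1 activity patterns) into
    <HLS(t), phi(t)> training pairs for the LTM network."""
    pairs = []
    for t in range(len(ca3_seq) - 1):
        p_now = center_of_mass(ca3_seq[t], coords)
        p_next = center_of_mass(ca3_seq[t + 1], coords)
        dx, dy = p_next - p_now                        # com(t + dt) - com(t)
        phi = np.arctan2(dy, dx)                       # direction of motion
        target = np.array([np.cos(phi), np.sin(phi)])  # LTM_O targets
        pairs.append((hls_of_pos(p_now), target))
    return pairs
```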
We trained the network on repeated presentations of 40 dreams (1000 epochs, 40,000 total presentations). Each dream was sampled at 1 Hz, so that five input-output pairs were selected from each dream to train the network. This was done because successive pairs from a single dream are very similar, and it helped minimize computation time. In actual animals, we expect the entire dream to be used, and we expect each dream to be unique.

To test the long-term memory, we placed the simulated animal at evenly spaced positions in the environment (5-cm spacing) and measured the direction that would be taken at each location. Figure 10 shows the output vector in LTMO at each location. The vector field shows interpolations between the routes because the HLS representations are distributed over approximately 50 to 100 neurons with similar place field centers. Weights from cells with place field centers situated between two routes end up trained to the vector average of the two routes.

To demonstrate that the network successfully learned the task, we simulated placing the animal at each of the four cardinal locations and moving it in the direction represented in LTMO. As the simulated animal moved through the environment, it was allowed to continue following the direction represented in LTMO (which changed as the representation in HLS changed). If the animal hit the wall, it turned clockwise if the direction represented in LTMO was to the right of the normal vector to the wall and counterclockwise otherwise. It turned in 45-degree increments until it could proceed. This has the effect of moving the animal along the wall. If the length of the LTMO vector was zero, the animal proceeded along the shortest path to the wall until the LTMO vector was nonzero or the animal hit the wall. The simulation ended after 60 seconds or when the animal was within 10 cm of the goal. Figure 10 shows the paths taken from the four starting locations.

The explanation for the amnesias seen after hippocampal lesions is evident in Figure 10. If the hippocampus is lesioned before LTM has been trained, the simulated animal cannot solve the task at all (anterograde amnesia). If the hippocampus is lesioned after LTM has been partially trained, the animal at least approaches the goal but shows severe impairments (retrograde amnesia). Finally, if LTM has been fully trained by hippocampal dreams, the animal can solve the task accurately even with hippocampal lesions (retrograde amnesia is limited).
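A compact sketch of this test procedure, under the same caveats as above (the helpers `ltm_dir`, `blocked`, and `wall_normal` are hypothetical, the swim speed is an arbitrary placeholder, and the zero-length-vector rule is omitted for brevity):

```python
import numpy as np

def run_trial(pos, goal, ltm_dir, blocked, wall_normal,
              speed=20.0, dt=0.1, t_max=60.0):
    """Move along the LTM_O vector, sliding along walls in 45-degree
    steps. ltm_dir(pos) returns the (cos phi, sin phi) output of the
    trained network; blocked(p) tests whether p is outside the arena;
    wall_normal(p) returns the inward normal of the nearest wall."""
    t = 0.0
    while t < t_max and np.linalg.norm(pos - goal) > 10.0:  # goal radius in cm
        v = ltm_dir(pos)
        phi = np.arctan2(v[1], v[0])
        for _ in range(8):                                  # at most one full turn
            step = speed * dt * np.array([np.cos(phi), np.sin(phi)])
            if not blocked(pos + step):
                break
            n = wall_normal(pos)
            cz = n[0] * v[1] - n[1] * v[0]                  # z-component of n x v
            phi += -np.pi / 4 if cz < 0 else np.pi / 4      # clockwise if right of normal
        pos = pos + step
        t += dt
    return pos, t
```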
Figure 10: Memory consolidation in a network trained on dreams: (a) untrained; (b) partially trained; (c) completely trained. (Left) The vector field represented at each location by the long-term memory network. (Right) Paths taken by a simulated animal with hippocampal lesions, navigating using just its cortical route memory. Light gray area denotes arena; dark gray area denotes platform.
3 Predictions

This theory makes a number of predictions. We detail four here:

• Coherency of the place code during LIA. We hypothesize that the self-localization that must occur on reentry into a familiar environment is realized during LIA. This predicts that rodents reentering a familiar environment will show LIA before movement and that the place code will begin in an incoherent state and become coherent through the course of the LIA. Coherency can be rigorously defined as the inverse
width of the confidence interval of the location represented by the place code (one way to compute this is sketched after this list).

• Automaticity of overlearned tasks. We hypothesize that an animal does not store its entire cognitive map in long-term memory; it stores only routes. This predicts a difference in the ability to handle obstacles between normal and hippocampal animals, even after the span of the limited retrograde amnesia (12 weeks in the task above). Normal animals have little trouble planning trajectories to reach goals even with barriers (Tolman, 1948; Poucet, Thinus-Blanc, & Chapuis, 1983; Leonard & McNaughton, 1990; Chapuis & Scardigli, 1993), but this requires a representation of the animal's position in the environment (O'Keefe & Nadel, 1978); thus it requires the cognitive map and the hippocampus. Our simplified model of the path-planning mechanism cannot plan paths around barriers. However, a more realistic model of nucleus accumbens should be able to plan such trajectories. Nevertheless, even a more complex goal memory model capable of planning paths around barriers would require a representation of current position. An animal dependent on a route memory (an animal with a trained LTM but no hippocampus) would be severely impaired when a barrier is placed across the route.

• Navigation without a hippocampus. A number of alternative models of rodent navigation (e.g., Blum & Abbott, 1996; Samsonovich & McNaughton, in press) have suggested that the hippocampus is used in online navigation. Our model suggests that the hippocampus is necessary only for self-localization upon entry into a familiar environment. This means that rodents with hippocampal lesions should be able to perform tasks that do not require switching environments. For example, rodents should demonstrate path integration, such as in the Mittelstaedt and Mittelstaedt (1980) pup retrieval task, even with hippocampal lesions. Presumably an animal in such a situation does not have to relocalize at any point in the task. Normally, rodents in the water maze task are carried back to a home cage to rest between trials or are disoriented by carrying them around the room before placing them at the starting point. Our model suggests that hippocampal animals may be able to perform the Morris water maze task if it is simplified in the following way: place the animal at the goal location (the hidden platform) and then smoothly carry the animal directly to the release point. A number of researchers have explored the Morris water maze with hippocampal and subicular lesions (Morris et al., 1982, 1990; McDonald & White, 1994), but we know of no experiments examining this simplified task.

• Phase precession precedes asymmetric connection strengths. Tsodyks, Skaggs, Sejnowski, & McNaughton (1996) present a model of how
phase precession can be generated by asymmetric connections. Although we do not present a model of how phase precession is generated, our theory requires that phase precession precede the asymmetry of connection strengths. This means that one should see phase precession from the first trajectory through an environment. This does not imply that once the connection strengths are asymmetric, they cannot enhance the phase precession effect. Once the connection strengths are asymmetric, a positive feedback situation exists between the asymmetry of the connection strengths and phase precession, which can enhance both.
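To make the coherency prediction concrete, here is one way the definition in the first prediction could be operationalized; the bootstrap-over-cells estimator is our assumption, since the text does not specify how the confidence interval is to be computed.

```python
import numpy as np

def coherency(act, coords, n_boot=1000, alpha=0.05, rng=None):
    """Coherency of a place code: inverse width of a bootstrap confidence
    interval for the represented location (x-coordinate shown; y is
    analogous). act: cells' firing rates; coords: place field centers."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(act)
    est = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)            # resample cells with replacement
        w = act[idx] / act[idx].sum()          # assumes some cells are active
        est[b] = w @ coords[idx, 0]            # x-coordinate estimate
    lo, hi = np.quantile(est, [alpha / 2, 1 - alpha / 2])
    return 1.0 / (hi - lo)
```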
4 Discussion

4.1 The Dual Role Played by the Hippocampus. We have suggested here that the hippocampus plays two roles:

1. Self-localization. On reentry into a familiar environment, self-localization is accomplished in the course of a sharp wave in the hippocampal formation.

2. Route replay. As the animal traverses routes in a familiar environment, these routes are stored in the recurrent connections of CA3. These routes are replayed during sleep.

These two roles can coexist because when there are sensory signals providing candidate biases to CA3, they are enough to counter the drift produced by the asymmetric connections. In the absence of sensory input, there is nothing to stop the drift, and the representation replays a recent route. Blum and Abbott (1996) predicted that if asymmetric connections are present (as is necessary to produce the route replay effect), then the place fields of cells should shift backward along much-repeated paths. Mehta, Barnes, and McNaughton (in press) have confirmed this prediction: the place fields of cells do shift backward along a much-repeated path as the animal runs along that path. They trained a rat to run in a loop on a rectangular elevated maze and recorded place cells in CA1. They found that the place fields shifted backward over the course of a single session. This would seem to cause a problem for the self-localization role because the place fields would not represent the same location at the end of the run. However, Mehta, Barnes, and McNaughton also showed that while the distance along the track covered by the place field grew by almost 50 percent (approximately 5–7 cm), the center of mass of the field shifted very little (20 percent; approximately 2–3 cm). This duality is what we need for the dual-role hippocampus (self-localization/route storage). Because the number of cells overlapped by this expanded place field increases due to the increase in area covered, longer routes can be stored, but because the center of mass
of the field does not shift much, the self-localization process will not be affected very much.

Figure 11: x-coordinate (on a toroidal map) of the final representation of the parallel relaxation process with (light bars) local view input and symmetric weights only, (medium bars) local view input and both symmetric and asymmetric weights, and (dark bars) both symmetric and asymmetric weights without local view input. y-coordinates (not shown) are similar.

Figure 11 shows the x-coordinate of the CA3/1 representation after a simulated sharp wave occurred under three conditions:

• Light bars. When the simulated animal was at five different locations in an environment but the recurrent connection matrix between CA3 excitatory cells consisted of symmetric weights only. (This is the simulation used in section 2.2.)

• Medium bars. When the simulated animal was at the same five locations but the CA3 recurrent connection matrix consisted of both symmetric and asymmetric weights. (This is the simulation used in section 2.4, but with a nonzero HLS representation.)

• Dark bars. In the absence of sensory input, but when the representation began near each of the five locations above. (This is the simulation used in section 2.4.)

There is no difference at all between the results shown in the light and medium bars, indicating that the local view input is sufficient to hold the representation in place. On the other hand, when there is no sensory input to hold the representation in place, it drifts (see Figure 8). Even though the representation started near one of the five initial locations, in the absence of sensory input, by the end of the process, it had drifted to the goal location
(see also Figure 9). This shows that the two modes are in fact compatible (at least in simulation).

4.2 Why Can't the Animal Learn the Task Without a Hippocampus? Even with hippocampal lesions, animals can still learn simplified versions of the Morris water maze. If, for example, the platform is visible, or the animal always begins from a single starting point, or there is a landmark indicating the direction to swim (Eichenbaum, Stewart, & Morris, 1990; McDonald & White, 1994), the animal can perform at near-normal levels. These simplified versions of the task allow routes to be learned based on single cues (the visible platform, a single motor program, a directional cue). Brown and Sharp (1995) built a model of nucleus accumbens that uses reinforcement learning (Sutton, 1992) to back up the direction-to-goal function from near the goal to locations distant from it (a generic sketch of this backing-up idea appears after the list below). Their model learns to associate directions with representations of position. They use place cells as inputs, but there is no reason that the same process would not work with local view representations. Because animals cannot learn to solve the Morris water maze without a hippocampus, we hypothesize that such a mechanism (as envisioned by Brown and Sharp) does not exist separate from the hippocampus. A number of authors have suggested that there are multiple memory and navigation systems (see, for example, O'Keefe & Nadel, 1978; Squire, 1992; Cohen & Eichenbaum, 1993). The Morris water maze data can be explained by assuming that there are three systems that allow animals to navigate within an environment:

1. Locale system. Animals can use constellations of distal cues to localize themselves on a "cognitive map"; combined with a memory of the goal location on the same map, the animals can navigate to the goal. This system requires a hippocampus. We detailed a model of this system in sections 2.1 and 2.2.

2. Taxon system. Animals can use visual cues to drive navigation directly toward or away from the cue. Evidence exists that this system requires the caudate nucleus (McDonald & White, 1994; Packard & McGaugh, 1992; Packard, 1994). We have not included a model of the taxon system in this article. Because there is no visible cue in the hidden-platform water maze task, it would not help the animal find the platform.

3. Route system. Routes stored in the hippocampus can be written out to the cortex, so that directions necessary to reach a goal are associated with local views. This is the system detailed in section 2.5 (see also section 4.3). This system requires training for each step the animal must take; it cannot learn to associate local views with directions to distant goals without hippocampal help (through route replay).
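The following is a generic dynamic-programming illustration of the backing-up idea mentioned above; it is not a reconstruction of Brown and Sharp's actual reinforcement learning model, which used place cell inputs and a nucleus accumbens architecture. Distance-to-goal estimates are swept outward from the goal, and each location adopts the direction of its best neighbor.

```python
import numpy as np

def back_up_directions(shape, goal):
    """Propagate a direction-to-goal function outward from the goal on a
    grid; each cell ends up pointing toward the neighbor with the
    smallest distance-to-goal estimate."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    dist = np.full(shape, np.inf)
    dist[goal] = 0.0
    for _ in range(shape[0] + shape[1]):        # enough sweeps to converge
        for x in range(shape[0]):
            for y in range(shape[1]):
                for dx, dy in moves:
                    nx, ny = x + dx, y + dy
                    if 0 <= nx < shape[0] and 0 <= ny < shape[1]:
                        dist[x, y] = min(dist[x, y], dist[nx, ny] + 1.0)
    direction = np.zeros(shape + (2,))
    for x in range(shape[0]):
        for y in range(shape[1]):
            options = [(dist[x + dx, y + dy], (dx, dy)) for dx, dy in moves
                       if 0 <= x + dx < shape[0] and 0 <= y + dy < shape[1]]
            _, (dx, dy) = min(options)
            direction[x, y] = (dx, dy)
    return direction
```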
If there were a way to show the animal the route to the goal, it might be possible to train the route system even without a hippocampus. Whishaw, Cassel, and Jarrard (1995) and Schallert, Day, Weisend, and Sutherland (1996) both showed ways to train the route system directly and found that animals could learn to solve the water maze even with hippocampal lesions. Whishaw et al. (1995) trained animals with fimbria-fornix lesions to find a visible platform and then removed the visible platform. These animals concentrated their search where the platform had been. Schallert et al. (1996) used animals with kainate-colchicine hippocampal lesions. The animals were first trained with a large platform that filled almost the entire maze. Once the animals could reach that platform reliably, it was shrunk trial by trial until it was the same size as a typical platform in a water maze task. Again, the animals could learn to solve the water maze without a hippocampus.

4.3 Where Is the Route System? Although the data are not yet conclusive, we suggest that the most likely candidate for the anatomical instantiation of the route system is the pathway from posterior parietal to posterior cingulate cortex. There is substantial evidence that posterior parietal cortex (PPC) supports the local view subsystem (we have labeled this HLS in Figure 1). Rodents with PPC lesions have extreme difficulties with spatial tasks, including the radial maze task, the hidden-platform water maze, and the visible-platform water maze (Kolb & Walkey, 1987). Rodents with PPC lesions are very poor at trajectories; when performing the water maze, they never improve their initial trajectory, even with visual cues indicating the location of the platform (Kolb, 1990). DiMattia and Kesner (1988) report that PPC lesions forced the animals to revert to a random search strategy. McNaughton et al. (1994) report that the firing rates of cells recorded from rodent posterior parietal cortex while an animal traversed a radial arm maze were correlated with whether the animal was turning (some cells were tuned to left turns, some to right turns, some to straight-ahead movements), as well as with whether the animal was progressing toward or away from the center of the maze. Chen, Lin, Green, Barnes, and McNaughton (1994) found that approximately 5 to 10 percent of the cells were sensitive to direction on the radial arm maze, and of those, half required a cue in order to show a directional tuning. These cells fired only if the cue was present. But if the cue was present and then removed, some of the cells continued to show a tuning to direction, as if the cell remembered the direction in which the cue had been. These cells tended to show a broader tuning to direction than normal head direction cells (Chen et al., 1994). If the parietal cortex forms the input to the route system, where is the output? We can identify requirements for the output component of the route system:

• It must receive input from the local view (parietal cortex).
• It must receive input from the hippocampus.

• It must receive input from head direction representations (postsubiculum and the anterior thalamic nuclei).

• It must send efferent projections to motor structures.

• It must be able to represent intended actions and directions of motion.

• Combined lesions of the area with hippocampus should cause impairments in navigation tasks. Although lesions of the area alone might not cause severe impairments, combined lesions should produce devastating results.

Posterior cingulate cortex (also known as retrosplenial cortex or area 29) is the best candidate for the route storage in long-term memory (labeled LTM in Figure 1). It is bidirectionally connected with parietal cortex (Vogt, 1985) and with postsubiculum (Wyss & van Groen, 1992; Finch, 1993), receives input from the anterior thalamic nuclei (Sripanidkulchai & Wyss, 1986; van Groen, Vogt, & Wyss, 1993) and subiculum (Vogt, 1985; Wyss & van Groen, 1992), and sends projections to motor cortex (Finch, 1993). Single cell recordings from posterior cingulate cortex show (rare) correlations to head direction and (more common) correlations to behavior (Chen et al., 1994). Also supporting this hypothesis are data from Sutherland and Hoesing (1993), in which they showed that both cingulate and parietal lesions continue to have devastating effects, even if they occur 12 weeks after training. They suggest that posterior parietal cortex represents stimulus information, and posterior cingulate cortex then transforms this representation back into motor commands.

4.4 Retrograde Amnesia. Anterograde amnesia can occur when a system necessary for memory storage breaks down, and retrograde amnesia can occur when the actual storage site is damaged, but temporally graded retrograde amnesia implies a consolidation mechanism. The role of the hippocampus in consolidation has been hotly debated recently (Nadel, 1991; Squire, 1992; Cohen & Eichenbaum, 1993; Weisend, Astur, & Sutherland, 1996; Koerner, Thomas, Weisend, & Sutherland, 1996; Bohbot, Liu, Thurm, Nadel, & Bures, 1996a). There are two time courses for consolidation that need to be handled separately. Short-term consolidation (STC) occurs over the course of hours (0.5 to 72), and long-term consolidation (LTC) occurs over the course of days to years. STC is most likely to occur as a consequence of the biophysical time course of LTP. If LTP is disrupted—for example, chemically (Smith, 1995), by electroconvulsive shock (Bohbot, Otahal, Liu, Nadel, & Bures, 1996b), or by lesions (Bohbot et al., 1996a)—then memories are lost. But the window in which these kinds of memories can be disrupted is short. (See McGaugh (1966) for a review.)
On the other hand, LTC occurs because memories are stored in a temporary buffer and then written out from that buffer into long-term storage. The hypothesis that the hippocampus serves as the temporary buffer was first proposed by Marr (1971). More recently, the theory has been extended by a number of researchers (Squire, 1992; Cohen & Eichenbaum, 1993; McClelland et al., 1995). In this article, we have tried to show how the hippocampus can store and replay memories that would allow an animal to solve a specific task (i.e., the Morris water maze). Although the existence of STC seems to be well accepted, the existence of LTC has been called into question recently. We therefore review the relevant data on LTC in rodents and refer readers to other reviews of the nonrodent literature (Squire, 1992; Cohen & Eichenbaum, 1993; Rempel-Clower, Zola, Squire, & Amaral, 1996). The Sutherland and Hoesing (1993) result cited in section 1 would seem to lay the question to rest conclusively, but the lesion was made with colchicine (Sutherland, personal communication), which selectively targets dentate gyrus granule cells and tends to spare the CA3 and CA1 fields. Weisend et al. (1996) report that rats with hippocampal lesions show complete retrograde amnesia even out to 36 weeks; however, the lesions included large portions of dorsal subiculum. Dorsal subiculum is likely to play a role in path integration (Redish & Touretzky, 1997), which could affect navigation even in well-consolidated tasks. Bohbot et al. (1996a) found that in their experiments, the "temporal gradient of retrograde amnesia" is better explained as an effect of time since surgery: the lateral ventricles expand as time progresses postsurgery, compressing nearby structures. Koerner et al. (1996) explicitly tested the limited nature of retrograde amnesia in the hidden-platform water maze by comparing (1) animals that learned the task and waited 12 weeks, (2) animals that learned the task and received repetitions each week for 12 weeks, and (3) animals that learned 13 different hidden-platform water mazes over those 13 weeks. After hippocampal lesions (which again encroached on dorsal subiculum), only group 2 showed any performance above chance. Cho, Berracochea, and Jaffard (1993) report clearly time-limited retrograde amnesia in mice: animals with bilateral ibotenate entorhinal cortex lesions showed significant impairments on two-choice discriminations (on an eight-arm radial maze) learned 0.5 to 2 weeks prior to surgery, but no impairment at all for those learned 4 or 8 weeks prior to surgery. In fact, the animals were better at the 4- or 8-week-old discriminations than at the 2-week-old discriminations; normals are better at recently learned discriminations. This suggests a consolidation time measured in weeks, but with entorhinal, not hippocampal, lesions. Winocur (1990) examined social food preference in rats with bilateral electrolytic hippocampal lesions. Normal rats, when paired with a demonstrator rat that has recently eaten a food, prefer that food over other foods. Rats that first acquired the preference (from the demonstrator rat) and then
experienced surgery (a dorsal hippocampal lesion) showed a significant impairment when fewer than 2 days intervened between acquisition and surgery but were normal when 5 or 10 days intervened. On the other hand, Bolhuis, Stewart, and Forrest (1994) examined animals trained on the water maze with two retention intervals between training and surgery: 2 to 4 days and 14 weeks. Animals were given either ibotenate hippocampal or subicular lesions. They found no temporal gradient to retrograde amnesia. Squire and Alvarez (1995) point out that animals were at chance for both time periods, and these floor effects may have precluded detection of a temporal gradient to the retrograde amnesia. Although the extent of the temporal gradient of retrograde amnesia after hippocampal damage is still an open question, we present this model as an explanation of long-term consolidation, such as that described by Sutherland and Hoesing (1993), Winocur (1990), and Cho et al. (1993).

4.5 Related Work. The suggestion that correlational LTP in the CA3 recurrent connections combined with random exploration of an environment would produce a connection function in which the learned synaptic weight is inversely related to the distance between place field centers was first made by Muller et al. (1991). McNaughton, Skaggs, and Wilson (Wilson & McNaughton, 1994; Skaggs & McNaughton, 1996) have shown that after exploring an arena, cells with overlapping place fields in that arena are more likely to be correlated during subsequent sleep than those with nonoverlapping fields. However, none of these authors suggested the competitive dynamics that produces the self-localization mechanism in our explanation. Shen and McNaughton (1997) have examined these competitive dynamics, with a model of attractor dynamics similar to that presented here in which cells with gaussian place fields exhibit correlational (Hebbian) learning. When these two effects are combined with random exploration, a local-excitation weight matrix is formed (see section 2.1). They demonstrate that when presented with random input and allowed to settle to a stable state, cells with recently visited place fields are more active than other cells, corresponding to data from Pavlides and Winson (1989) and Wilson and McNaughton (1994). They did not, however, look at sequences, as we have in section 2.4. Samsonovich and McNaughton (in press) and Zhang (1996) have suggested that path integration can occur by the motion of a hill of activity on a two-dimensional neural sheet, and both demonstrate that dynamics such as we assume will produce a coherent hill from noise. However, both of these models of path integration differ significantly from ours. Zhang makes no claims about the anatomical implementation of the mechanism, and Samsonovich and McNaughton require the hippocampus for online navigation. Our model makes a different anatomical claim about the location of the path integration mechanism (see Redish & Touretzky, 1997, for more detail) and requires the hippocampus for self-localization upon reentry into an environment
only. In addition, Samsonovich and McNaughton explicitly reject the proposition that the hippocampus learns the CA3 recurrent connection strengths. Tsodyks and Sejnowski (1995) examined a one-dimensional version of these dynamics but looked at only a small simulation with a nonuniform distribution of place fields. This produces a situation in which a few of the coherent representations are stronger attractors than the others. The dynamics of the inhomogeneous system are very different from those examined here and would predict a very nonhomogeneous distribution of both place field locations in the environment and place field sizes. Blum and Abbott (1996) have shown that LTP combined with phase precession of place cells along the theta rhythm can also store routes to a goal. However, their theory requires the hippocampus for online navigation and is incompatible with the limited nature of the retrograde amnesia that we set out to explain. Like the Burgess, Recce, and O'Keefe (1994) model, we hypothesize that phase precession is a consequence of an effect that exists prior to the asymmetric connection strengths. This puts our model in direct contrast to that of Tsodyks et al. (1996). Whereas we hypothesize that the asymmetric connections are a partial consequence of the phase precession, Tsodyks et al. (1996) have shown that the phase precession may in fact be generated by the asymmetric connections. These two properties may form a sort of feedback loop, which serves to strengthen the asymmetric connections. Levy and his colleagues (Levy, 1996; Levy & Wu, 1996; Wu, Baxter, & Levy, 1996) have explored sequence learning in recurrent hippocampus-like networks and have demonstrated that cells with temporally extended effects can store and replay sequences of arbitrary binary vectors. They have not examined route learning in realistic tasks. Finally, the hypothesis that the hippocampus serves as a temporary store for memories that are replayed during sleep has been proposed by a number of authors (Marr, 1971; Cohen & Eichenbaum, 1993; McClelland et al., 1995), and Buzsáki (1989) specifically suggests that memories are stored during theta mode and recalled during LIA. But none of these authors addresses navigation, and none is able to model a specific task. By concentrating on a single well-defined task, we have been able to look at all components of a memory-related task in detail, including how the hippocampus is used in solving the task (exploration and self-localization), storing the information needed to transfer the task to long-term memory (route learning), and training long-term memory to perform the task (replay of routes during sleep and consolidation). We believe this is a first step toward understanding the role of the hippocampus in long-term consolidation.
Appendix A: Simulation Details

Our model is shown in Figure 1. For simplicity we assume that each population consists of a 2D array of neurons with evenly spaced place field centers. We also assume a single environment. We leave relaxing these assumptions to future work, but refer readers to the discussion of reference frames in our previous work (Redish & Touretzky, 1997; Touretzky & Redish, 1996).

A.1 Neuronal Model. This neuronal model is more realistic than standard integrate-and-fire models but more abstract than compartmental models. It consists of three equations. For an extensive discussion of the derivation, see Wilson and Cowan (1972) and Pinto et al. (1996).

$$V_i(t) = \gamma_i + \sum_j w_{ij} S_j(t) \tag{A.1}$$

$$F_i(t) = \frac{1 + \tanh(V_i(t))}{2} \tag{A.2}$$

$$\tau_i \frac{dS_i(t)}{dt} = -S_i(t) + F_i(t) \tag{A.3}$$
V_i(t) is proportional to the voltage in neuron i at time t; γ_i is a tonic inhibitory input to neuron i necessary for the dynamics of the system; w_ij is the synaptic weight from neuron j to neuron i; F_i(t) is the firing rate at time t; and S_i(t) is the synaptic drive of neuron i. τ_i is a time constant, proportional to the decay of the postsynaptic potential produced by neuron i in the neurons on which it synapses. Synaptic drive is a nonmeasurable property of a neuron and can be understood as the effect of the neuron on all the neurons on which it synapses, divided by the synaptic weight across each synapse. As has been shown by Pinto et al. (1996), equations A.1 through A.3 form a consistent neuronal model that can be understood as describing either a continuous approximation to a single neuron, with F_i(t) being the probability of firing at time t, or a population of neurons, with V_i(t) being proportional to the average voltage and F_i(t) the fraction of neurons in the population firing a spike at time t. During movement, certain connections within the model are allowed to learn based on an LTP-type Hebbian learning rule in which a connection weight is increased if the product of the synaptic drive of the presynaptic neuron and the firing rate of the postsynaptic neuron is greater than the current weight:

$$\eta \frac{dw_{ij}}{dt} = \left[ -w_{ij} + (S_j \cdot F_i)^2 \right]_+ . \tag{A.4}$$

η is a learning rate, and [·]_+ signifies rectification at 0, so weights can only increase. This is a simplification, and we plan to include LTD in future work. Connections involving the LTM network were used in modeling consolidation and trained separately by the standard backpropagation of error algorithm (Rumelhart et al., 1986; Hertz et al., 1991).
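For concreteness, here is a minimal sketch of these dynamics, integrating equations A.1 through A.3 with the forward Euler method at the 1-ms time step used in all our simulations (see section A.2) and optionally applying the learning rule of equation A.4. The `learn_mask` argument, selecting which connections are plastic, is our addition.

```python
import numpy as np

def simulate(W, gamma, tau, S0, n_steps=500, dt=1.0, eta=None, learn_mask=None):
    """Forward Euler integration of equations A.1-A.3, with the rectified
    Hebbian rule of equation A.4 applied to the connections flagged in
    learn_mask whenever eta is given. Times are in ms."""
    S = S0.copy()
    for _ in range(n_steps):
        V = gamma + W @ S                          # eq. A.1: V_i = gamma_i + sum_j w_ij S_j
        F = 0.5 * (1.0 + np.tanh(V))               # eq. A.2: firing rate
        S = S + (dt / tau) * (-S + F)              # eq. A.3: synaptic drive
        if eta is not None:
            target = np.outer(F, S) ** 2           # (S_j * F_i)^2 at entry (i, j)
            dW = np.clip(-W + target, 0.0, None)   # [.]_+ : weights only increase
            W = W + (dt / eta) * dW * learn_mask   # eq. A.4
    return S, W
```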
A.2 Parameters and Specifics. We used a time step of 1 ms in all simulations.

A.2.1 Exploration. We do not simulate the mechanisms by which these structures track head direction or position but refer readers to already published models that demonstrate the viability of the assumed mechanisms (Redish et al., 1996; Redish & Touretzky, 1997; Samsonovich & McNaughton, in press; Zhang, 1996). In order to sidestep the complexities involved in interpreting visual input, we assume that HLS represents the candidate positions given by the local view as a hill of activity on a two-dimensional toroidal grid (12 × 12 neurons) and the candidate head directions as a hill of activity on a one-dimensional ring (60 neurons). HLS is assumed to be reset correctly to show the entering location and head direction and to track the position and direction accurately with movement through the environment. This simulates the local view. The parameters for the exploration simulation are listed in Tables A.1 and A.2.

Table A.1: Exploration Parameters I: Brain Structures

  Brain Structure                        Neurons   Parameters
  High-level sensory (HLS)               204       12 × 12 (place) + 60 (direction)
  Postsubiculum (PoS)                    60        Tracks head direction
  Subiculum/parasubiculum (Sub/PaS)      144       12 × 12; tracks position by path integration
  Superficial entorhinal cortex (ECs)    144       τ = 1 ms, γ = −2
  CA3/1 pyramidal pool (CAE)             144       12 × 12, τ = 10 ms, γ = −1.5

Table A.2: Exploration Parameters II: Connections

  Input −→ Output     Parameters
  HLS −→ ECs          η = 0.5, σ = 20 cm
  CAE −→ CAE          η = 0.5, σ = 20 cm
  ECs −→ CAE          w = 5.0, one-to-one
  ECs −→ Sub/PaS      w = 1.0, σ = 2 cm
  Sub/PaS −→ ECs      w = 2.0, σ = 2 cm
A.2.2 Self-localization. Connections indicated as one-to-one connect input cells to output cells with corresponding locations on an evenly spaced grid. This was done for computational speed; a random, sparse connection function produces similar results. Other connections are full connections in which the connection weight w_ij from input cell j to output cell i falls off as a gaussian function of the distance between the place field centers of the two cells. The parameters for the self-localization simulation are listed in Tables A.3 and A.4.
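A minimal sketch of how such a gaussian connection matrix could be built (our construction; the text does not specify whether the gaussian is normalized or how the toroidal grids' periodic boundaries enter the distance, so a standard unnormalized gaussian and flat Euclidean distance are assumed):

```python
import numpy as np

def gaussian_weights(centers_out, centers_in, w, sigma):
    """Connection matrix in which w_ij = w * exp(-d_ij^2 / (2 sigma^2)),
    with d_ij the distance between the place field centers of output
    cell i and input cell j (e.g., w = 5.0, sigma = 20 cm for the
    CAE -> CAE connection in Table A.4)."""
    d = np.linalg.norm(centers_out[:, None, :] - centers_in[None, :, :], axis=-1)
    return w * np.exp(-d ** 2 / (2.0 * sigma ** 2))
```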
Table A.3: Self-Localization Parameters I: Brain Structures

  Brain Structure                        Neurons   Parameters
  High-level sensory (HLS)               460       (20 × 20 + 60)
  Postsubiculum (PoS)                    60        Tracks head direction
  Subiculum/parasubiculum (Sub/PaS)      400       20 × 20; tracks position by path integration
  Superficial entorhinal cortex (ECs)    400       τ = 1 ms, γ = −2
  CA3/1 pyramidal pool (CAE)             144       20 × 20, τ = 10 ms, γ = −1.5, ρ = 25°
  CA3/1 inhibitory pool (CAI)            400       τ = 2 ms, γ = −7.5

Table A.4: Self-Localization Parameters II: Connections

  Input −→ Output     Parameters
  HLS −→ ECs          w = 5.0, σ = 20 cm
  HLS −→ PoS          w = 2.0, σ = 20 cm
  ECs −→ CAE          w = 5.0, one-to-one
  ECs −→ Sub/PaS      w = 1.0, σ = 2 cm
  Sub/PaS −→ ECs      w = 2.0, σ = 2 cm
  CAE −→ CAE          w = 5.0, σ = 20 cm
  CAE −→ CAI          w = 16.0, σ = 20 cm
  CAE −→ Sub          w = 5.0, one-to-one
  CAI −→ CAE          w = −8.0, σ = 200 cm
  CAI −→ CAI          w = −12.0, σ = 200 cm

A.2.3 Route learning. The model used in the route-learning simulations was identical to the self-localization model, with the change that the intrahippocampal connections had zero weight and the CAE −→ CAE connection had a learning rate of η = 0.01.

A.2.4 Dreaming. The simulations used for dreaming were identical to those used to demonstrate self-localization, but the weight matrix used was that trained in section 2.3. In order to simulate "sleep," the HLS representation was set to zero.
A.2.5 Consolidation. The HLS −→ LTMH and LTMH −→ LTMO connections were fully connected and trained by the standard backpropagation of error algorithm. See the text for details. The parameters for the backpropagation network used for the consolidation simulation are listed in Table A.5.

Table A.5: Consolidation Parameters (Backpropagation Network Simulating Cortex)

  Name    Role           Neurons
  HLS     Input layer    400
  LTMH    Hidden layer   40
  LTMO    Output layer   2

Acknowledgments

We thank two anonymous reviewers as well as Tad Blair, Bard Ermentrout, Bruce McNaughton, David Pinto, Alexei Samsonovich, Bill Skaggs, Matt Wilson, and Kechen Zhang for helpful discussions. We also acknowledge our fellow attendees at the workshop, From Place Cells to Episodic Memory: Models of the Hippocampus, at the Computational Neuroscience 1996 conference, particularly David August, Howard Eichenbaum, Michael Hasselmo, Chip Levy, and Ali Minai. This work has previously appeared in abstract form (Redish & Touretzky, 1996). This work was supported by National Science Foundation grant IBN-9631336.

References

Alvarez, P., & Squire, L. R. (1994). Memory consolidation and the medial temporal lobe: A simple network model. Proceedings of the National Academy of Sciences, U.S.A., 91, 7041–7045.

Amaral, D. G. (1993). Emerging principles of intrinsic hippocampal organization. Current Opinion in Neurobiology, 3, 225–229.

Blum, K. I., & Abbott, L. F. (1996). A model of spatial map formation in the hippocampus of the rat. Neural Computation, 8(1), 85–93.

Bohbot, V., Liu, Z., Thurm, S. L., Nadel, L., & Bures, J. (1996a). Spatial memory? Never without a hippocampus. Society for Neuroscience Abstracts, 22, 1873.

Bohbot, V., Otahal, P., Liu, Z., Nadel, L., & Bures, J. (1996b). Electroconvulsive shock and lidocaine reveal rapid consolidation of spatial working memory in the water maze. Proceedings of the National Academy of Sciences, U.S.A., 93, 4016–4019.
Bolhuis, J. J., Stewart, C. A., & Forrest, E. M. (1994). Retrograde amnesia and memory reactivation in rats with ibotenate lesions to the hippocampus or subiculum. Quarterly Journal of Experimental Psychology, 47B(2), 129–150.

Brown, M. A., & Sharp, P. E. (1995). Simulation of spatial learning in the Morris water maze by a neural network model of the hippocampal formation and nucleus accumbens. Hippocampus, 5(3), 171–188.

Burgess, N., Recce, M., & O'Keefe, J. (1994). A model of hippocampal function. Neural Networks, 7(6/7), 1065–1081.

Buzsáki, G. (1989). Two-stage model of memory trace formation: A role for "noisy" brain states. Neuroscience, 31(3), 551–570.

Chapuis, N., & Scardigli, P. (1993). Shortcut ability in hamsters (Mesocricetus auratus): The role of environmental and kinesthetic information. Animal Learning and Behavior, 21(3), 255–265.

Chen, L. L., Lin, L. H., Green, E. J., Barnes, C. A., & McNaughton, B. L. (1994). Head-direction cells in the rat posterior cortex: I. Anatomical distribution and behavioral modulation. Experimental Brain Research, 101, 8–23.

Cho, Y. H., Berracochea, D., & Jaffard, R. (1993). Extended temporal gradient for the retrograde and anterograde amnesia produced by ibotenate entorhinal cortex lesions in mice. Journal of Neuroscience, 13(4), 1759–1766.

Chrobak, J. J., & Buzsáki, G. (1994). Selective activation of deep layer (V–VI) retrohippocampal neurons during hippocampal sharp waves in the behaving rat. Journal of Neuroscience, 14(10), 6160–6170.

Cohen, N. J., & Eichenbaum, H. (1993). Memory, amnesia, and the hippocampal system. Cambridge, MA: MIT Press.

Davis, M., Rainnie, D., & Cassell, M. (1994). Neurotransmission in the rat amygdala related to fear and anxiety. Trends in Neurosciences, 17(5), 208–214.

DiMattia, B. V. D., & Kesner, R. P. (1988). Spatial cognitive maps: Differential role of parietal cortex and hippocampal formation. Behavioral Neuroscience, 102(4), 471–480.

Eichenbaum, H., Stewart, C., & Morris, R. G. M. (1990). Hippocampal representation in place learning. Journal of Neuroscience, 10(11), 3531–3542.

Etienne, A. S. (1992). Navigation of a small mammal by dead reckoning and local cues. Current Directions in Psychological Science, 1(2), 48–52.

Finch, D. M. (1993). Hippocampal, subicular, and entorhinal afferents and synaptic integration in rodent cingulate cortex. In B. A. Vogt & M. Gabriel (Eds.), Neurobiology of cingulate cortex and limbic thalamus: A comprehensive handbook (pp. 224–248). Boston: Birkhauser.

Gallistel, C. R. (1990). The organization of learning. Cambridge, MA: MIT Press.

Gluck, M. A., & Myers, C. E. (1993). Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4), 491–516.

Greenstein, Y. J., Pavlides, C., & Winson, J. (1988). Long-term potentiation in the dentate gyrus is preferentially induced at theta rhythm periodicity. Brain Research, 438, 331–334.

Hasselmo, M. E. (1995). Neuromodulation and cortical function: Modeling the physiological basis of behavior. Behavioral and Brain Research, 67(1), 1–27.

Hasselmo, M. E., & Schnell, E. (1994). Laminar selectivity of the cholinergic suppression of synaptic transmission in rat hippocampal region CA1: Computational
modeling and brain slice physiology. Journal of Neuroscience, 14(6), 3898–3914.

Hennevin, E., Hars, B., Maho, C., & Bloch, V. (1995). Processing of learned information in paradoxical sleep: Relevance for memory. Behavioural and Brain Research, 69, 125–135.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.

Huerta, P. T., & Lisman, J. E. (1993). Heightened synaptic plasticity of hippocampal CA1 neurons during a cholinergically induced rhythmic state. Nature, 364, 723–725.

Koerner, A., Thomas, M. J., Weisend, M. P., & Sutherland, R. J. (1996). Hippocampal-dependent memory consolidation: An evaluation of three hypotheses. Society for Neuroscience Abstracts, 22, 1118.

Kolb, B. (1990). Posterior parietal and temporal association cortex. In B. Kolb & R. C. Tees (Eds.), The cerebral cortex of the rat (pp. 459–471). Cambridge, MA: MIT Press.

Kolb, B., & Walkey, J. (1987). Behavioural and anatomical studies of the posterior parietal cortex in the rat. Behavioral and Brain Research, 23, 127–145.

Kudrimoti, H. S., Skaggs, W. E., Barnes, C. A., McNaughton, B. L., Gerrard, J. L., Suster, M. S., & Weaver, K. L. (1996). REM sleep and the reactivation of recent correlation patterns in hippocampal neuronal ensembles. Society for Neuroscience Abstracts, 22, 1871.

Landfield, P. W., & Deadwyler, S. A. (Eds.) (1988). Long-term potentiation: From biophysics to behavior. New York: Alan R. Liss.

Larson, J., & Lynch, G. (1989). Theta pattern stimulation and the induction of LTP: The sequence in which synapses are stimulated determines the degree to which they potentiate. Brain Research, 489, 49–58.

Larson, J., Wong, D., & Lynch, G. (1986). Patterned stimulation at the theta frequency is optimal for the induction of hippocampal long-term potentiation. Brain Research, 368, 347–350.

Leonard, B., & McNaughton, B. L. (1990). Spatial representation in the rat: Conceptual, behavioral, and neurophysiological perspectives. In R. P. Kesner & D. S. Olton (Eds.), Neurobiology of comparative cognition (pp. 363–422). Hillsdale, NJ: Lawrence Erlbaum Associates.

Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6(6), 579–591.

Levy, W. B., & Wu, X. (1996). The relationship of local context cues to sequence length memory capacity. Network: Computation in Neural Systems, 7, 371–384.

Marr, D. (1971). Simple memory: A theory for archicortex. Philosophical Transactions of the Royal Society of London, Series B, 262, 23–81.

McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419–457.

McDonald, R. J., & White, N. M. (1994). Parallel information processing in the water maze: Evidence for independent memory systems involving dorsal striatum and hippocampus. Behavioral and Neural Biology, 61, 260–270.
McGaugh, J. L. (1966). Time-dependent processes in memory storage. Science, 153, 1351–1358.

McNaughton, B. L., Mizumori, S. J. Y., Barnes, C. A., Leonard, B. J., Marquis, M., & Green, E. J. (1994). Cortical representation of motion during unrestrained spatial navigation in the rat. Cerebral Cortex, 4(1), 27–39.

Mehta, M. R., Barnes, C. A., & McNaughton, B. L. (in press). An evidence for Hebbian synaptic potentiation in hippocampus during behavior. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in Neural Information Processing Systems, 9. Cambridge, MA: MIT Press.

Mittelstaedt, M. L., & Mittelstaedt, H. (1980). Homing by path integration in a mammal. Naturwissenschaften, 67, 566–567.

Mogenson, G. J. (1984). Limbic-motor integration—with emphasis on initiation of exploratory and goal-directed locomotion. In R. Bandler (Ed.), Modulation of sensorimotor activity during alterations in behavioral states (pp. 121–138). New York: Alan R. Liss.

Morris, R. G. M., Garrud, P., Rawlins, J. N. P., & O'Keefe, J. (1982). Place navigation impaired in rats with hippocampal lesions. Nature, 297, 681–683.

Morris, R. G. M., Schenk, F., Tweedie, F., & Jarrard, L. E. (1990). Ibotenate lesions of hippocampus and/or subiculum: Dissociating components of allocentric spatial learning. European Journal of Neuroscience, 2, 1016–1028.

Muller, R. U., Kubie, J. L., & Saypoff, R. (1991). The hippocampus as a cognitive graph. Hippocampus, 1(3), 243–246.

Nadel, L. (Ed.) (1991). The hippocampus and space revisited [Special issue]. Hippocampus, 1(3).

O'Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Clarendon Press.

O'Keefe, J., & Recce, M. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330.

O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4(6), 661–682.

Packard, M. G. (1994). "Place" versus "response" learning debate revisited in the brain. Society for Neuroscience Abstracts, 20, 1016.

Packard, M. G., & McGaugh, J. L. (1992). Double dissociation of fornix and caudate nucleus lesions on acquisition of two water maze tasks: Further evidence for multiple memory systems. Behavioral Neuroscience, 106(3), 439–446.

Pavlides, C., Greenstein, J., Grudman, M., & Winson, J. (1988). Long-term potentiation in the dentate gyrus is induced preferentially on the positive phase of θ-rhythm. Brain Research, 439, 383–387.

Pavlides, C., & Winson, J. (1989). Influences of hippocampal place cell firing in the awake state on the activity of these cells during subsequent sleep episodes. Journal of Neuroscience, 9(8), 2907–2918.

Pinto, D. J., Brumberg, J. C., Simons, D. J., & Ermentrout, G. B. (1996). A quantitative population model of whisker barrels: Re-examining the Wilson-Cowan equations. Journal of Computational Neuroscience, 3(3), 247–264.

Poucet, B., Thinus-Blanc, C., & Chapuis, N. (1983). Route planning in cats, in relation to the visibility of the goal. Animal Behavior, 31, 594–599.
Redish, A. D., Elga, A. N., & Touretzky, D. S. (1996). A coupled attractor model of the rodent head direction system. Network, 7(4), 671–685.

Redish, A. D., & Touretzky, D. S. (1996). Details of a comprehensive theory of rodent navigation. Society for Neuroscience Abstracts, 22, 678.

Redish, A. D., & Touretzky, D. S. (1997). Cognitive maps beyond the hippocampus. Hippocampus, 7(1), 15–35.

Rempel-Clower, N. L., Zola, S. M., Squire, L. R., & Amaral, D. G. (1996). Three cases of enduring memory impairment after bilateral damage limited to the hippocampal formation. Journal of Neuroscience, 16, 5233–5255.

Rolls, E. T. (1989). The representation and storage of information in neuronal networks in the primate cerebral cortex and hippocampus. In R. Durbin, C. Miall, & G. Mitchison (Eds.), The computing neuron (pp. 125–159). Reading, MA: Addison-Wesley.

Rolls, E. T. (1996). A theory of hippocampal function in memory. Hippocampus, 6, 601–620.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.

Samsonovich, A., & McNaughton, B. L. (in press). Path integration and cognitive mapping in a continuous attractor model of the hippocampus. Journal of Neuroscience.

Schallert, T., Day, L. B., Weisend, M., & Sutherland, R. J. (1996). Spatial learning by hippocampal rats in the Morris water task. Society for Neuroscience Abstracts, 22, 678.

Scoville, W. B., & Milner, B. (1957). Loss of recent memory after bilateral hippocampal lesions. Journal of Neurology, Neurosurgery, and Psychiatry, 20, 11–21.

Shen, B., & McNaughton, B. L. (1997). Modeling the spontaneous reactivation of experience-specific hippocampal cell assemblies during sleep. Hippocampus, 6(6), 685–693.

Skaggs, W. E. (1995). Relations between the theta rhythm and activity patterns of hippocampal neurons. Unpublished doctoral dissertation, University of Arizona.

Skaggs, W. E., & McNaughton, B. L. (1996). Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271, 1870–1873.

Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in hippocampal neuronal populations and the compression of temporal sequences. Hippocampus, 6(2), 149–173.

Smith, C. (1995). Sleep states and memory processes. Behavioural and Brain Research, 69, 137–145.

Squire, L. R. (1992). Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psychological Review, 99(2), 195–231.

Squire, L. R., & Alvarez, P. (1995). Retrograde amnesia and memory consolidation: A neurobiological perspective. Current Opinion in Neurobiology, 5, 169–177.
Squire, L. R., & Zola-Morgan, S. (1988). Memory: Brain systems and behavior. Trends in Neurosciences, 11(4), 170–175.

Sripanidkulchai, K., & Wyss, J. M. (1986). Thalamic projections to retrosplenial cortex in the rat. Journal of Comparative Neurology, 254, 143–165.

Stewart, M., & Fox, S. E. (1990). Do septal neurons pace the hippocampal theta rhythm? Trends in Neurosciences, 13(5), 163–168.

Sutherland, R. J., & Hoesing, J. M. (1993). Posterior cingulate cortex and spatial memory: A microlimnology analysis. In B. A. Vogt & M. Gabriel (Eds.), Neurobiology of cingulate cortex and limbic thalamus: A comprehensive handbook (pp. 461–477). Boston: Birkhauser.

Sutton, R. S. (Ed.) (1992). Reinforcement learning [Special issue]. Machine Learning, 8(3/4).

Tolman, E. C. (1948). Cognitive maps in rats and men. Psychological Review, 55, 189–208.

Touretzky, D. S., & Redish, A. D. (1996). A theory of rodent navigation based on interacting representations of space. Hippocampus, 6(3), 247–270.

Tsodyks, M., & Sejnowski, T. J. (1995). Associative memory and hippocampal place cells. International Journal of Neural Systems, 6 (Suppl.), 81–86.

Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1996). Population dynamics and theta rhythm phase precession of hippocampal place cell firing: A spiking neuron model. Hippocampus, 6(3), 271–280.

van Groen, T., Vogt, B. A., & Wyss, J. M. (1993). Interconnections between the thalamus and retrosplenial cortex in the rodent brain. In B. A. Vogt & M. Gabriel (Eds.), Neurobiology of cingulate cortex and limbic thalamus: A comprehensive handbook (pp. 123–150). Boston: Birkhauser.

Vanderwolf, C. H. (1971). Limbic-diencephalic mechanisms of voluntary movement. Psychological Review, 78(2), 83–113.

Vanderwolf, C. H. (1990). An introduction to the electrical activity of the cerebral cortex: Relations to behavior and control by subcortical inputs. In B. Kolb & R. C. Tees (Eds.), The cerebral cortex of the rat (pp. 151–189). Cambridge, MA: MIT Press.

Vanderwolf, C. H., & Leung, L.-W. S. (1983). Hippocampal rhythmical slow activity: A brief history and the effects of entorhinal lesions and phencyclidine. In W. Seifert (Ed.), Neurobiology of the hippocampus (pp. 275–302). New York: Academic Press.

Vogt, B. A. (1985). Cingulate cortex. In A. Peters & E. Jones (Eds.), Association and auditory cortices (pp. 89–150). New York: Plenum Press.

Weisend, M. P., Astur, R. S., & Sutherland, R. J. (1996). The specificity and temporal characteristics of retrograde amnesia after hippocampal lesions. Society for Neuroscience Abstracts, 22, 1118.

Whishaw, I. Q., Cassel, J.-C., & Jarrard, L. E. (1995). Rats with fimbria-fornix lesions display a place response in a swimming pool: A dissociation between getting there and knowing where. Journal of Neuroscience, 15(8), 5779–5788.

Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12(1), 1–24.

Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265, 676–679.
Winocur, G. (1990). Anterograde and retrograde amnesia in rats with dorsal hippocampal or dorsomedial thalamic lesions. Behavioural and Brain Research, 38, 145–154.

Witter, M. P. (1993). Organization of the entorhinal-hippocampal system: A review of current anatomical data. Hippocampus, 3, 33–44.

Witter, M. P., Ostendorf, R. H., & Groenewegen, H. J. (1990). Heterogeneity in the dorsal subiculum of the rat: Distinct neuronal zones project to different cortical and subcortical targets. European Journal of Neuroscience, 2, 718–725.

Wood, E. R., Mumby, D. G., Pinel, J. P. J., & Phillips, A. G. (1996). Impaired object recognition memory in rats following ischemia-induced damage to the hippocampus. Behavioral Neuroscience, 107(1), 51–62.

Wu, X., Baxter, R. A., & Levy, W. B. (1996). Context codes and the effect of noisy learning on a simplified hippocampal CA3 model. Biological Cybernetics, 74, 159–165.

Wyss, J. M., & van Groen, T. (1992). Connections between the retrosplenial cortex and the hippocampal formation in the rat: A review. Hippocampus, 2(1), 1–12.

Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16(6), 2112–2126.

Zola-Morgan, S., & Squire, L. R. (1993). Neuroanatomy of memory. Annual Review of Neuroscience, 16, 547–563.

Received October 7, 1996; accepted March 26, 1997.
Communicated by Randall Beer
Near-Saddle-Node Bifurcation Behavior as Dynamics in Working Memory for Goal-Directed Behavior

Hiroyuki Nakahara*
General Systems Studies, University of Tokyo, Tokyo, Japan

Kenji Doya
ATR Human Information Processing Research Laboratories, Kyoto, Japan

*Lab. for Information Representation, Frontier Research Program, RIKEN, Hirosawa 2–1, Wako Saitama, 351–01, Japan. [email protected]

Neural Computation 10, 113–132 (1998)
© 1997 Massachusetts Institute of Technology

In consideration of working memory as a means for goal-directed behavior in nonstationary environments, we argue that the dynamics of working memory should satisfy two opposing demands: long-term maintenance and quick transition. These two characteristics are contradictory within the linear domain. We propose the near-saddle-node bifurcation behavior of a sigmoidal unit with a self-connection as a candidate for the dynamical mechanism that satisfies both of these demands. It is shown in evolutionary programming experiments that the near-saddle-node bifurcation behavior can be found in recurrent networks optimized for a task that requires efficient use of working memory. The result suggests that the near-saddle-node bifurcation behavior may be a functional necessity for survival in nonstationary environments.

1 Introduction
Short-term memory is considered in the psychological literature as a system that registers and retains incoming information in a highly accessible form for a short period of time after the input (Tulving, 1991). Working memory is proposed as an extended concept of short-term memory. According to Baddeley (1992), working memory is defined as a brain system that provides selection, temporary storage, and manipulation of information necessary for such complex cognitive tasks as language comprehension, learning, and reasoning. There can be two basic mechanisms for storage of short-term or working memory in neural networks. One possibility is that information is stored in rapid short-term changes in the synaptic weights (Malsburg & Schneider, 1986; Schmidhuber, 1992). Another possibility, which is investigated in this study, is that information is stored in the sustained activity of neurons due to the dynamics of the network. There is experimental evidence supporting the latter possibility (Funahashi et al., 1989; Goldman-Rakic, 1987;

* Lab. for Information Representation, Frontier Research Program, RIKEN, Hirosawa 2-1, Wako, Saitama 351-01, Japan.
[email protected]
Neural Computation 10, 113–132 (1998)
© 1997 Massachusetts Institute of Technology
Fuster, 1989; Miyashita & Chang, 1988). Recurrent neural network models that replicate some of these experimental data have been proposed (Zipser, 1991; Zipser et al., 1993; McAuley & Stampfli, 1994). In those models, fixed-point attractors of the network dynamics (Amit, 1989) play an essential role in robustly storing the memorized items against noise.

Although robustness is an important requisite for a working memory system, when we consider its role in goal-directed behavior in a dynamically changing environment, its transient properties are also important. As a concrete example, let us consider a foraging task in which an animal uses visual input to catch prey. It is helpful to store the prey location in its working memory if the prey goes behind a bush and the sensory cue becomes temporarily unavailable. However, the memory should not be retained forever, because the prey may have actually gone away or may have been eaten by another animal. Furthermore, if another prey appears near the animal, the animal should quickly load the location of the new prey into working memory without being disturbed by the old memory. This simple example illustrates that there are further requirements for the working memory system in addition to robust maintenance. First, the neural activity for working memory has to be set quickly. Second, the activity should be maintained robustly but not for an infinitely long time. Third, the activity should be reset quickly when there is a novel sensory cue that needs to be stored. In other words, the neural dynamics in working memory for goal-directed behaviors should have the properties of long-term maintenance and quick transition.

In this article, we investigate possible mechanisms that satisfy these seemingly opposing requirements. Our basic hypothesis is that these demands can be satisfied by the nonlinear dynamical behavior of simple recurrent neural networks. In the following sections, the dynamics of a network of sigmoidal units with self-connections will be analyzed. It is shown that both long-term maintenance and quick transition can be achieved when the system parameters are near a saddle-node bifurcation point. Then we test whether such a dynamical mechanism can actually be helpful for the goal-directed behavior of an autonomous agent, in simulations of a foraging task similar to the one used in Nolfi et al. (1990). After optimizing neural networks that control the movement of the agents by evolutionary programming, near-saddle-node bifurcation behavior is robustly found under conditions that demand efficient use of working memory. The result indicates that near-saddle-node bifurcation behavior can emerge in the course of evolution as a necessity for survival in nonstationary environments. Preliminary results for this study are reported in Nakahara and Doya (1996).

2 Near-Saddle-Node Bifurcation Behavior
It is essential that we consider nonlinear dynamics in modeling the function of working memory. The solution of a linear dynamical system is a
superposition of exponential curves in generic cases (Hirsch & Smale, 1974). Suppose we give a pulse-like input to a linear system; then the time constants for the rising and decaying phases of the response are the same. Therefore, it is impossible to achieve both quick transition (fast rising) and long maintenance (slow decaying) of activity with a linear dynamical model. On the other hand, the effective response time constants of a nonlinear system can differ depending on the input and the state of the system.

2.1 Dynamics of a Self-Recurrent Unit. First, we consider the dynamics of a single sigmoidal unit with self-connection weight a and bias b:
y(t + 1) = F(a y(t) + b),   (2.1)

F(x) = 1 / (1 + exp(−x)).   (2.2)

The parameters (a, b) determine the qualitative behavior of the system, such as the number of fixed points and their stabilities. As we change the parameters, the qualitative behavior of the system may suddenly change. This is referred to as a bifurcation (Guckenheimer & Holmes, 1983). One typical example is a saddle-node bifurcation, in which a pair of fixed points, one stable and one unstable, emerges. For example, as the bias b is increased in equation 2.1, the number of fixed points changes from one (see Figure 1, Case A), to two (Case B), three (Case C), two (Case D), and then back to one (not shown). A saddle-node bifurcation occurs when the state transition curve y(t + 1) = F(a y(t) + b) is tangent to y(t + 1) = y(t), as in Figure 1, Cases B and D. Let y* be this point of tangency. We have the following conditions for a saddle-node bifurcation:

F(a y* + b) = y*,   (2.3)

dF(a y + b)/dy |_(y = y*) = 1.   (2.4)

These equations can be solved, by noting F′(x) = F(x)(1 − F(x)), as

a = 1 / (y* (1 − y*)),   (2.5)

b = F⁻¹(y*) − a y* = F⁻¹(y*) − 1 / (1 − y*).   (2.6)

By changing the fixed-point value y* between 0 and 1, we can plot a curve in the parameter space (a, b) on which saddle-node bifurcation occurs, as shown in Figure 2.
Figure 1: State transition diagrams of the self-recurrent unit for four different cases. Case A: One fixed point near y = 0. Case B: A saddle-node bifurcation at y = 0.9. Case C: Three fixed points. Case D: Another saddle-node bifurcation at y = 0.1. In each graph, a solid circle stands for a stable fixed point, an empty circle for an unstable fixed point, and an empty circle with a solid one inside for a saddle fixed point.
The system has only one stable fixed point when the parameters are outside the cusp (Case A) and three fixed points inside the cusp (Case C). A pair of stable and unstable fixed points emerges or disappears when the parameters cross the cusplike curve (Cases B and D). An interesting behavior can be found when the parameters are just outside the cusp, as shown in Figure 3 (center). The system has only one fixed point near y = 0, but once the unit is activated (y ≈ 1), because the trajectory "bounces" in the narrow channel between y(t + 1) = y(t) and the sigmoid activation curve, the unit stays "on" for many time steps and then goes back to the fixed point quickly. Such a mechanism may be useful in satisfying the requirements of the dynamics in working memory for goal-directed behaviors: long-term maintenance and quick transition.
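The channel mechanism described above is easy to reproduce numerically. The following minimal sketch (Python with NumPy; our illustration, not code from the article) computes the bifurcation parameters from equations 2.5 and 2.6 for a tangency at y* = 0.9 and then iterates equation 2.1 with the slightly lower bias b = −7.9 of Figure 3 (center), starting from an activated state:

import numpy as np

def F(x):
    # Sigmoid activation, equation 2.2.
    return 1.0 / (1.0 + np.exp(-x))

def saddle_node_params(y_star):
    # Self-connection a and bias b at which y -> F(a*y + b) is tangent
    # to the diagonal at y_star (equations 2.5 and 2.6).
    a = 1.0 / (y_star * (1.0 - y_star))
    b = np.log(y_star / (1.0 - y_star)) - 1.0 / (1.0 - y_star)
    return a, b

a, b_c = saddle_node_params(0.9)      # a = 11.11..., b_c = -7.80...
b = -7.9                              # just below the bifurcation value

y, trace = 0.95, []
for t in range(30):                   # iterate equation 2.1 with no input
    trace.append(y)
    y = F(a * y + b)
print(" ".join(f"{v:.2f}" for v in trace))

The printed trajectory lingers near y = 0.9 for roughly a dozen steps (the narrow channel) and then drops quickly to the fixed point near y = 0, reproducing the asymmetry between slow decay and fast transition.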
Figure 2: The bifurcation set in the parameter space of the self-recurrent unit. Saddle-node bifurcation is seen on the cusp-shaped curve. There are three fixed points inside and one fixed point outside the cusp.
2.2 Network of Self-Recurrent Units. Next, we examine the dynamics of a network of the above self-recurrent units (see Figure 4),
yi(t + 1) = F[a yi(t) + b + Σ_{j≠i} cij yj(t) + di xi(t)],   (2.7)

where a is the self-connection weight, b is the bias, cij is the lateral connection weight, di is the input connection weight, and xi(t) is the external input. The effect of the sum of the lateral and external inputs,

ui = Σ_{j≠i} cij yj + di xi,
is equivalent to a change in the bias, which slides the sigmoid curve in the state transition diagram horizontally without changing its slope. Therefore, we can analyze the behavior of multiple units based on the single-unit behavior discussed above. For example, let us consider a case in which a saddle-node bifurcation occurs at y1 = 0.9. From equations 2.5 and 2.6, the parameters for this case are a = 11.11 and b = b1 ≈ −7.80. As we increase b while keeping a constant, the system first has three fixed points, as in Figure 1, Case C, and then the lower
Figure 3: Temporal responses of self-recurrent units. (center) Near-saddle-node bifurcation with a = 11.1111, b = −7.9. (left) Increased bias, b = −3.0. (right) Decreased bias, b = −9.0.
Figure 4: Example of a recurrent network with uniform self-connection a and bias b.
two fixed points merge together at y = 1 − y1 = 0.1 at the bias b2 ≈ −3.31, which forms another saddle-node bifurcation, as in Figure 1, Case D. Let the bias be b0 = −7.90 so that the unit is near the saddle-node bifurcation when there are no lateral or external inputs. If the input sum exceeds the threshold θ (ui > θ = b2 − b0 ≈ 4.59), the lower fixed point at y = 0.1
disappears, and the state jumps up to the upper fixed point near y = 1, quickly turning the unit "on" (see Figure 3, left). As we saw above, when the input is removed, the state stays near y = 0.9 for many time steps (see Figure 3, center). If there are inhibitory lateral connections, the activation of unit i raises the threshold for the other units as θk = θ − cki yi (k ≠ i), making it more difficult for other units to turn on. On the other hand, the time course of the activated unit i is affected very little by subthreshold input to the other units k, because their activity is kept low (yk < 0.1). When there is a strong input to unit k that exceeds the threshold θk, however, that unit is turned on and sends an inhibitory input to unit i, which is equivalent to a decrease in the bias. As a result, the activation of unit i quickly goes down (see Figure 3, right).
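This load-maintain-switch interaction can be sketched with two units obeying equation 2.7. In the following illustration (ours, not from the article), a and b are the near-bifurcation values used in the text, while the inhibitory lateral weight and the cue amplitudes and timings are arbitrary assumptions chosen so that the cues exceed the threshold θ ≈ 4.59:

import numpy as np

def F(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b = 11.11, -7.9
c = -8.0                      # mutual inhibitory lateral weight (assumed)
y = np.array([0.05, 0.05])

for t in range(30):
    x = np.zeros(2)
    if 5 <= t < 9:
        x[0] = 6.0            # suprathreshold cue loads unit 0
    if 16 <= t < 20:
        x[1] = 12.0           # stronger cue to unit 1 overrides the memory
    u = c * y[::-1] + x       # lateral plus external input, equation 2.7
    y = F(a * y + b + u)
    print(t, np.round(y, 2))

Unit 0 turns on within a few steps of its cue and sustains its activity after the cue is removed; the later cue to unit 1 both turns unit 1 on and, through the inhibition it sends back, resets unit 0 quickly.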
3 Evolution of Network Dynamics

In section 2, we theoretically showed the potential usefulness of near-saddle-node bifurcation behavior for satisfying the demands of the dynamics in working memory. We further hypothesize that such behavior is indeed useful in animal behaviors and can be found in the course of learning and evolution of the neural system.

To test our hypothesis, we simulated a foraging task in which a creature seeks food in a grid-like world (see Figure 5), similar to the task used by Nolfi et al. (1990). Our purpose in this simulation is to see whether the near-bifurcation dynamics discussed in the previous section can actually improve the creature's performance in a nonstationary environment where selection and memory of sensory input are necessary. Evolutionary programming (Fogel et al., 1990) was used to optimize the recurrent network that controls the movement of the creature.

Figure 5 shows an example of the grid-like world. A certain number of food items were in fixed positions, each of which turned visible or invisible in a stochastic fashion, as determined by a two-state Markov system. A creature ate the food when it reached the location of the food, regardless of whether it was visible or not. When a food item was eaten, it was removed from that position, and a new food item was placed randomly. The size of the world was 20 × 20, and both ends were connected as a torus. The amount of food a creature found in a certain time period was the measure of its performance.

This foraging task necessitates the functions of working memory. At each time step, the creature receives sensory information signaling the directions of only some of the food items, because each food item may be visible or invisible. Following the immediate sensory cues may result in weaving and inefficient movements. Thus, the direction of a food item must be stored in working memory and maintained long enough until the creature gets the food item. On the other hand, if the creature happens to pass by another
Figure 5: The foraging task in a grid-like world. Note that the example shown is not a 20 × 20 but a 10 × 10 grid world.
food item, the stored memory has to be switched quickly to the new, closer item. Hence, long-term maintenance and quick transition are important for the functions of working memory in this foraging task.

3.1 The Creature. A creature had five visual sensors, each of which detected food within a particular 45-degree sector (see Figure 6, top). The activation of each sensory unit was given by
xi = Σ_j 1/rj,
where rj was the distance to the jth food item that was visible within the sector at the time. At each time step, the creature executed one of three motor commands: L: turn left (45 degrees), C: step forward, or R: turn right (see Figure 6, middle). The action of the creature was controlled by a two-layer neural network (see Figure 6, bottom). The dynamics of each of the five units in the visual layer was given by

yi(t + 1) = F(a yi(t) + b + Σ_{j≠i} cij yj(t) + d xi(t)),   (3.1)
where yi(t) was the output of the visual unit at time t, a was the self-connection weight, b was the bias, cij was the cross-connection weight, d
Figure 6: A creature's sensory input (top), motor system (middle), and network architecture (bottom).
was the input connection weight, and xi(t) was the external sensory input. Note that the self-connection a, the bias b, and the input weight d were the same for all units. Each of the three units in the motor layer coded the probability of taking one of the three motor commands (L, C, R). Their outputs zk were given by

vk(t) = ek + Σ_i fki yi(t),   (3.2)

zk(t) = exp(vk(t)) / Σ_l exp(vl(t)),   (3.3)
where ek was the bias and fki was the feedforward connection weight. Note that adding a uniform bias ek to all the units did not affect the output because of the normalization in equation 3.3. In order to avoid this redundancy, we fixed the bias of the center unit at e2 = 0.

In general, the time step of the internal operation of the network can be different from that of the external world. We chose two steps of internal time t to correspond to one step of external time T, that is, T = 2t. This allowed an indirect effect of sensory input through the lateral connections to be utilized in taking the next action. In addition, the activation pattern in the visual layer was shifted when the creature made a turn, which should give the proper mapping between the working memory and the sensory input at the next external time step.

3.2 The World. The characteristics of the world were determined by two sets of parameters: the food density and the parameters of the Markov transition matrix. We fixed the food density at 0.03; there were 12 food items randomly distributed in the 20 × 20 grid world. At each world time step, each food item took one of two states, "visible" (on) or "invisible" (off), as given by a Markov system,
[ Poff(T + 1) ]   [ p0       1 − p1 ] [ Poff(T) ]
[ Pon(T + 1)  ] = [ 1 − p0   p1     ] [ Pon(T)  ],   (3.4)

where Pon(T) and Poff(T) were the probabilities that the food item was visible (on) and invisible (off) at time T, respectively. Note that the stationary distribution (P̄off, P̄on) is given by

P̄off = (1 − p1) / ((1 − p0) + (1 − p1))   and   P̄on = (1 − p0) / ((1 − p0) + (1 − p1)).
Refer to Figure 7 to see points in the parameter space (p0 , p1 ) of the Markov transition matrix that were used in the simulation.
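As a quick numerical check (our sketch, not part of the article), the closed-form stationary distribution can be verified and a visibility sequence sampled for one food item:

import numpy as np

p0, p1 = 0.875, 0.125                      # condition I in Figure 7
P = np.array([[p0, 1 - p1],
              [1 - p0, p1]])               # transition matrix, equation 3.4

P_off = (1 - p1) / ((1 - p0) + (1 - p1))   # stationary distribution
P_on = (1 - p0) / ((1 - p0) + (1 - p1))
assert np.allclose(P @ [P_off, P_on], [P_off, P_on])

rng = np.random.default_rng(0)
state, seq = 0, []                         # 0 = invisible, 1 = visible
for _ in range(50):
    state = rng.choice(2, p=P[:, state])   # column gives next-state law
    seq.append(state)
print(f"stationary P_on = {P_on:.3f}", seq)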
Figure 7: The environmental parameter sets (p0, p1) tested in the simulation. A = (0.75, 0.75), B = (0.5, 0.5), C = (0.125, 0.125), D = (0.875, 0.75), E = (0.562, 0.125), F = (0.875, 0.5), G = (0.781, 0.125), H = (0.888, 0.2), and I = (0.875, 0.125). Oblique lines represent the sets of parameters that result in the same stationary distributions: P̄on = 1/2, P̄on = 1/3, P̄on = 1/5, and P̄on = 1/7.

3.3 Evolutionary Programming. For the sake of simplicity, we put symmetric constraints on the connection weights as follows:
{cij} =
( 0   c1  c2  0   0  )
( 0   0   c3  0   0  )
( c4  c5  0   c5  c4 )
( 0   0   c3  0   0  )
( 0   0   c2  c1  0  )

{fkj} =
( f1  f2  0   0   0  )
( 0   f3  f4  f3  0  )
( 0   0   0   f2  f1 )

The bias for the motor units was also symmetric: (eL, eC, eR) = (e, 0, e). Therefore, each creature's network was characterized by the 13 parameters (a, b, c1, . . . , c5, d, e, f1, . . . , f4).
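For concreteness, the sketch below (our illustration; the numerical values are arbitrary, not the evolved parameters, which are not listed in the article) assembles the constrained matrices from the 13 parameters and performs one visual-layer update (equation 3.1) followed by the stochastic motor choice (equations 3.2 and 3.3):

import numpy as np

def F(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_weights(c, f):
    # Symmetric connection matrices from c1..c5 and f1..f4.
    c1, c2, c3, c4, c5 = c
    f1, f2, f3, f4 = f
    C = np.array([[0, c1, c2, 0, 0],
                  [0, 0, c3, 0, 0],
                  [c4, c5, 0, c5, c4],
                  [0, 0, c3, 0, 0],
                  [0, 0, c2, c1, 0]], float)
    Fw = np.array([[f1, f2, 0, 0, 0],
                   [0, f3, f4, f3, 0],
                   [0, 0, 0, f2, f1]], float)
    return C, Fw

def step(y, x, a, b, d, e, C, Fw, rng):
    y = F(a * y + b + C @ y + d * x)        # visual layer, equation 3.1
    v = np.array([e, 0.0, e]) + Fw @ y      # motor potentials, equation 3.2
    z = np.exp(v - v.max()); z /= z.sum()   # softmax, equation 3.3
    return y, rng.choice(["L", "C", "R"], p=z)

rng = np.random.default_rng(1)
C, Fw = build_weights([-1.0, -0.5, -1.5, -0.8, -1.0], [4.0, 6.0, 5.0, 7.0])
y = np.zeros(5)
y, action = step(y, np.array([0, 0, 0.8, 0, 0]), 11.11, -7.9, 4.0, -0.5, C, Fw, rng)
print(np.round(y, 2), action)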
A population of 60 creatures was tested in each generation. Performance was measured by the number of pieces of food a creature obtained in T = 400 time steps. Each of the top 20 scoring creatures produced three offspring: one identical copy of the parent's parameters and two copies of these parameters with gaussian fluctuations ~ N(0, 1.5²). These three offspring of each of the top 20 scoring creatures (3 × 20 = 60) became the next generation; their performance was then measured, and the process continued. In preliminary experiments, we generated the initial population with random parameters whose range was [−10.0, 10.0], except for the input connection weight d, whose range was [1.0, 6.0]. Under most conditions, the population converged to a limited range of the parameter space after evolution. Therefore, in the simulations below, a smaller range of initial parameters was used in order to speed up convergence (see the appendix). In this article, the results after 100 generations are reported.

4 Results

4.1 Creature's Performance. The mean performance of the top 20 scoring individuals in the 100th generation is shown in Table 1 for different environmental parameters (p0, p1) of the Markov transition matrix. As a comparison, we tested the performance of creatures with a feedforward network by keeping a = c1 = · · · = c5 = 0 and letting the other weights evolve for 100 generations. Performance in the feedforward case was always lower than that in the recurrent case for each set of environmental parameters (see Table 1). We performed a t-test between recurrent and feedforward networks under each environmental condition (p0, p1), and the difference was statistically significant (p < 0.0001) under all conditions. Generally, the performance was lower as p0 (the probability of food staying invisible) was increased and as p1 (the probability of food staying visible) was decreased. The difference in performance between the top-scoring creatures of the recurrent and feedforward networks, normalized by the performance of the top-scoring feedforward network, is plotted in Figure 8. The difference tended to be more marked as p1 became small and p0 became large.

4.2 Convergence to Near-Bifurcation Region. The self-connection and bias parameters of the top 10 scoring creatures under different environmental parameters are shown in Figure 9. When either p0 was small or p1 was large, the value of the self-connection a was almost zero, as shown in Figure 9 for the Markov parameter setting D = (0.875, 0.75). Convergence to a region of the parameters (a, b) similar to that of D was also seen in the cases A = (0.75, 0.75), B = (0.5, 0.5), and C = (0.125, 0.125).
Table 1: Mean Performances of the Top 20 Scoring Individuals after 100 Generations under Each Set of Environmental Parameters (p0, p1). Standard deviations are in parentheses; the Markov transition matrix of each food item is determined by (p0, p1) as in equation 3.4.

Condition (p0, p1)    Recurrent        Feedforward
A (.750, .750)        50.2 (±0.47)     46.7 (±0.51)
B (.500, .500)        52.3 (±0.63)     45.8 (±0.60)
C (.125, .125)        59.0 (±0.69)     45.7 (±1.02)
D (.875, .750)        42.5 (±0.36)     37.4 (±0.51)
E (.562, .125)        46.0 (±0.77)     37.0 (±1.02)
F (.875, .500)        33.1 (±0.43)     27.4 (±0.42)
G (.781, .125)        34.9 (±0.87)     27.9 (±0.39)
H (.888, .200)        27.6 (±0.58)     22.3 (±0.38)
I (.875, .125)        28.3 (±0.54)     22.2 (±0.52)
As p0 (the probability of food items staying invisible) was increased and p1 (the probability of food items staying visible) was decreased, in other words, as the environment became more severe, the convergence to a region in the vicinity of the saddle-node bifurcation boundary, called the near-saddle-node bifurcation region in this study, became more prominent. Examples of such network parameters (a, b) are shown in Figure 9, D, F, H, and I, where the values of p0 were nearly the same (see Figure 7). It is clearly seen that the parameters found after evolution lie just underneath the saddle-node bifurcation curve. This was most prominent in case I, where p1 was the smallest.

4.3 Dynamics of Activation. Figure 10 illustrates the network dynamics of the top-scoring creature for environmental condition I. As seen in Figure 9, the network had converged to the near-saddle-node bifurcation region. It is clearly seen that the units in the visual layer, especially the "center" unit y3, functioned as a source of working memory. For example, from T = 5 to T = 9, the "center" unit y3 exhibited sustained activity with-
Figure 8: The relative performance of the creatures after 100 generations. At each point of the parameter space (p0, p1), the height of the solid bar represents the difference in performance between the top-scoring creatures of the recurrent and feedforward networks, normalized by the performance of the top-scoring feedforward creature. Oblique lines represent the sets of parameters that result in the same stationary distributions: P̄on = 1/2, P̄on = 1/3, P̄on = 1/5, and P̄on = 1/7.
out any input. Between T = 16 and T = 40, the unit y3 exhibited robust maintenance of activity with only intermittent inputs to it, even though there were sometimes small inputs to other units in the visual layer. When enough input came into a noncenter unit, for example, y2 at T = 40, that unit quickly turned on, and y3 was immediately suppressed. The activation of y2 in the visual layer was propagated to the units in the motor layer so that the creature made a left turn. Through the shift of the visual-layer activation pattern that accompanies a turn (see section 3.1), the activation of y2 was then shifted to y3, and it remained active for several time steps. It should be emphasized that the near-bifurcation behavior of the unit y3 realized long-term maintenance and quick transition with the help of the interaction with other units.
Figure 9: The convergence of the network parameters (a, b) under different environmental parameters (p0, p1) of the Markov transition matrix. The network parameters of the top 10 scoring creatures are plotted in the bifurcation diagram. (left top) D = (0.875, 0.75). (right top) F = (0.875, 0.5). (left bottom) H = (0.888, 0.2). (right bottom) I = (0.875, 0.125).
Even after optimization by evolutionary programming, it was sometimes observed in some lower-scoring individuals of the population that the dynamics of the visual-layer units did not realize near-saddle-node bifurcation behavior. This is partly because the function of working memory can be realized not in the visual layer alone but in the entire behaving system, including the motor layer. With a large negative bias e in the motor layer, the choice of motor command can be strongly biased toward "center," that is, "step forward." In this case, once a creature detects food far away and makes a turn toward it, the creature continues most of the time to "step forward" toward the food, so it does not have to remember the direction of the food. In other words, without long-term maintenance of memory in the visual layer, a creature can use its body direction as working memory, exploiting the fact that the default choice of motor command is to go straight ahead.
Figure 10: Example of the creature's activation dynamics in the simulated environment I = (0.875, 0.125). y1 through y5 indicate the outputs of the visual layer. L, C, and R indicate the outputs of the motor layer, corresponding to the motor commands "turn left," "step forward," and "turn right," respectively. Dotted lines on the visual units represent the external input. Arrowheads at the top of each visual unit indicate that the activation of the visual layer is shifted according to the creature's turn. Dots at the top of each motor unit show that the corresponding motor command is chosen at that time. Large dots on the horizontal axis of C show that food is obtained at that time.
5 Discussion
In this article, long-term maintenance and quick transition are regarded as crucial requirements for the dynamics of working memory in goal-directed behaviors. In addition to robust maintenance, rapid selection, loading, and cancellation of neural activities are important issues to be considered in dynamical models of working memory. We mathematically analyzed the dynamical characteristics of a self-recurrent sigmoidal unit with a bias. It was shown that both long-term maintenance and quick transition can be realized near a saddle-node bifurcation. The behavior of a network of such recurrent units can be analyzed by considering inputs as changes in the bias. Near-saddle-node bifurcation behavior can thus be considered a candidate for the dynamics of working memory for goal-directed behaviors.

By simulation of foraging tasks, we tested our hypothesis that the dynamics of long-term maintenance and quick transition in working memory is important and valuable for goal-directed behaviors. Evolutionary programming was used to optimize the neural network organisms. Convergence to the near-saddle-node bifurcation region became more pronounced, and the difference in performance between optimized recurrent networks and feedforward networks became more prominent, as the probability of food items staying invisible was increased and the probability of food items staying visible was decreased. It should be emphasized that the convergence to near-saddle-node bifurcation behavior was not hand-coded by the experimenter but emerged in the optimization process of evolutionary programming in interaction with the given nonstationary environment. This result implies that the proposed characteristics of the dynamics of working memory, that is, long-term maintenance and quick transition, may be an emergent functional property of biological systems evolving to survive in a varying world.

Although this study investigated how a simple recurrent network can satisfy the properties of long-term maintenance and quick transition, these properties may not necessarily be implemented in a single network. As stated in section 4.3, there were cases in which the network dynamics in the visual layer did not itself realize near-saddle-node bifurcation behavior, but the strong bias toward "step forward" in the motor layer enabled the creature's body direction to serve as the source of working memory.

The single units in our network model may not correspond to single neurons in biological systems. However, the characteristic behaviors of near-saddle-node bifurcation systems have broad applicability as models of neural systems at different levels. Recurrent network models such as the one proposed here may serve as a coarse-grained model of working memory realized by the local dynamics of populations of neurons or by recurrent loops connecting multiple brain areas. The attention and short-term memory systems have tended to be treated as distinct systems, the former selecting what is to be memorized
by the latter. A marked feature of the current study is that the functions of selection, storage, and switching are emergent properties of a single dynamical system.

Nonlinear dynamical system theory has been successfully applied to modeling neural mechanisms and cognitive systems at different levels (Port & Van Gelder, 1995; Kelso, 1995), such as the level of single neurons (Rinzel & Ermentrout, 1989; Marder & Abbott, 1995; Guckenheimer et al., 1993), motor behaviors (Taga, 1995; Collins & Stewart, 1993; Beer, 1995), and language (Elman, 1995). Techniques of research in artificial life, such as evolutionary programming, are helpful in exploring what cognitive functions are necessary for behaving in various environments (Belew & Mitchell, 1996). This article is an example of a fruitful convergence of these two lines of research.

Appendix
The ranges of the initial parameters were set as follows: [5.0, 10.0] for a, [−8.0, −2.0] for b, [−2.0, 0.50] for c1, c2, c3, [−1.5, 0.50] for c4, c5, [1.0, 6.0] for d, [−1.0, 0.0] for e, [3.0, 6.0] for f1, f3, and [5.0, 9.0] for f2, f4.
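For concreteness, a minimal sketch of the initialization and reproduction scheme of section 3.3, using these ranges, might look as follows (our code, not the authors'; `evaluate`, which would run the foraging task, is a placeholder):

import numpy as np

rng = np.random.default_rng(0)

# Parameter order: (a, b, c1, c2, c3, c4, c5, d, e, f1, f2, f3, f4).
lo = np.array([5.0, -8.0, -2.0, -2.0, -2.0, -1.5, -1.5, 1.0, -1.0, 3.0, 5.0, 3.0, 5.0])
hi = np.array([10.0, -2.0, 0.5, 0.5, 0.5, 0.5, 0.5, 6.0, 0.0, 6.0, 9.0, 6.0, 9.0])
population = rng.uniform(lo, hi, size=(60, 13))

def next_generation(population, scores):
    # Keep the top 20 scorers; each contributes one exact copy and two
    # copies perturbed by N(0, 1.5^2), giving 60 creatures again.
    top = population[np.argsort(scores)[::-1][:20]]
    noisy = np.repeat(top, 2, axis=0) + rng.normal(0.0, 1.5, size=(40, 13))
    return np.vstack([top, noisy])

# for generation in range(100):
#     scores = evaluate(population)   # foraging performance over 400 steps
#     population = next_generation(population, scores)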
This work was done when H.N. visited the Department of Cognitive Science and Institute for Neural Computation, University of California at San Diego, La Jolla, California. H.N. is grateful to Ed Hutchins for his generous support; to Adrian Robert, Paul Rodriguez, David Rogers, and Chris Fry for their helpful comments; to Mark Wallen for his technical support; and to the Department of Cognitive Science, University of California at San Diego, for its hospitality. We also thank anonymous reviewers for their helpful comments. H.N. is supported partly by the JSPS fellowship.

References

Amit, D. (1989). Modelling brain function. Cambridge: Cambridge University Press.
Baddeley, A. D. (1992). Working memory. Science, 255, 556–559.
Beer, R. D. (1995). Computational and dynamical languages for autonomous agents. In R. F. Port & T. Van Gelder (Eds.), Mind as motion (pp. 121–147). Cambridge, MA: MIT Press.
Belew, R. K., & Mitchell, M. (Eds.). (1996). Adaptive individuals in evolving populations. Redwood City, CA: Addison-Wesley.
Collins, J. J., & Stewart, I. (1993). Hexapodal gaits and coupled nonlinear oscillator models. Biological Cybernetics, 68, 287–298.
Elman, J. L. (1995). Language as a dynamical system. In R. F. Port & T. Van Gelder (Eds.), Mind as motion (pp. 195–225). Cambridge, MA: MIT Press.
Fogel, D. B., Fogel, L., & Porto, V. (1990). Evolving neural networks. Biological Cybernetics, 63, 487–493.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. Journal of Neurophysiology, 61(2), 331–349.
Fuster, J. M. (1989). The prefrontal cortex: Anatomy, physiology, and neuropsychology of the frontal lobe (2nd ed.). New York: Raven.
Goldman-Rakic, P. S. (1987). Circuitry of primate prefrontal cortex and regulation of behavior by representational memory. In V. B. Mountcastle (Ed.), Handbook of physiology (Vol. 5, pp. 373–414). Bethesda, MD: American Physiological Society.
Guckenheimer, J., Gueron, S., & Harris-Warrick, R. M. (1993). The dynamics of a conditionally bursting neuron. Philosophical Transactions of the Royal Society of London, Series B: Biological Sciences, 341, 345–359.
Guckenheimer, J., & Holmes, P. (1983). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields. Berlin: Springer-Verlag.
Hirsch, M. W., & Smale, S. (1974). Differential equations, dynamical systems, and linear algebra. New York: Academic Press.
Kelso, S. J. A. (1995). Dynamic patterns. Cambridge, MA: MIT Press.
Malsburg, C. v. d., & Schneider, W. (1986). A neural cocktail-party processor. Biological Cybernetics, 54, 29–40.
Marder, E., & Abbott, L. F. (1995). Theory in motion. Current Opinion in Neurobiology, 5, 832–840.
McAuley, J. D., & Stampfli, J. (1994). Analysis of the effects of noise on a model for the neural mechanism of short-term active memory. Neural Computation, 6(4), 668–678.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Nakahara, H., & Doya, K. (1996). Dynamics of attention as near saddle-node bifurcation behavior. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 38–44). Cambridge, MA: MIT Press.
Nolfi, S., Elman, J., & Parisi, D. (1990). Learning and evolution in neural networks (Tech. Rep. No. 9019). San Diego: University of California.
Port, R. F., & Van Gelder, T. (Eds.). (1995). Mind as motion. Cambridge, MA: MIT Press.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 135–169). Cambridge, MA: MIT Press.
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1), 131–139.
Taga, G. (1995). A model of the neuro-musculo-skeletal system for human locomotion. Biological Cybernetics, 73, 113–121.
Tulving, E. (1991). Concepts of human memory. In L. R. Squire, N. M. Weinberger, G. Lynch, & J. L. McGaugh (Eds.), Memory: Organization and locus of change (pp. 3–32). New York: Oxford University Press.
Zipser, D. (1991). Recurrent network model of the neural mechanism of short-term active memory. Neural Computation, 3(2), 179–193.
Zipser, D., Kehoe, B., Littlewort, G., & Fuster, J. (1993). A spiking network model of short-term active memory. Journal of Neuroscience, 13(8), 3406–3420.

Received August 29, 1996; accepted May 2, 1997.
Communicated by Andrew Barto
The Canonical Form of Nonlinear Discrete-Time Models

Gérard Dreyfus
Yizhak Idan
ESPCI, Laboratoire d'Électronique, 75005 Paris, France
Discrete-time models of complex nonlinear processes, whether physical, biological, or economic, usually take the form of systems of coupled difference equations. In analyzing such systems, one of the first tasks is to find a state-space description of the process, that is, a set of state variables and the associated state equations. We present a methodology for finding a set of state variables and a canonical representation for a class of systems described by a set of recurrent discrete-time, time-invariant equations. In the field of neural networks, this is of special importance, since the application of standard training algorithms requires the network to be in a canonical form. Several illustrative examples are presented.

1 Introduction

Black box modeling, whether linear or nonlinear, is a highly valuable engineering tool whose limitations are well known (the training data are usually corrupted with random noise or with deterministic but unmeasured disturbances, they may not be representative of the whole range of operation of the process, etc.). The Bayesian approach, imposing constraints on the parameters based on prior knowledge, is one way to circumvent some of these problems. Alternatively, when confronted with the task of modeling a complex process, one may take advantage of the wealth of knowledge that usually exists on the process, or on part of it, in the form of mathematical equations derived from physics (or from chemistry, biology, economics, etc.). These equations may be approximate or incomplete; hence they may not meet the accuracy requirements of the application under consideration. Nevertheless, they are useful for describing the deterministic behavior of the process. Semiphysical modeling is the approach whereby a first model, usually containing unknown parameters, is designed from prior mathematical knowledge, is complemented with black boxes wherever necessary, and has its unknown parameters finally determined from measurements, as in black-box modeling (Lindskog, 1994). This technique has been successfully applied to the modeling of a complex industrial process (Ploix & Dreyfus, 1996; Ploix & Dreyfus, 1997). One of the problems of this approach is that the discretized equations of the physical model are a set of coupled nonlinear difference equations that are not, in general, in the form of a state-space model.

Neural Computation 10, 133–164 (1998)
© 1997 Massachusetts Institute of Technology
Handling such a model usually requires finding a set of state variables and deriving the corresponding discrete-time state equations.

The black box capabilities of discrete-time recurrent neural nets for modeling dynamic processes have been extensively investigated (see Zbikowski & Hunt, 1995). In most cases, the dynamic models used are input-output models, consisting of a feedforward net whose output is fed back to the input with one or several unit delays. Such an architecture can readily be trained by backpropagation through time. However, in the case of semiphysical modeling, the equations of physics may suggest much more complicated architectures, with feedback within the neural network itself, for instance (various such architectures will be considered in this article). In that case, the choice is the following: either derive and implement an ad hoc training algorithm for the specific architecture used to solve the specific modeling problem, or put the network into an equivalent canonical form (introduced in Nerrand, Roussel-Ragot, Personnaz, Dreyfus, & Marcos, 1993), which can be trained by backpropagation through time. In this sense, backpropagation is generic: it can be applied to any neural network, whether recurrent or feedforward, however complex, provided its canonical form has been derived.

In this article, we show that given a set of discrete-time, time-invariant difference equations of the form in equation 2.1, it is possible to find automatically a set of state variables and the corresponding canonical form. The proposed procedure is based on a sequence of transformations and simple computations on a graph representation of the model, reminiscent of the flow graph technique used in linear systems theory (linear control, electronics). In nonlinear modeling, bond-graph techniques (Thoma, 1991) have been extensively used for deriving mathematical models from physical equations. The use of bond graphs may be viewed as a preprocessing step: it may provide a set of difference equations that subsequently can be handled as described in this article in order to derive an appropriate canonical form.

The article is organized as follows: In section 2, we present the problem in detail and provide definitions. Section 3 describes the various steps of the derivation of the order of the network, and subsequently of the derivation of a state vector and of the state equations. Section 4 shows typical examples of the procedure described in section 3. The appendixes provide formal proofs and a detailed treatment of the examples.

2 Definitions and Presentation of the Problem

2.1 The Canonical Form of a Discrete-Time Nonlinear Model. It is well known from linear systems theory that a process described by a given transfer function may be represented by a number of state-space representations, corresponding to different sets of state variables. All of these state-space representations are strictly equivalent, but some have specific properties that are likely to make them more useful or more easily tractable than others. Such specific representations are termed canonical forms (Jordan canonical
form, first companion form, etc.). For nonlinear models, the term canonical form does not have a universal meaning. It has been defined for specific families of models, such as S systems (Voit, 1991). For our purposes here, we consider that a discrete-time model is in a canonical form if it is in the form

z(n + 1) = ϕ[z(n), u(n)]
y(n + 1) = ψ[z(n + 1)],

where z(n) is the minimal set of ν variables necessary for computing completely the state of the model at time n + 1 if the state of the model and its external input vector u(n) (control inputs, measured disturbances) are known at time n, and y(n) is the output vector. In terms of recurrent neural network architecture, the dynamic part of the canonical form is made of a feedforward network computing function ϕ, whose inputs are the state variables and external inputs at time n and whose outputs are the state variables at time n + 1. The output at time n + 1 is computed from the state variables at time n + 1 by a feedforward neural network implementing function ψ.

2.2 Presentation of the Problem: From an Arbitrary Discrete-Time Model to a Canonical Form. We consider a discrete-time model consisting of a set of N equations of the form

xi(n + 1) = Ψi({xj(n − τij,k + 1)}, {ul(n − τil,k + 1)}),
    i, j = 1, . . . , N,  l = N + 1, . . . , N + N0,  k > 0,   (2.1)
where Ψi is an arbitrary function, τij,k is a positive integer denoting the delay of the kth delayed value of variable xj used for the computation of xi(n + 1), and ul denotes an external input. Relation 2.1 expresses the fact that the value of variable xi at time n + 1 may be a nonlinear function of all past values of the variables xj (including xi itself), of the present values of the variables (excluding xi itself), and of all external inputs at time n + 1 or at previous times. These equations are usually complemented by an output (or observation) equation expressing the relations between the outputs and the state variables of the model. In the context of neural networks, equations 2.1 may be considered as the description of a recurrent network where xi is the output of neuron i, or the output of a feedforward neural network i, and Ψi is the activation function of neuron i, or the function implemented by the feedforward network i. As a didactic example, consider a process described by the following model:
ẍ1 = f1(x1, x2, x3, u)
x2 = f2(x1, x3)
ẍ3 = f3(x1, ẋ2)
y = x3

where f1, f2, and f3 are nonlinear functions. After discretization (by Euler's method, for instance), these equations have the following form:

x1(n + 1) = Ψ1[x1(n), x1(n − 1), x2(n − 1), x3(n − 1), u4(n − 1)],
x2(n + 1) = Ψ2[x1(n + 1), x3(n + 1)],                               (2.2)
x3(n + 1) = Ψ3[x3(n), x3(n − 1), x1(n − 1), x2(n), x2(n − 1)],
y(n + 1) = x3(n + 1).

Thus, referring to the relations in equation 2.1, one has N = 3, N0 = 1, τ11,1 = 1, τ11,2 = 2, τ12,1 = 2, τ13,1 = 2, τ14,1 = 2, τ21,1 = 0, τ23,1 = 0, τ33,1 = 1, τ33,2 = 2, τ31,1 = 2, τ32,1 = 1, and τ32,2 = 2.

The purpose of the article is to present a methodology that allows one to transform a set of discrete-time equations of the form in equations 2.1 into a canonical form as defined in section 2.1: to find the minimal set of state variables and the corresponding functions ϕ and ψ (which will, in general, have parameters that are to be estimated from measured data). This transformation is performed in three steps:

1. Find the order ν of the system, that is, find the minimum number of variables {zi} that completely describe the model at time n if their initial values are known, and if the values of the external inputs {ul} are known at times 0 to n.

2. Find a state vector, that is, a set of ν state variables.

3. Transform equations 2.1 into the state equations governing the state variables derived in step 2.

In the above example, equations 2.2 are not in a canonical form; however, a canonical form is readily derived by substituting the expression of x2(n + 1) into the equations that give x1(n + 1) and x3(n + 1) (or, in the continuous-time model, substituting x2(t) into the expressions of the second derivatives of x1(t) and x3(t)): the order of the model is 4, and the state variables are x1(n), x1(n − 1), x3(n), x3(n − 1) (or x1(t), x3(t) and their first derivatives). We prove in this article that for discrete-time models, these derivations and substitutions can be viewed as a sequence of graph transformations that can be performed on a computer in polynomial time.

3 Derivation of a State Vector

3.1 Graph Representation of a Dynamic Model. We show that the derivation of a canonical form can be performed by a set of transformations on
a graph representation of the recurrent equations 2.1. We define a finite directed graph G(E, V) consisting of a set of edges E and a set of vertices V. Each vertex νi represents a variable xi. A directed edge eij from vertex νj to vertex νi represents a nonzero term on the right-hand side of equation i of the system of equations 2.1. The length of each edge is the associated delay τij,k. The number of parallel edges from νj to νi is equal to the number of different delays τij,k. A directed edge from νj to νi of length τ is denoted by e^τ_ij (for simplicity, the superscript τ will be omitted whenever the context makes it unnecessary); {Ri} denotes the set of outgoing edges from vertex νi, and the length of the longest incoming edge to νi is denoted by Mi. c(νi) is the number of cycles (the number of paths that start and end at the same vertex) that include vertex νi; c(eij) is the number of cycles that include edge eij; Aji is the number of edges eij from vertex νj to vertex νi. Note that the dynamic system is causal if and only if the graph G(E, V) does not contain any cycle of length zero. Figure 1a shows the graph representation of model 2.2.

3.2 Computation of the Order of the Model. The first step in the determination of the canonical form of the model is to find which variables of the model will give rise to state variables, that is, which will appear as components of the state vector z(n) (for instance, in the above example, z(n) = [x1(n), x1(n − 1), x3(n), x3(n − 1)]^T; only x1 and x3 give rise to state variables). Therefore, we want to reduce the initial graph G0 of the model to a simpler graph G1, which contains only the vertices that give rise to state variables (vertices ν1 and ν3 in the above example) and has the same number of state variables (but not necessarily the same state variables) as the model described by G0. From this simplified graph we will be able to compute the order of the model. The simplifications of graph G0 are based on the following considerations:

1. Assume that a model has two state variables zi(n) and zj(n), which are such that zi(n + 1) depends on zj(n), but zj(n + 1) does not depend on zi(n):

zi(n + 1) = ϕi[{zm(n), m = 1, . . . , ν}, {ul(n)}]
zj(n + 1) = ϕj[{zk(n), k = 1, . . . , ν, k ≠ i}, {ul(n)}].

In terms of the graph representation, edge eji is not within a cycle. zj(n) can be substituted into zi(n + 1) without changing the order of the model. Then zi(n + 1) does not depend explicitly on zj(n), and zj(n + 1) does not depend explicitly on zi(n). As a consequence, edges of the graph that are not within cycles are irrelevant for the computation of the order of the network.
Figure 1: (a) Graph representation G0 of the model described by equations 2.2. (b) Graph G1. Following standard notations, the vertices are numbered and the lengths of the edges are written in the squares. The transformations of G0 to G1 are described in detail in appendix A.
2. Assume that a vertex i in G0 represents an equation of the form xi(n + 1) = Ψi({xj(n + 1)}, {ul(n − τil,k + 1)}). Then this relation is static, and thus irrelevant for the determination of the state vector.

3. Assume that one has

xm(n + 1) = Ψm(xj(n − τ), {ul(n − τml,k + 1)}),
xj(n + 1) = Ψj(xi(n − τ′), {ul(n − τjl,k + 1)}).

xm depends on the external inputs and on xj only, and xj depends on the external inputs and on xi only. In addition, suppose that no variable other than xm depends on xj; then variable xj can be deleted from the model by substitution into xm without changing the order of the model.

4. It has been proved (Nerrand et al., 1993) that the order ν of the model represented by a graph G1 is given by ν = Σi ωi, where, for all νi ∈ G1,

ωi = Mi − min_{eji ∈ {Ri}}(Mj − τji)   if Mi − min_{eji ∈ {Ri}}(Mj − τji) > 0,
ωi = 0   otherwise.
Assume that two parallel edges, incoming to vertex νi, exist. Since the computation of ωi involves the length of the longest incoming edge, only the larger delay is relevant. Now assume that two parallel edges exist, outgoing from vertex νi, with delays τji,1 and τji,2, τji,1 > τji,2. Then Mj − τji,1 < Mj − τji,2, so that min(Mj − τji) = Mj − τji,1. Thus, when several parallel edges exist, only the edge of maximal length is relevant for the computation of the order of the model.

These remarks result in the following procedure for determining graph G1 from the initial graph G0 of the model; G denotes the current state of the graph before and/or after the considered transformation is performed. The tools necessary for performing these transformations, and their computational complexity, are described in appendix A.

Step 1: Delete all edges of G0 that do not belong to any cycle,

G ← G0 − {eij | c(eij) = 0},

and delete all isolated vertices (vertices without incoming or outgoing edges) that may result. This transformation stems from remark 1 above. Note that none of the subsequent transformations, described in step 2, can generate an edge that does not belong to a cycle. Therefore, there is no need to iterate back to this step once the transformations of step 2 have been performed.
In principle, this transformation is sufficient for the determination of the order. The transformations of step 2 result in further simplifications of the graph, and hence of the computations using the information in the graph. We show in appendix A that this transformation can be performed in polynomial time.

Step 2: Iterate until no change is possible:

a. Delete vertices whose incoming edges are all of zero length, and recombine their incoming and outgoing edges:

∀νj, e^0_ji, e^τ_kj | Mj = 0:   G ← G − νj − e^0_ji − e^τ_kj + e^τ_ki.

This transformation stems from remark 2 above: vertex νj is deleted, and each pair of edges (e^0_ji, e^τ_kj) is replaced by an edge e^τ_ki from νi to νk with length τ.

b. Iterate until no change is possible: If a vertex has one incoming edge only (or one set of parallel incoming edges only) and one outgoing edge only (or one set of parallel outgoing edges only), delete the vertex. If there is a single incoming and a single outgoing edge, merge the edges into a single edge whose length is the sum of the lengths of the merged edges. If there is a set of parallel incoming edges and a set of parallel outgoing edges, merge each pair of one incoming and one outgoing edge into a single edge whose length is the sum of the lengths of the merged edges:

∀νj, e^τ1_ij, e^τ2_jk | Aji ≥ 1, Ajl = 0 ∀l ≠ i, Akj ≥ 1, Alj = 0 ∀l ≠ k:   G ← G − νj − e^τ1_ij − e^τ2_jk + e^(τ1+τ2)_ik.

This transformation stems from remark 3 above.

c. Iterate until no change is possible: If several parallel edges between two vertices exist, delete all but the edge of maximum length:

∀νj, e^τ1_ij, e^τ2_ij:   G ← G − e^min(τ1,τ2)_ij.

This transformation stems from remark 4 above. For each pair of parallel edges, the edge of minimum length is deleted, until only one edge remains. When no further change is possible, the resulting graph G1 may be a nonconnected graph.
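As an illustration (ours, not from the article), the order formula of remark 4 can be evaluated directly on the reduced graph G1 of the didactic example (see Figure 1b); the graph reduction itself is assumed to have been carried out already:

# Edges of G1 for the didactic example, stored as (source, target, length).
edges = [(1, 1, 2), (3, 3, 2), (3, 1, 2), (1, 3, 2)]
vertices = {v for s, t, _ in edges for v in (s, t)}

# M_i: maximal length of the incoming edges of vertex i.
M = {i: max(tau for s, t, tau in edges if t == i) for i in vertices}

# omega_i = max(0, M_i - min over outgoing edges (i -> j) of (M_j - tau)).
omega = {}
for i in vertices:
    out = [M[t] - tau for s, t, tau in edges if s == i]
    omega[i] = max(0, M[i] - min(out)) if out else 0

nu = sum(omega.values())
print(M, omega, nu)   # M = {1: 2, 3: 2}, omega = {1: 2, 3: 2}, nu = 4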
The state equations of the model described by graph G1 are of the form

z1(n) ≡ x1(n) = ψ1[{zj∈P1(n − 1)}]
z2(n) = z1(n − 1)
· · ·
zω1(n) = zω1−1(n − 1)
zω1+1(n) = ψ2[{zj∈P2(n − 1)}]
· · ·
zω1+ω2(n) = zω1+ω2−1(n − 1)
· · ·
zν−ωNV+1(n) = ψNV[{zj∈PNV(n − 1)}]
· · ·
zν(n) = zν−1(n − 1),

where NV is the number of vertices in G1 and Pi is the set of edges incoming to vertex νi.

Figure 1b shows the graph G1 derived from the graph G0 of Figure 1a. Following the above procedure, edge e^2_14 and the output edge from vertex 3 are deleted, since they do not belong to any cycle. Then vertex 2 is deleted, since all its incoming edges have zero length, and edge e^2_13 and two edges e^2_31 and e^1_31 are generated. Finally, parallel edges are deleted iteratively until only the edges of maximum length are left. The order of the model is easily derived: one has M1 = 2, M3 = 2, ω1 = 2, ω3 = 2, and hence ν = 4.

3.3 Determination of a State Vector. The order of the model having been computed as shown above, we are looking for a state vector z(n), of dimension ν, such that z(n + 1) = ϕ[z(n), u(n)], of the form

z(n) = [x1(n − k1) . . . x1(n − k1 − w1 + 1) x2(n − k2) . . . x2(n − k2 − w2 + 1) . . . xNV(n − kNV − wNV + 1)]^T,

where ki and wi are nonnegative integers. wi is the number of occurrences of the variable xi in the state vector. If wi = 0, then the variable xi of the model is not a state variable, and the corresponding ki is irrelevant; otherwise, ki denotes the lag of the most recent occurrence of variable xi in the state vector z(n). The wi's must comply with the following constraint:

Σi wi = ν.
In the canonical form, the lag between two successive state variables arising from the same vertex is equal to one. Note that several equivalent canonical representations exist: the wi ’s may be different from the ωi ’s, the only constraint being that the sum of the wi ’s must be equal to the order ν.
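For the didactic example, the resulting canonical form can be written down directly. The sketch below (illustrative Python; the tanh maps are arbitrary stand-ins for the unknown composite functions obtained after eliminating x2) implements one update of z(n) = [x1(n), x1(n − 1), x3(n), x3(n − 1)]:

import numpy as np

def canonical_step(z, u, phi1, phi3):
    # One update of the canonical form of model 2.2, with state
    # z = [x1(n), x1(n-1), x3(n), x3(n-1)] and input u = u4(n-1).
    x1, x1p, x3, x3p = z
    x1_next = phi1(x1, x1p, x3p, u)   # x1(n+1) does not depend on x3(n)
    x3_next = phi3(x1, x1p, x3, x3p)
    return np.array([x1_next, x1, x3_next, x3]), x3_next  # y(n+1) = x3(n+1)

phi1 = lambda x1, x1p, x3p, u: np.tanh(x1 - 0.5 * x1p + 0.3 * x3p + u)
phi3 = lambda x1, x1p, x3, x3p: np.tanh(0.2 * x1p + x3 - 0.4 * x3p)
z = np.zeros(4)
for n in range(5):
    z, y = canonical_step(z, 1.0, phi1, phi3)
print(np.round(z, 3), round(float(y), 3))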
Thus, one must find a set of 2NV integers {ki, wi}. In order to do this, we first derive a new graph, termed the graph of time constraints, which accounts for the time constraints that exist between the state variables. We subsequently derive the state vector itself.

3.3.1 Determination of the Graph of Time Constraints. The graph of time constraints G2 is derived from the model graph G0 by deleting all vertices and edges that are not significant with respect to the time constraints that the state variables must satisfy. The main difference between G2 and G1 is the fact that, in order to take the time constraints into account, edges that are not within cycles (and thus are not relevant to the determination of the number of state variables) but express a relation between cycles should be kept, because they are relevant to the choice of the state variables.

Step 1. Initialize G to G0. Iterate until no change is possible:

a. Delete all vertices whose incoming edges are all of zero length, and recombine their incoming and outgoing edges:

∀νj, e^0_ji, e^τ_kj | Mj = 0:   G ← G − νj − e^0_ji − e^τ_kj + e^τ_ki.

b. Iterate until no change is possible: If a vertex has one incoming edge only (or one set of parallel incoming edges only) and one outgoing edge only (or one set of parallel outgoing edges only), delete the vertex and merge the edges into a single edge whose length is the sum of the lengths of the merged edges:

∀νj, e^τ1_ij, e^τ2_jk | Aji ≥ 1, Ajl = 0 ∀l ≠ i, Akj ≥ 1, Alj = 0 ∀l ≠ k:   G ← G − νj − e^τ1_ij − e^τ2_jk + e^(τ1+τ2)_ik.

c. Iterate until no change is possible: If several parallel edges between two vertices exist, delete all but the edge of maximum length:

∀νj, e^τ1_ij, e^τ2_ij:   G ← G − e^min(τ1,τ2)_ij.

Step 2. Delete all edges that do not have both of their vertices belonging to at least one cycle:

∀eij ∈ G | c(νi) = 0 or c(νj) = 0:   G ← G − eij.
The reason is that we are interested in time constraints between state variables only, and we know that state variables arise only from vertices within cycles. The variables of the model arising from the vertices of the resulting graph G2 are the state variables; thus, two integers ki and wi (ki ≥ 0, wi > 0) are
Nonlinear Discrete-Time Models
143
associated with each vertex νi ; the computation of this set of integers is the final step of the determination of the state vector. 3.3.2 Determination of the State Vector. We denote by NE the number of edges in the graph of time constraints. Consider an edge eji of G2 of length τji (see Figure 2a). From the very definition of the state vector and from the construction of the graph of time constraints, it must be possible to compute xj (n − kj + 1) from one of the state variables, arising from vertex νi , which are available at vertex νj at time n − kj + 1; these variables must have been computed at vertex νi at time n−kj +1−τji . Therefore, the following relations must hold if τji 6= 0: n − ki − wi + 1 + τji ≤ n − kj + 1 ≤ n − ki + τji or equivalently kj − wi + τji ≤ ki ≤ kj + τji − 1.
(3.1)
Hence, a set of 2NE such inequalities with 2NV integer variables must be satisfied. Thus, the problem of finding the state variables and the state equations is amenable to the following linear optimization problem in integer numbers: P find the set of integers {wi } such that i wi is minimum (since the state vector is the smallest set of variables that describe the model), under the set of constraints expressed by the inequalities in equation 3.1. In addition, the value of the minimum is known to be equal to ν, whose value is derived as shown in section 3.2. Note that there is a trivial solution to the set P of inequalities in equation 3.1: ki = 0, wi = maxj τji . This solution is valid if i wi = ν. Otherwise, a solution that satisfies all constraints can be found by linear optimization methods, such P as the simplex (Dantzig, 1963).PThe minimized objective function is i wi and at least one solution with i wi = ν is known to exist. We prove in appendix C that the algorithm (Kuenzi, Tzschach, & Zehnder, 1971; Press, Teukolsky, Vetterling, & Flannery, 1992) converges to a solution with integer values, which is precisely what is needed. The solution may not be unique. Once the pairs {ki , wi } have been determined, the canonical network can be constructed. Note that the effect of merging parallel edges into a maximal delay edge, in step 1c of the derivation of G2 , eliminates possible singular situations, in which two or more inequalities cannot be simultaneously satisfied: consider two parallel edges of delays τji,1 , τji,2 ; then two equations (2.2) should be satisfied simultaneously: kj − wi + τji,1 ≤ ki ≤ kj + τji,1 − 1
and
kj − wi + τji,2 ≤ ki ≤ kj + τji,2 − 1
If |τji,1 −τji,2 | > wi −1, this is impossible; therefore, vertex νi will be duplicated in the canonical form. The choice of assigning the largest delay to merged parallel edges guarantees the feasibility of the canonical form.
144
G´erard Dreyfus and Yizhak Idan
(a)
(b)
Figure 2: (a) The dependence of state variable xj (represented by vertex νj ) on state variable xi (represented by vertex νj ) due to the edge eτij . (b) The striped zones represent the time spans of the variables xi and xj .
4 Examples The following examples illustrate the application of the proposed method. 4.1 The Didactic Example. We derived from Figure 1 the order of the model described by equations 2.2. The graph of time constraints G2 is identical to G1 . Running the simplex algorithm in this case is useless. From symmetry considerations, and knowing that the order is 4, it is clear that the state vector is z(n) = [x1 (n) x1 (n − 1) x3 (n) x3 (n − 1)]T . From the definition of the canonical form, the state variables and the external inputs are the inputs of its feedforward part. Therefore, in order to find this feedforward part, the variables are “backtracked” from the outputs to the inputs (see Figure 3). The feedforward part of the canonical form computes x1 (n + 1) from the external output and from the components of z(n). We see from G0
Nonlinear Discrete-Time Models
145
Figure 3: The canonical form of the model described by equations 2.2. 1 stands for unit delays.
that x1 (n+1) is computed from x1 (n−1), x1 (n), x3 (n−1), x2 (n−1), u4 (n−1). The first three quantities are state variables, so that a direct connection is made between these inputs and vertex 1; u4 (n − 1) will be an input of the feedforward part of the canonical form; x2 (n − 1) is not present in the input, so that vertex 2 is added. It computes x2 (n − 1) from x3 (n − 1) and x1 (n − 1), which are state variables. Thus connections are made from these inputs to vertex 2; this completes this part of the graph. The part of the graph that computes x3 (n + 1) is similarly derived, requiring the replication of vertex 2 because of the two parallel edges between vertices 2 and 3 in G0 . The presence of intermediate neurons 2 expresses graphically the fact that the initial model can be put into a canonical form by simply substituting x2 into the expressions of x1 and x3 . Note that this network has shared weights: the weights of the inputs of neurons 2. The canonical form of the equations of the model is:
146
G´erard Dreyfus and Yizhak Idan
z1 (n + 1) z2 (n + 1) z (n + 1) 3 z4 (n + 1)
≡ = = = =
x1 (n + 1) ψ1 [z1 (n), z2 (n), ψ2 [z2 (n), z4 (n)], z4 (n), u4 (n − 1)] z1 (n) ψ3 [ψ2 [z1 (n), z3 (n)], ψ2 [z2 (n), z4 (n)], z2 (n), z3 (n), z4 (n)] z3 (n)
4.2 An Example Where the Trivial Solution Is Valid. We consider now the model whose graph G0 is shown in Figure 4a. It is made of two cascaded parts. G1 and G2 are shown on Figures 4b and 4c. Clearly, the input edges e16 , e37 and the output edge from vertex 5 are not within cycles; for simplicity, we delete them right away. The mathematical details of the subsequent transformations are given in appendix B. The first step of the transformation of the model graph to G1 is the deletion of edge e642 , which does not belong to any cycle. Vertex 5 has one incoming edge (e354 ) and one outgoing edge (e245 ), which can be merged into a single edge e544 ; similarly, vertex 3 has one incoming edge (e232 ) and one outgoing edge (e313 ), which can be merged into a single edge e512 ; in the resulting graph, vertex 1 has only one set of incoming edges (e412 and e512 ) and one outgoing edge (e221 ), which can be merged into two edges e622 and e722 . Finally, parallel edges are merged, leading to graph G1 , which has two disconnected nodes with one self-loop each. The order ν of graph G1 is easily derived: ν = 12. The graph of time constraints G2 is similarly derived: edges e245 , e354 , e232 , and e313 are deleted; edges e544 and e512 are created. Then edges e512 , e412 , and e221 are deleted, and edges e722 and e622 are generated. Finally, parallel edges are merged. Both ends of edge e642 belong to cycles; thus this edge is kept, resulting in graph P G2 . In this case, the trivial solution (wi = maxj τji ; ki = 0) is valid since j maxj τji = ν. This solution thus defines the state vector: z(n) = [x2 (n) x2 (n − 1) . . . x2 (n − 6) x4 (n) x4 (n − 1) . . . x4 (n − 4)]T . The corresponding canonical form is shown in Figure 5. Once this form has been derived, all the weights, either associated with the connections shown or embedded in one (or more) feedforward network represented by one (or more) vertex of the graph, can be estimated by training with algorithms using backpropagation through time for computing the gradient of the cost function. 4.3 An Example Where the Trivial Solution Is Not Valid. Now the output of the previous model is fed back to one of its inputs with zero delay. Thus, the only difference between the network described in Figure 4 and that described in Figure 6a is the addition of edge e035 . Since all vertices belong to at least one cycle, no graph simplification P is possible: G1 and G2 are identical to the original graph. The order is i ωi = 14, where: ω1 = 4 − min(2 − 2) = 4; ω2 = 2 − min (2 − 2, 6 − 6) = 2; e21
e32 ,e42
Nonlinear Discrete-Time Models
147
Figure 4: Left column, from top to bottom: Graph G after step 1; graph G after the first iteration of step 2a; graph G after the second iteration of step 2a; graph G1 (after step 2a). Right column, from top to bottom: graph G after step 1a; graph G after the first iteration of step 1b; graph G after the second iteration of step 1b; graph G2 (after step 1c). Details of the transformations are given in appendix B.
148
G´erard Dreyfus and Yizhak Idan
Figure 5: Canonical form of the model shown in Figure 4.
ω3 = 2 − min(4 − 3) = 1 e13
ω4 = 6 − min(3 − 3) = 6; ω5 = 3 − min (2 − 0, 6 − 2) = 1 e54
e35 ,e45
The trivial solution, which leads to a model of order 16, is thus not valid. Simplex optimization gives k1 = 4, w1 = 1; k2 = 3, w2 = 5; k3 = 6, w3 = 1; k4 = 2, w4 = 2; k5 = 1; w5 = 5; the state vector is z(n) = [x1 (n − 4) x2 (n − 3) x2 (n − 4) . . . x2 (n − 7) x3 (n − 6) x4 (n − 2) x4 (n − 3) x5 (n − 1) . . . x5 (n − 5)]T . The canonical form is shown in Figure 6b. 4.4 An Example with Replicated Vertices and Shared Weights. As a final example, we consider a model whose canonical form makes extensive
Nonlinear Discrete-Time Models
149
(a)
(b)
Figure 6: (a) Graph of the model of Figure 4, with feedback; G1 and G2 are identical to G0 (see appendix B). (b) Canonical form. For clarity, all the recurrent connections are not shown.
use of duplicated vertices and shared weights. Its graph representation is shown in Figure 7a. In order to compute the order of this model, we derive the graph G1 of Figure 7b. The order of the network is 6 (ω1 = 1, ω2 = 2, ω3 = 1, ω5 = 2), but the trivial solution from the graph of time constraints shown in Figure 7c gives a solution of order 8 (w1 = w2 = w3 = w5 = 2), which is not optimal. Note that during the graph simplification, vertices 1 and 3 have parallel outgoing edges and corresponding constraints that cannot be simultaneously satisfied; thus, these vertices are replicated, and so are the weights
150
G´erard Dreyfus and Yizhak Idan
(a)
(b)
(c)
(d)
Figure 7: (a) Graph representation G0 of a model, with parallel edges. (b) Graph G1 . (c) Graph G2 . (d) Canonical form. Details of the transformations are given in appendix B.
Nonlinear Discrete-Time Models
151
corresponding to their inputs and outputs. These edges are e053 and e253 for vertex 3 (once vertex 4 is eliminated) and e021 and e221 for vertex 1. The solution found by the simplex method is: k1 = 2, w1 = 1; k2 = 1, w2 = 2; k3 = 1, w3 = 1; k5 = 0; w5 = 2. The canonical form of this network is shown in Figure 7d. It is exactly the canonical form that was derived manually in previous work (Nerrand et al., 1993). 5 Conclusions This work was motivated by the current interest in semiphysical modeling, be it “neural” or not. It is clear that black box modeling is wasteful of information when large, complex systems, such as arise in the process industry, for instance, are considered. The importance of canonical forms of models was recognized long ago in the field of linear automatic control. In nonlinear modeling, the problem of finding a canonical form for a model described by an arbitrary set of coupled nonlinear discrete-time equations is also important, but it is much more difficult. In addition, when neural networks are used for modeling, putting the network into a canonical form is mandatory for simplicity of implementation of the training algorithms. We have proposed a general procedure for finding the canonical form of a model described by a set of coupled nonlinear discrete-time equations. Knowledge-based neural modeling, which consists of building a neural net complying with the equations of the model, complementing it by black boxes that take into account the part of the dynamics not modeled by the initial set of equations, and subsequently training the network from measured data in order to estimate the unknown parameters, has been used successfully for complex industrial applications (Ploix & Dreyfus, 1997). This work is a step toward making the first part of this task (deriving a neural network from the equations of the model and putting it in canonical form before training) fully automatic. Appendix A In this appendix, we describe the computational tools that are necessary for the graph transformations and show that the latter have polynomial complexity. The basic tool is the adjacency matrix A of a graph G. Element Aij of matrix A is equal to the number of edges from vertex i to vertex j. The diagonal terms of the matrix denote the number of self-loops (cycles of length 1). We consider the adjacency matrix A0 of graph G0 . We define the matrix A1 as A1 = sign(A0 ) : (A1 )ij = 1 if (A0 )ij > 0, (A1 )ij = 0 if (A0 )ij = 0.
152
G´erard Dreyfus and Yizhak Idan
A.1 Transformation of Graph G0 to Graph G1 , Step 1. The first step in the transformation of G0 to G1 consists of finding and deleting all edges that do not belong to any cycle. Therefore, for each edge eji , one has to find whether a path from j to i exists in the graph. In this task, the actual length of the edges (the delay associated to each edge) is irrelevant; therefore, we consider that all edges of graph G have unit length; the length of a path in the graph is thus equal to the number of edges of the path. We make use of the following result: Consider the sequence of matrices {An } defined by An = sign(A1 An−1 ), n ≥ 2 : (An )ij = 1 if and only if there exists at least one path from i to j in the graph. This is easily proved by recursion. Consider matrix A21 : (A21 )ij =
X (A1 )ik (A1 )kj . k
Each path of length 2 from i to j contributes a nonzero term to the righthand side of the above relation; hence (A21 )ij is equal to the number of paths of length 2 from vertex i to vertex j. We define A2 = sign(A21 ): a nonzero element (A2 )ij denotes the existence of at least one path of length 2 from vertex i to vertex j in the graph. Similarly, we define matrix An whose element (An )ij is equal to 1 if and only if there is at least one path of length n from i to j in the graph and is equal to 0 otherwise. It is easy to show, as before, that matrix An+1 = sign(A1 An ) has the following property: (An+1 )ij = 1 if and only if there is at least one path of length n + 1 from i to j. The longest simple path (the longest path that contains no cycle) in a directed graph of N vertices is of length N. Therefore, in the worst case, the construction of the sequence of matrices {An } is terminated when n = N, thus in polynomial time. Actually, the sequence will frequently terminate when An+1 = An with n < N, as shown in the examples below. Finally, consider matrix à ∗
A = sign
X
! An .
n=1
Element (A∗ )ij is equal to 1 if and only if there is at least one path in the graph from i to j. A diagonal element (A∗ )ii is equal to 1 if and only if vertex i belongs to at least one cycle. To summarize, in step 1 of the transformation of graph G0 to graph G1 , consider each pair of vertices (i, j): if (A0 )ij 6= 0, there is at least one edge between i and j; if (A0 )ij 6= 0 and (A∗ )ji = 0, edges ejiτ are deleted ((A0 )ij is set to zero); otherwise, edges ejiτ are kept in the graph.
Nonlinear Discrete-Time Models
153
A.2 Transformation of graph G0 to graph G1 step 2. a. If a vertex j has all incoming edges of length 0, all elements of row j and column j of the adjacency matrix A of the current graph G are set to zero. For each pair of elements (Aij , Ajk , j 6= k) set to zero, add Ajk to Aik . b. Consider two adjacent edges eij and ejk ; they are both within at least one cycle (otherwise they would have been deleted at the previous step). Vertex j belongs exclusively to this cycle (or set of cycles) if and only if there is no incoming edge to j from vertices other than k and no outgoing edge from j to vertices other than i. Thus, one must have Akj ≥ 1, Alj = 0 for all l 6= k, and Aji ≥ 1, Ajl = 0 for all l 6= i. There must be one and only one nonzero off-diagonal element in row j, there must be one and only one nonzero off-diagonal element in column j, and both Aii and Ajj must be equal to zero. Elements Akj and Aji are then set to zero, and Aki is increased by Akj Aji . c. Compute sign(A) and appropriately update the length of the connections. A.3 Transformation of Graph G0 to Graph G2 . The steps are formally the same as step 2 of the transformation of G0 to G1 , but they do not act on the same initial graph. In the transformation of G0 to G1 , these steps are performed on G0 deprived of the edges that are not within cycles; in the transformation of G0 to G2 , they are performed on G0 itself. In step 2 of the transformation from G0 to G2 , one has to find and delete all edges that do not have both vertices belonging to at least one cycle. Consider each edge in turn. If (A0 )ij = 0, there is no edge for consideration; if (A0 )ij 6= 0 and (A∗ )ii = (A∗ )jj = 1, the edge must be kept in G2 ; otherwise it is deleted. Keeping track of the edge lengths during these transformations is very simple. To illustrate sections A.1 to A.3, we consider the model whose graph G0 is shown in Figure 1a. Derivation of G1 , step 1: 2 1 A0 = 1 1
The adjacency matrix of graph G0 is: 1 1 0 0 2 0 therefore 1 2 0 0 0 0
1 1 A1 = sign(A0 ) = 1 1
1 0 1 0
1 1 1 0
0 0 . 0 0
154
G´erard Dreyfus and Yizhak Idan
3 2 A21 = 3 1
2 2 2 1
3 2 3 1
0 0 therefore 0 0
1 1 2 A2 = sign(A1 ) = 1 1
3 2 A1 A2 = 3 1
3 2 3 1
3 2 3 1
1 1 1 1
1 1 1 1
0 0 . 0 0
0 0 therefore 0 0
1 1 A3 = sign(A1 A2 ) = 1 1
1 1 1 1
1 1 1 1
0 0 . 0 0
Since A3 is identical to A2 the sequence construction stops at this point: 1 1 1 0 Ã ! 2 X 1 1 1 0 An = A∗ = sign 1 1 1 0 ; n=1 1 1 1 0
A∗T
1 1 = 1 0
1 1 1 0
1 1 1 0
1 1 . 1 0
Considering each nonzero element of A1 and the corresponding element of A∗T , it is easily seen that all edges belong to at least one cycle, except for edge e14 , which is therefore deleted. At this step, the adjacency matrix of the current graph G is: 2 1 1 0 1 0 2 0 A= 1 1 2 0 . 0 0 0 0 Derivation of G1 , step 2: a. Since vertex 2 has two incoming edges, both of zero length, A12 and A32 are set to zero, and edges e213 , e131 , and e231 are
Nonlinear Discrete-Time Models
155
generated, resulting in a new adjacency matrix: 2 0 3 0 0 0 0 0 A= 2 0 2 0 . 0 0 0 0 b. In this matrix, no vertex is such that both the corresponding row and the corresponding column have one and only one nonzero element. Therefore, no simplification can be performed at this step. c. Finally, matrix A shows that parallel edges exist as selfloops around vertices 1 and 3 and for edges e31 and e13 . All but the edges of maximal length are deleted. No further iteration is necessary. Thus, the adjacency matrix for G1 is: 1 0 1 0 0 0 0 0 A= 1 0 1 0 . 0 0 0 0 Graph G1 is shown in Figure 1b. For the derivation of G2 , we start with matrix A0 : 2 1 1 0 1 0 2 0 A= 1 1 2 0 0 0 0 0 Step 1:
a. Vertex 2 is deleted as in step 2a for the derivation of G1 , leading to the matrix: 2 0 3 0 0 0 0 0 A= 2 0 2 0 . 1 0 0 0 b. There is no transformation. c. Parallel edges are deleted except for the edges of maximal length: 1 0 1 0 0 0 0 0 A= 1 0 1 0 . 1 0 0 0
Step 2: Edge e41 has only vertex 1 belonging to a cycle; hence this edge is deleted, leading to an adjacency matrix for G2 , which is identical to that of G1 .
156
G´erard Dreyfus and Yizhak Idan
Appendix B This appendix describes the computations of the graph transformations illustrated by Figures 4, 6, and 7. B.1 Figure 4. It is clear by inspection that edges e16 and e37 are irrelevant for the determination of the order and the state variables. In order to simplify the notations, we overlook these edges from the beginning, thus considering only vertices 1 to 5. In a computer implementation of the procedures, these edges are deleted in step 1 of the transformation of G0 to G1 , and they are deleted in step 2 of the transformation of G0 to G2 . For the derivation of G1 : Step 1
0 1 A0 = 1 0 0 1 1 A3 = 1 0 0 1 1 A4 = 1 0 0 1 1 A∗T = 1 1 1 Thus, edge e42 0 1 A= 1 0 0 Step 2
1 1 0 0 0
0 1 0 0 0
0 1 0 1 1
1 1 1 0 0
1 1 1 0 0
1 1 1 1 1
1 1 1 0 0
1 1 1 0 0
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
0 0 0 1 1
0 1 1 1 1 0 1 1 1 1 1 0 0 1 0 0 0 ; 0 = A ; A = ; A 1 0 2 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ∗ 1 ; A5 = A4 ; A = 1 1 1 1 1 ; 1 0 0 0 1 1 1 0 0 0 1 1 0 0 0 . 1 1
is deleted, hence the current adjacency matrix: 1 0 0 0 1 1 0 0 0 0 0 0 . 0 0 1 1 0 0 1 0
a. No transformation.
Nonlinear Discrete-Time Models
157
b. Row 5 and column 5 have one and only one nonzero, offdiagonal element (A45 and A54 ). The same is true for row and column 3 (A23 and A31 ). Hence edges e45 , e54 , e13 , and e32 are deleted; one edge e44 and one edge e12 are generated: 0 1 0 0 0 2 1 0 0 0 A= 0 0 0 0 0 . 0 0 0 2 0 0 0 0 0 0 Row 1 and column 1 have one and only one nonzero, nondiagonal element (A12 and A21 ); hence edges e21 and e12 are deleted, and two edges from 2 to 2 are generated: 0 0 0 0 0 0 3 0 0 0 A= 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 c. Parallel edges are merged, resulting in graph G1 . For the derivation of G2 : Step 1:
a. No transformation. b. As in step 2 of the derivation of G1 , edges e45 , e54 , e13 , and e32 are deleted, edge e44 and edge e12 are generated: 0 1 0 0 0 2 1 0 1 0 A= 0 0 0 0 0 . 0 0 0 2 0 0 0 0 0 0 Edges e21 , e412 , and e512 to 2 are generated: 0 0 0 0 3 0 A= 0 0 0 0 0 0 0 0 0
are deleted, and two edges from 2 0 1 0 2 0
0 0 0 . 0 0
c. Parallel edges are merged. Step 2: Since both ends of edge e42 are within cycles (A∗22 and A∗44 are both nonzero), this edge is kept in G2 .
158
G´erard Dreyfus and Yizhak Idan
B.2 Figure 6. 0 1 1 1 A0 = 1 0 0 0 0 0 1 1 1 1 A3 = 1 1 1 0 0 1 1 1 1 1 A4 = 1 1 1 1 1 1
0 1 0 0 1
0 1 0 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
0 1 1 0 0 ; A1 = A0 ; A2 = 0 0 1 0 1 0 1 0 1 1 1 1 ∗ 1 ; A = A4 1 1
1 1 1 0 0
1 1 0 1 0
0 1 0 ; 1 1
1 1 0 1 1
All edges belong to at least one cycle, so that no simplification is possible. B.3 Figure 7. For the derivation of G1 : Step 1:
1 1 A0 = 0 0 0 1 1 A2 = 0 0 0 1 1 A3 = 1 0 0 1 1 A∗T = 1 1 1
2 0 1 0 0
0 1 1 0 0
0 0 1 0 0
1 1 1 0 0
1 1 1 0 0
0 1 1 0 0
1 1 1 0 0
1 1 1 0 0 1 1 1 1 1
1 1 1 0 0 1 1 1 1 1
0 1 1 0 1 ; A1 = 0 0 1 1 0 0 1 1 1 1
1 0 1 0 0
0 1 1 0 0
1 1 1 1 ∗ 1 ; A4 = A3 ; A = 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1
0 0 1 0 0
0 0 1 ; 1 1
1 1 1 0 0
1 1 1 0 0
1 1 1 0 0
1 1 1 ; 1 1
Nonlinear Discrete-Time Models
Edges e143 , e154 , and e053 1 2 0 1 0 1 A= 0 1 1 0 0 0 0 0 0 Step 2:
159
are deleted: 0 0 0 0 0 0 . 0 0 0 1
a. No transformation. b. No transformation. c. Parallel edges e021 and e221 are merged into e221 .
For the derivation of G2 : Step 1:
a. No transformation. b. Since row 4 and column 4 of A0 have one and only one nonzero off-diagonal element, (A0 )34 and (A0 )45 are set to zero, and (A0 )35 is increased by 1, resulting in: 1 2 0 0 0 1 0 1 0 0 A= 0 1 1 0 2 0 0 0 0 2 0 0 0 0 1 c. Parallel edges e221 and e021 are merged into e221 ; parallel edges e053 and e253 are merged to e253 , resulting in 1 1 0 0 0 1 0 1 0 0 A= 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1
Step 2: Since all the relevant diagonal elements of A∗ are nonzero, no simplification is possible. The above matrix is the adjacency matrix of G2 . Appendix C We proved in section 3.3 that the problem of the determination of a state vector is amenable to the following linear optimization problem: Minimize the cost function NV X i=1
wi
160
G´erard Dreyfus and Yizhak Idan
subject to the constraints: wi > 0
∀i,
kj ≥ 0
∀j,
kj − ki − wi + τji ≤ 0 ki − kj − τji + 1 ≤ 0
∀τji ∀τji
where all variables ki and wi are integers and the constant terms τji are integers. In this appendix, we prove that the simplex algorithm provides an optimal solution in integer numbers. C.1 General Form of a Linear Optimization Problem. The general form of a linear optimization problem with N variables {xi } is the following: Find a vector [x1 , x2 , . . . , xN ]T that maximizes the cost function N X
fi xi
i=1
under the nonnegativity (hereafter termed primary) constraints xi ≥ 0 for all i,
(C.1)
and a set of C additional (hereafter termed secondary) constraints N X
aij xj ≤ bi (bi ≥ 0)
(C.2)
akj xj ≥ bk ≥ 0
(C.3)
alj xj = bl ≥ 0
(C.4)
j=1 N X j=1 N X j=1
In our problem, all constant terms {bi , i = 1 to C} are integers, and all coefficients {aij , i = 1 to C, j = 1 to N} are equal to −1, 0, or +1. It is desired to find an optimal solution where all variables {xi } are integers. We first recall an essential result of linear optimization theory. First, if an optimal vector exists and if N < C—that is, if its dimension is smaller than the number of secondary constraints (see equations C.2–C.4)—then there is an optimal vector whose components satisfy N out of C constraints
Nonlinear Discrete-Time Models
161
as equalities. Second, if an optimal vector exists and if N > C, then all C secondary constraints (equations C.2–C.4) are satisfied as equalities, and (N−C) primary constraints (see equation C.1) are also satisfied as equalities, so that the optimal vector has at least (N − C) components equal to zero. The simplex algorithm takes advantage of this result by searching an optimal vector among the vectors that satisfy N out of C secondary constraints as equalities if N < C, or that satisfy C secondary constraints and (N − C) primary constraints as equalities if N > C. We are going to prove that if the optimization problem has a solution, then the simplex algorithm provides a solution in integer numbers. The proof is organized in two steps. We first prove that if a linear optimization problem is in restricted normal form (to be defined below) with integer constant terms and coefficients equal to −1, 0, or +1, and if it has a solution, then a solution in integer numbers exists and is found by the simplex algorithm. In the second step, we show that a general linear optimization problem with integer constant terms and coefficients equal to −1, 0, or +1 is amenable to an equivalent linear optimization problem in restricted normal form with integer constant terms and coefficients equal to −1, 0, or +1. Since we know that the problem of the determination of the state vector has a solution, we conclude that the simplex algorithm provides a solution in integer numbers. C.2 Linear Optimization Problem in Restricted Normal Form. A linear optimization problem is said to be in restricted normal form if the only constraints are the N nonnegativity constraints (see equation C.1) and C equality constraints (see equation C.4) and if each equality constraint has at least one variable that has a positive coefficient and appears in one constraint only. These C variables {xi , i = 1 to C} are called basic variables, and the other (N − C) variables {xi , i = C + 1 to N} are called nonbasic variables. The C equality constraints can be solved for the basic variables, hence can be written in the form: xi = bi +
N X
aij xj
i = 1 to C.
(C.5)
j=C+1
All basic variables are on the left-hand side of equation C.5, and all nonbasic variables are on the right-hand side. In our problem, all bi ’s are integers and all aij ’s are equal to −1, 0, or +1. By setting all nonbasic variables to zero, one obtains from equation C.5 an initial vector, with at most C nonzero components (the basic variables), which are equal to the constant terms on the right-hand side of equation C.5, and at least (N − C) components equal to zero (the nonbasic variables). This vector satisfies all constraints as equalities but is not necessarily optimal. We know from the basic result recalled above that an optimal vector, if it exists, is to be found among the vectors that, like the initial vector, have at most C nonzero components and at least (N − C) components equal to zero.
162
G´erard Dreyfus and Yizhak Idan
Therefore, a new candidate vector can be obtained from the initial vector by turning one basic variable into a nonbasic variable and one nonbasic variable into a basic variable (how this can be done will be explained below). Assume that the nonbasic variable xm has been turned into a basic variable and that the basic variable xn has been turned into a nonbasic variable; then the constraints are in a form similar, and equivalent, to equation C.5, where xm now appears on the left-hand side of one of the equations and xn appears on the right-hand side of at least one of the equations. By setting all the new nonbasic variables to zero one obtains a new candidate vector whose component xn is equal to zero, and whose component xm may be nonzero. Yet another candidate vector can be derived from the candidate vector just derived by performing another exchange, and so on. Thus, the problem of finding an optimal vector can be regarded as a combinatorial problem. A brute-force procedure would consist of trying all possible sequences of exchanges of one basic variable for one nonbasic variable and selecting the optimal vector thus found. This would involve impractical computation times. The simplex algorithm is a very economical procedure that starts from the initial vector defined above and performs an appropriate sequence of exchanges of one nonbasic variable for one basic variable, maximizing the increase of the cost function at each exchange, until no further increase of the cost function is possible. The details of the algorithm (Press et al., 1992; Kuenzi et al., 1971)—how the decision is made to exchange a certain nonbasic variable for a certain basic variable at each step of the procedure—are irrelevant for this proof. The only important point is the following: assume that it is found desirable, at the first step of the procedure, to exchange the nonbasic variable xm for the basic variable xn ; the nth equation of C.5 can be solved for xm : xn = bn +
N X j=C+1
anj xj ⇒ xm = −
N X anj bn xn − xj + . anm a a nm nm j=C+1 j6=m
Then xm can be substituted into all the other equations of C.5, thereby making xn a nonbasic variable. In our problem, bn is an integer, and anm is equal to −1 or +1; therefore, the constant term in the expression of xm is an integer, and the coefficients of the nonbasic variables are equal to −1, 0, or +1; similarly, the constant terms in the other constraints after the exchange of xm and xn are integers, and the coefficients of the variables are equal to −1, 0, or +1. Clearly, the same result holds true after each exchange of the procedure. After the final exchange, which leads to the optimal combination of nonbasic and basic variables, the vector, which is found by setting the nonbasic variables to zero, has N − C components, which are equal to zero (the final nonbasic variables), and C components, which are integer numbers (the final basic variables), equal to the constant terms on the right-hand side of the constraints in their final form.
Nonlinear Discrete-Time Models
163
Thus, we have proved that if a linear optimization problem is in restricted normal form, if it has a solution, if all coefficients of the variables are equal to −1, 0, or +1, and if all constant terms of the constraints are integers, then an optimal solution in integer numbers will be found by the simplex algorithm. C.3 General Case. In the case of the determination of the order of a model, we know that an optimal solution exists, but the problem is not in restricted normal form, so that the above result is not directly applicable. We prove in the following that the problem is nevertheless amenable to an equivalent problem in restricted normal form with coefficients equal to −1, 0, or +1, and with integer constant terms, so that the result of the previous section holds in general. In order to do this, one first turns the inequality constraints (see equations C.1 and C.2) into equality constraints; this is achieved by introducing an additional nonnegative variable into each inequality constraint: N X
aij xj ≤ bi ,
bi > 0 →
j=1 N X
N X
aij xj − yi = bi ,
yi ≥ 0
(C.6)
j=1
akj xj ≥ bk ≥ 0 →
j=1
N X
akj xj + yk = bk ,
yk ≥ 0.
(C.7)
j=1
Note that the coefficients of the new variables are equal to −1 or +1. The second step of the transformation of the general form to the restricted normal form consists of adding a second set of additional variables {zi }, which casts equations C.6 and C.7 into the form equation C.5: zi = bi + yi −
N X
aij xj
j=1
zk = bk − yk −
N X
akj xj .
j=1
This set of equalities defines an optimization problem in restricted normal form, with integer constant terms and with coefficients equal to −1, 0, or +1. Any solution of this problem having all zi ’s equal to zero is a solution of the original problem; the simplex algorithm is organized in such a way that the solution found has all zi ’s equal to zero. Therefore the simplex algorithm finds a solution of the original problem in integer numbers, if a solution exists. In the case of the determination of the state vector, we know that a solution exists. Therefore, the simplex algorithm finds an optimal set of integers {ki , wi }.
164
G´erard Dreyfus and Yizhak Idan
Acknowledgments We are grateful to Brigitte Quenet for her critical reading of the manuscript. References Dantzig, G. B. (1963). Linear programming and extensions. Princeton, NJ: Princeton University Press. Kuenzi, H. P., Tzschach, H. G., & Zehnder, C. A. (1971). Numerical methods of mathematical optimization. San Diego: Academic Press. Lindskog, P. (1994). Algorithms and tools for system identification using prior knowledge. Unpublished thesis, Linkoping ¨ Studies in Science and Technology. Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., & Marcos, S. (1993). Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms, Neural Computation, 5, 165–197. Ploix J. L., & Dreyfus, G. (1997). Early fault detection in a distillation column: An industrial application of knowledge-based neural modeling. In B. Kappen (Ed.), Europe’s best neural networks practice. Berlin: Springer-Verlag. Ploix J. L., & Dreyfus, G. (1996). Knowledge-based neural modeling: Principles and industrial applications. In F. Fogelman & P. Gallinari (Eds.), Industrial applications of neural networks. Singapore: World Scientific. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press. Thoma, J. (1991). Simulation by bond graphs. Berlin: Springer-Verlag. Voit, E. O. (1991). Canonical nonlinear modeling. New York: Van Nostrand Reinhold. Zbikowski, R., & Hunt, K. J. (Eds.) (1995). Neural adaptive control technology. Singapore: World Scientific. Received July 17, 1996; accepted March 11, 1997.
Communicated by Fernando Pineda
A Low-Sensitivity Recurrent Neural Network Andrew D. Back Brain Information Processing Group, Frontier Research Program, The Institute of Physical and Chemical Research (RIKEN), Saitama 351-01, Japan
Ah Chung Tsoi Faculty of Informatics, University of Wollongong, Wollongong, NSW 2522, Australia
The problem of high sensitivity in modeling is well known. Small perturbations in the model parameters may result in large, undesired changes in the model behavior. A number of authors have considered the issue of sensitivity in feedforward neural networks from a probabilistic perspective. Less attention has been given to such issues in recurrent neural networks. In this article, we present a new recurrent neural network architecture, that is capable of significantly improved parameter sensitivity properties compared to existing recurrent neural networks. The new recurrent neural network generalizes previous architectures by employing alternative discrete-time operators in place of the shift operator normally used. An analysis of the model demonstrates the existence of parameter sensitivity in recurrent neural networks and supports the proposed architecture. The new architecture performs significantly better than previous recurrent neural networks, as shown by a series of simple numerical experiments.
1 Introduction In order to apply neural networks to practical problems, it is useful to have an understanding of such issues as the model robustness and parameter sensitivity. In this article, the term robustness means that a model that undergoes small perturbations in the network weights maintains similar performance to what it had before the perturbations were introduced. Correspondingly, this implies that a low-sensitivity model is resilient to small perturbations in the weights. This issue is of particular importance in the following situations: • A model can be implemented using finite precision arithmetic (for example, in a low-cost embedded system). In this situation, it is imperative that a given model performs as required and not differ significantly from the theoretical design characteristics, even when implemented Neural Computation 10, 165–188 (1998)
c 1997 Massachusetts Institute of Technology °
166
Andrew D. Back and Ah Chung Tsoi
using finite precision arithmetic. If a model can use lower precision, it may be possible to produce it at a lower cost. • In time-series prediction, a small improvement in the accuracy of a model may lead to better long-term predictive performance. This is particularly evident in the prediction of chaotic time series (Back, Wan, Lawrence, & Tsoi, 1994). Lower-sensitivity models will tend to be more accurate due to their need for less precision to produce similar results to higher-sensitivity models. One way in which model sensitivity can be examined is in terms of the sensitivity of some characteristic behavior of the model with respect to its parameters. Suppose we define a measure of sensitivity Sj =
∂F ∂θj
(1.1)
where F is a measure of the behavior of the model and θj is the jth parameter. Ideally the set of parameter sensitivities, {S}, should be approximately constant for all θ. The reason is that if S becomes high for some θ values, then small perturbations around these particular θ values will cause a large change in the model behavior, which means that a greater dynamic range for θ is required. If the dynamic range is insufficient to cope, then there will be some round-off error introduced and a subsequent loss of accuracy (Gevers & Li, 1993; Orlandi & Martinelli, 1984). Worse still, instability and nonminimum phase behavior can result (Astrom, Hagander, & Sternby, 1984). Instances of when such perturbations may occur are learning, finite word-length (FWL) implementations, and additive noise in the model. Low-sensitivity models have been studied extensively in linear systems theory (Agarwal & Burrus, 1975; Astrom et al., 1984; Fan & Li, 1993; Gevers & Li, 1993; Goodwin, Middleton, & Poor, 1992; Gold & Radar, 1969; Heuberger, Bosgra, & Van den Hof, 1995; Mantey, 1968; Orlandi & Martinelli, 1984; Williamson, 1988), but are also well recognized within numerical analysis (Dahlquist & Bjorck, 1974; Goodwin et al., 1992). Hence, for practical applications it is well recognized that low-sensitivity models are important. Although this topic has been considered in linear systems, it also deserves consideration for practical neural network models (Puskorius & Feldkamp, 1992). Various authors have considered the issue of sensitivity to errors in the weights in feedforward neural networks (Alippi, Piuri, & Sami, 1995; Choi & Choi, 1992; Dundar ¨ & Rose, 1995; Kerlirzin & R´efr´egier, 1995; Oh & Lee, 1995; Orzechowski, Kumara, & Das, 1993; Stevenson, Winter, & Widrow, 1990; Xie & Jabri, 1992). Typically these analyses are based on probabilistic methods. Minai and Williams (1994) considered the issue of performance changes of a network due to perturbations in the individual output response of units within the network. Von Lehman Paek, Liao, Marrakchi, and Patel
A Low-Sensitivity Recurrent Neural Network
167
(1989) showed that weight discretization in a feedforward network resulted in very poor learning and performance. They observed that networks with fewer than 300 levels of quantization did not converge, but that adding noise to the network significantly improved the probability of convergence. The problem of robustness in recurrent neural networks has been investigated for the case of Hopfield-type networks (Wang & Michel, 1994). In this situation, Wang and Michel examined the performance of a network in terms of the distance between equilibrium points of perturbed and unperturbed associative memory networks. For recurrent neural networks in particular, discrete-time recurrent neural networks, which are used to give a dynamical output in response to some input, the issue of robustness and sensitivity does not appear to have been as well considered. Our contribution in this article is to examine the issue of sensitivity in recurrent neural networks when modeling nonlinear dynamical systems. In contrast to the previous probabilistic approaches, we consider the issue from the perspective of linear systems theory and indicate a new recurrent neural network architecture based on these considerations. In section 2 a review of sensitivity aspects of dynamic models is given. We show how the notion of alternative discrete-time operators can be used to reduce the sensitivity of dynamic models and thereby improve the robustness of models. In section 3 we briefly review the fully connected recurrent neural network architecture considered in this article. A new recurrent neural network architecture based on an implementation with alternative discrete-time operators is proposed. An analysis of recurrent neural networks’ parameter sensitivity is given in section 4. The results of numerical experiments comparing the previous and proposed networks are given in section 5, followed by conclusions in section 6. An online learning algorithm for the new low-sensitivity recurrent neural network is provided in the appendix. 2 Parameter Sensitivity in Linear Dynamic Models This section provides a brief review of some established principles pertaining to parameter sensitivity in linear models. It may be familiar to many readers, yet is included as background material for the subsequent sections. In signal processing, time-series analysis, and system identification, a commonly used model is y(t) = A(q)x(t) where A(q) =
n X i=0
ai q−i
(2.1)
168
Andrew D. Back and Ah Chung Tsoi
= q−n
n Y (q − λi )
(2.2)
i=1 4
qx(t) = x(t + 1).
(2.3)
A structure of this form is known as a finite impulse response (FIR) filter or moving average (MA) model. The variables {λi } are the roots of the characteristic equation A(q) = 0. These are also known as the zeros of the model A(q). If y(t) = x(t)/A(q), the model in this case is infinite impulse response (IIR) or autoregressive (AR) and the roots {λi } are known as poles. The sensitivity of this polynomial can be found using a first-order infinitesimal approximation to the derivative (Mantey, 1968) as follows. Define the sensitivity criterion Sij as Sij =
dλi . daj
(2.4)
Using the chain rule, it can be shown that (Gold & Radar, 1969) n−j
λi . k6=i (λi − λk )
Sij = Q
(2.5)
Note that, referring to equation 1.1, F = [λ1 · · · λn ]T . We note that the sensitivity is greater for roots that are close together (which will make the denominator small) and for those roots that have a large magnitude (which will make the numerator large).1 Models that have these two characteristics—poles that are close to one another and located in the vicinity of the point (1,0) on the complex plane—are commonly called lightly damped low-frequency (LDLF) models, and the region around the vicinity of (1,0) is known as the LDLF region. From equation 2.5, it is obvious that LDLF models have high sensitivity relative to models not classed as LDLF. When a finite word length is used with an LDLF model, the behavior of the model may be much different from what is expected (Astrom et al., 1984; Gevers & Li, 1993). In addition, for any given word length, the accuracy that can be expected when modeling systems with a high sensitivity will be reduced. The problem of modeling LDLF systems can arise more easily than might first be considered. For systems sampled at high speed, it has been shown that discrete-time models based on the shift operator q have poles and zeros that approach the point (1,0), regardless of the true underlying system (Astrom et al., 1984). 1 For a stable system, roots with magnitudes close to 1 will be considered large magnitude roots.
A Low-Sensitivity Recurrent Neural Network
169
Approaches open to the user to overcome such a problem are subsampling the given measurements (commonly known as decimation in signal processing; Rabiner & Gold, 1975) or filtering the underlying low-frequency signal components embedded in the measurements (Goodwin et al., 1992). In numerical analysis, these types of problems are regularly encountered. A technique for dealing with such situations is focused on performing some form of transformation on the data to improve the ill-conditioning (Dahlquist & Bjorck, 1974). This is equivalent to performing filtering on the signal and is the approach we consider in this article. These numerical problems can be significantly overcome by introducing algebraic manipulations to A(q). To overcome the problem of high sensitivity ˆ q) = A(q), where in the linear model above, we seek to find a polynomial A(ˆ ˆ q) = 1 + aˆ 1 qˆ −1 + · · · + aˆ n qˆ −n A(ˆ
(2.6)
with qˆ = g(q), where g(·) is a mapping such that Sˆ ij ≤ Sij
∀ i, j.
(2.7)
An operator g(q) that maps the LDLF zeros away from the sensitive region is ν = g(q) =
q−m . r
(2.8)
Here, a dilation 1/r and translation m/r allows the zeros near (1,0) in the q-plane to be mapped to a less sensitive region in the ν-plane2 (see Figure 1). Such mappings, termed alternative discrete-time operators (ADTOs), have been proposed in the literature (Agarwal & Burrus, 1975; Goodwin et al., 1992). Some of these operators are described below. The delta operator has been proposed for digital filtering and control, with successful results (Agarwal & Burrus, 1975; Goodwin et al., 1992). This operator is defined as δ=
q−1 , 1
2 In this article we use the nu operator (ν) notation to refer to a generic discrete-time operator (Back & Tsoi, 1994). Note that the situation normally considered in the digital ˆ q)/A(ˆ ˆ q) = filtering literature is the equivalence of rational function models, that is, B(ˆ B(q)/A(q), for which transformations between coefficients can be readily found (Gevers & Li, 1993). For the nonlinear models considered in this article, such transformations are not generally required (i.e., the models adapt and are not transformed from one model to another).
170
Andrew D. Back and Ah Chung Tsoi
x x x x λ = α +/− j β q-plane
λ = α − m +/- j β r ν-plane
Figure 1: A mapping that shifts the points around (1,0) in the q-plane to a less sensitive region in the ν-plane.
where 1 is the discrete-time sampling interval. In our experience however, we have found that other operators are capable of better performance (Back & Tsoi, 1994). These include the gamma operator, proposed independently of the delta operator (de Vries & Principe, 1992; Principe, de Vries, & de Oliveira, 1993); the rho operator proposed as an extension of the delta operator (Palaniswami, 1989); and the pi operator (Back & Tsoi, 1994).3 These operators are defined, respectively, as: γ =
q − (1 − c) c
ρ=
q − (1 − c1 1) c2 1
(2.10)
π =
2(c1 q − c2 ) 1(c3 q + c4 )
(2.11)
(2.9)
where c, c1 , c2 ∈ R are constants. These structures have shown particularly good performance when used to model systems with high-parameter sensitivity. That is, the models themselves have low sensitivity, even though the systems that they approximate may have a high sensitivity. Further work in the improvement of parameter sensitivity in linear systems has been considered using state-space techniques (see, e.g., Gevers & 3 This operator was also proposed independently in Gevers and Li (1993), where, in this case, c1 = c2 = c3 = c4 = 1.
A Low-Sensitivity Recurrent Neural Network
171
Li, 1993; Heuberger et al., 1995; Roberts & Mullis, 1987). Wahlberg (1991) has shown that the discrete-time Laguerre filter has good numerical sensitivity properties. In fact, the literature is quite extensive in concerning sensitivity analysis of linear models using various approaches (Agarwal & Burrus, 1975; Fan & Li, 1993; Goodwin et al., 1992; Palaniswami, 1989; Middleton & Goodwin, 1990). Our purpose here is not to review all of these approaches, but rather to consider the basic issue of parameter sensitivity and possible improvements that can be made based on fundamental properties of the architecture. Clearly, however, it would be relevant to consider more advanced issues such as state-space sensitivity analysis and generalized discrete-time operators, which have been considered in the linear systems theory literature (Gevers & Li, 1993; Heuberger et al., 1995; Roberts & Mullis, 1987). These issues will not be considered in this article, however. To consider the parameter sensitivity in a class of recurrent neural networks (Williams & Zipser, 1989), consider a nonlinear dynamical model, given by y(t) = h(x(t), y(t − 1)),
(2.12) 0
0
where x(t) = [x(t)x(t − 1) · · · x(t − m)] , y(t − 1) = [y(t)y(t − 1) · · · y(t − n)] , h(.) is a nonlinear function. We define 0
y(t − 1) = [1 q−1 , . . . , q−n ] y(t − 1).
(2.13)
In a manner analogous to linear systems, the shift operator is also used within this model structure. This raises some interesting questions: 1. Does the problem of parameter sensitivity occur in dynamic recurrent neural network models in a similar manner to linear systems, due to the use of the shift operator q? 2. If so, can alternative discrete-time operators be used to improve parameter sensitivity in recurrent neural networks ? These issues are examined in the following sections. 3 Recurrent Network Architectures 3.1 An RNN(q) Architecture. For convenience, we define a fully connected recurrent neural network (RNN) architecture here in the manner of Williams and Zipser (1989): ¡ ¢ yj (t) = f xˆj (t) xˆj (t) =
MT X i=1
wji zi (k)zi (t) =
(3.1) (
xi (t)
i∈I
yi (t − 1)
i∈U
(3.2)
172
Andrew D. Back and Ah Chung Tsoi
where xˆj (t) is the activation value of neuron j at time t, wji is the weight between unit i and unit j, and a sigmoidal activation function may be defined as f (x) = tanh(x). The sets of all indices for input and output units are defined as I and U , respectively. Each neuron i has an output zi (t). If we have only one input, then the cardinality of I is 1. Similarly, for one output, the cardinality of U is 1. There are Mi input units, Mo output units, and MT = Mo + Mi total units in the network. Note that for the general case, MT ≤ Mo + Mi since a unit may be both input and output. For convenience, we will denote this model as the RNN(q) model since it is inherently based on using the shift operator q within the network structure (see equation 3.2). 3.2 An RNN(ν) Architecture. In this section, we propose a fully recurrent neural network based on the ν-operator. Based on our previous experience with the use of alternative discrete-time operators (Back & Tsoi, 1994), we introduce this architecture as a candidate network, which may have reduced parameter sensitivity problems. We designate this architecture as RNN(ν) in comparison with the RNN(q) architecture described in section 3.1. The network is defined as follows: yj (t) = f (xˆj (t)) xˆj (t) =
MT X
(3.3)
wji zi (t)
(3.4)
i=1
( zi (t) =
xi (t)
i∈I
ν −1 yi (t)
i∈U
.
(3.5)
In conformity with what was considered in the previous section, it is possible to obtain a single-input–single-output model by the additional requirement that the cardinality of I and U is 1. The RNN(ν) architecture is a novel architecture in the sense that we have replaced all the delays q−1 in the RNN(q) architecture by ν −1 operators. This model generalizes the usual RNN(q) model so that in the proposed RNN, each connection consists of a low-pass filter. For example, using the gamma or rho operators, this is given by ν −1 =
θ1 q−1 , 1 − θ2 q−1
(3.6)
where θ1, θ2 are appropriately parameterized.4 An online learning algorithm that can be used to determine these parameters is given in the appendix. 4 For the gamma operator, θ = c, θ = 1 − c. For the rho operator, θ = c 1, θ = 1 2 1 2 2 1 − c1 1.
A Low-Sensitivity Recurrent Neural Network
173
An analysis of the parameter sensitivity problem in RNNs is given in the next section. 4 Parameter Sensitivity Analysis of the Recurrent Neural Network In this section, we analyze the parameter sensitivity of the RNN given in equations 3.1 and 3.2. The approach used here bears some similarity to other analysis methods such as perturbation analysis (Vidyasagar, 1978) in that the global behavior of the recurrent network due to some change in the neural network parameters is of particular interest and the analysis is performed by formulating an approximate model of the structure to be analyzed. Subsequently, we seek to determine the properties of the nonlinear model from this approximate model. Consider an arbitrary RNN structure, yp (t) = Fs (θ; xm (t), yp (t − 1)),
(4.1) 0
where yp (t) = [y(t)y(t − 1), . . . , y(t − n)] is the vector of outputs from the network Fs (·), s indicates a particular network structure, θ is the vector of network parameters, and xm (t) is the vector of inputs. Suppose we now have the following networks: yq (t) = Fsq (θq ; xm (t), yq (t − 1))
(4.2)
yˆ q (t) = Fsq (θˆq ; xm (t), yˆ q (t − 1))
(4.3)
yν (t) = Fsν (θν ; xm (t), yν (t − 1))
(4.4)
yˆ ν (t) = Fsν (θˆν ; xm (t), yˆ ν (t − 1))
(4.5)
where sq represents the RNN(q) model structure with parameters {θq }, and sν represents the RNN(ν) model structure with parameters {θν }. We seek to derive a measure of parameter sensitivity for RNNs, analogous to that found for linear systems 2.4 and 2.5. For the purposes of illustrating the analysis, let us consider, a single-input–single-output, two-node network of the form given in equation 4.1, y(t) = f (x(t) + a1 y(t − 1) + a2 q−1 f (y(t − 1)))
(4.6)
where f (u) = 1/(1 + e−au ) is a sigmoid activation function5 (the network is shown in Figure 2). 5
We have assumed a sigmoid function with a gain factor a.
174
Andrew D. Back and Ah Chung Tsoi
x(t)
a1
y(t)
a2
∆
∆
Figure 2: A simple two-node recurrent neural network used for the purposes of the analysis. Each circle corresponds to the usual summation and sigmoid activation mapping f (·) and 1 is the discrete-time operator used, that is, q−1 in equation 4.6.
The activation function f (·) approximated by a power series expansion is (Wray & Green, 1994)
f (u) =
∞ X
ξj u j .
(4.7)
j=0
For a finite pth-order expansion around u = 0, we have, µ µ ¶ ¶ 1 3 3 1 5 5 1 1 a u f (u) = + au + − a u + 2 4 48 480 µ ¶ ¡ ¢ 17 7 7 a u + · · · + O up + ², + − 80640
(4.8)
where ² is the residue of the approximation. If the argument au is small,6 the series expansion will be convergent. Thus, if we consider equation 4.6, corresponding to the simple RNN structure, then the following approximation
6
The series approximation will be different for a sigmoid function with a bias term.
A Low-Sensitivity Recurrent Neural Network
175
can be made:7 y(t) =
p X
ξi (x(t) + a1 y(t − 1) + a2 q−1 f (y(t − 1)))i
i=0
≈
p X p p X X
αijk xi (t)y j (t − 1)yk (t − 2),
(4.9)
i=0 j=0 k=0
where {α} are the appropriate parameters found from {ξ, a}. This formulation is derived using the same approach as for the discrete Volterra series, but involves the recursion in y(t) and differs in that there are nested nonlinearities (not just one nonlinear function at the output). Expanding equation 4.9, we have y(t) =
p X
αi00 xi (t) + α010 y(t − 1) + α001 y(t − 2) + · · · + α0p0 yp (t − 1)
i=0
+α00p yp (t − 2) +
X
αijk xi (t)y j (t − 1)yk (t − 2).
(4.10)
06i,j,k6p i+j+k>2
This can be considered in the following form: P p Pp Pp i j k j=0 k=0 αijk x (t)y (t − 1)y (t − 2) i=0 k6=1 j6=1 . y(t) = 1 − α010 q−1 − α001 q−2
(4.11)
The sensitivity of the approximate model depends on the denominator, 0(q) = 1 − α010 q−1 − α001 q−2 ,
(4.12)
of equation 4.11. Since this is a simple second-order polynomial in q−1 , the sensitivity analysis can be carried out in a similar manner to the linear model case, as shown in section 2. Consider the use of alternative discrete-time operators in an RNN. It is self-evident that 0(q) can be replaced by 0(ν), which will result in lower-parameter sensitivity in the feedback transfer function, provided the appropriate parameterization of the alternative discrete-time operator ν is selected (Gevers & Li, 1993). That is, for ¯ A(ν) =
n X
a¯ i ν −i
i=0 7 We are not suggesting that a polynomial approximation has the same representational characteristics as the original network. This is not the case in multilayer feedforward networks (Mhaskar, 1996) and is not true here either. We are simply using the polynomial as a convenient device for the analysis. A similar result could be obtained by using a power series approximation for F(·) in equation 4.1.
176
Andrew D. Back and Ah Chung Tsoi
= ν −n
n Y (ν − λ¯ i )
(4.13)
i=1
q − c1 c2
ν=
(4.14)
we have the result that S¯ ij < Sij
for c2 < 1.0,
(4.15)
where dλ¯ i S¯ ij = d¯aj n−j λ¯ i ¯ ¯ k6=i (λi − λk )
= Q
j−1
= c2
(4.16)
(λi − c1 )n−j Q k6=i (λi − λk )
and |c1 | < 1.0 for stability of ν-operator. Hence it is obvious that a significantly lower sensitivity can be obtained as required by using the ν-operator in place of the usual q-operator. The above analysis is carried out on a simple example, but the method can be applied to general RNNs. The analysis is based on the approximation of the sigmoidal activation function by a power series. Clearly, each power series approximation is only valid instantaneously. The parameter sensitivity measure depends on the instantaneous approximation made locally at the operating point on the curve of each sigmoid function. Hence, we have time-varying feedback transfer functions 0(q; t), where the resultant sensitivity measure is also time varying. In this case, based on the previous analysis, the problem of parameter sensitivity can arise in recurrent neural networks. The above analysis can be summarized in the following observations: • An RNN(q) model may suffer from problems of high-parameter sensitivity. • A low-sensitivity RNN, RNN(ν), can be obtained by replacing the usual shift operators in a standard recurrent network with alternative discrete-time operators that are appropriately parameterized. A learning algorithm for parameterizing the RNN(ν) model is given in the appendix. In practice, however, we have found that it is not always necessary to adapt the parameters in the ν-operator. The parameters can simply
A Low-Sensitivity Recurrent Neural Network
177
System
y(t) S
x(t)
FWL
~
y(t)
M Model
1. M = S ~ 2. M M
Figure 3: The system identification arrangement used for the experiments.
be fixed8 a priori to values that will provide a low-sensitivity mapping, as indicated in equations 4.14 and 4.15. If these parameters are not adapted, then essentially we have simply allowed a new mapping to exist in place of the previously fixed q-operators. If it is desired to adapt these parameters online, then the learning algorithm proposed can be used to adapt the parameters further. In contrast to linear systems, in this article we do not provide a transformation from RNN(q) to RNN(ν) models. It is conjectured that an exact transformation will not exist, although it may be possible to find an approximate transformation with the polynomial approach used above. When designing an RNN, however, using the ν-operator from the start will result in a structure that is less sensitive than a corresponding structure employing the q-operator. 5 Experimental Results 5.1 Experiments. To demonstrate the problems of parameter sensitivity in recurrent networks and the usefulness of the proposed alternative discrete-time operator modification, indicative experimental results are given in this section. The basic approach adopted here is shown in Figure 3. 8 For further discussion on some of the issues concerned with choosing operator weights, see Back and Tsoi (1995).
178
Andrew D. Back and Ah Chung Tsoi
A series of system identification experiments are set up where the system S and the model M are identical. The model is then subject to a finite word-length constraint, and then the output responses are compared when subjected to an impulse input signal. Note that an impulse signal of fixed magnitude is insufficient in general to characterize nonlinear systems of the type we consider here. However, our purpose in these experiments is merely to illustrate the problem of parameter sensitivity and the effectiveness of the proposed solution. Hence in this sense, the experiments fulfill their purpose. If a range of input signals is used, it would be possible to obtain a more accurate characterization of the nature of the parameter sensitivity in a particular model. It is sufficient here, however, to use a single input signal to demonstrate the problem. A series of experiments are performed. As a comparison, we give the results for a linear system. We then examine the performance of the usual RNN(q) model and then the RNN(ν) model. The linear transfer function used in the first of each group of experiments is also used as the linearized underlying transfer function in each of the RNN experiments. That is, the RNNs, when linearized, are the same in each case as the given linear transfer function. For high-sensitivity models, a small perturbation in the parameters can be expected to cause some significant deviation in the model output. For the desired low-sensitivity model, a small perturbation in the parameters will not result in any substantial change in the network output. Thus, the experiments give an indication of the sensitivity of the network with respect to the parameters. Remarks. 1. The main focus is how a sufficiently accurate model will perform under small changes in the weights. 2. The properties of low sensitivity are desirable for many real-world applications and offer the opportunity for improved model accuracy. 5.1.1 Experiment 1.1: LDLF Linear System, Linear Model. The problem considered is a system identification task, closely related to a high bit rate echo cancellation problem (Fan & Li, 1993). The system being modeled is described by the linear transfer function, H(q) =
1−
1.0 , + 0.9801q−2
1.9725q−1
which has poles at r = 0.99, θ = ±5◦ . A linear model was tested using exactly the same structure, but with the weights perturbed by implementing with 16-bit word length, where the round-off method is θ˜ = 2−w round(2w θ),
A Low-Sensitivity Recurrent Neural Network
0.015
4
0.01
2
0.005
el(t)
d(t), y(t)
6
179
0 -2
0
-0.005
-4
-0.01
-6
-0.015 0
100
200
300 t
(a)
400
500
0
500
1000 t
1500
2000
(b)
Figure 4: Linear shift operator model results for experiment 1 showing (a) output and (b) error. In this example, it is clear that there is only a small error initially, which dies away with the output signal.
where θ is the original weight, θ˜ is the FWL weight, and w is the word length in bits (Williamson, 1988). This perturbation causes a change in the weights in only the fifth or sixth decimal place. We then test the model with an impulse input. The results of this experiment are observed in Figure 4, where the error resulting from the FWL approximation of the weights can be clearly seen.9 5.1.2 Experiment 1.2: RNN(q) LDLF System, RNN(q) Model. The problem considered is an extension of that considered in experiment 1.1. In this case, however, the system being modeled is an RNN, which is obtained from a linearized RNN model having the same transfer function as in the previous experiment. The model used is an RNN(q) model, which is identical with the weights also perturbed using 16 bits. As before, an impulse input is used. In this case, the RNN(q) model has a limit cycle operation. The results shown in Figure 5 indicate a significant error, which is in fact peaking at values greater than the magnitude of the output signal. 5.1.3 Experiment 1.3: RNN(ν) LDLF System, RNN(ν) Model. In this experiment we compare the performance of the RNN(ν) model with that of the RNN(q) model. Since there is no straightforward transformation between the weights of an RNN(q) model and an RNN(ν) model, we simply use the transformation of the underlying instantaneous small signal model to obtain the weights. Thus, the linear models are identical, but the RNNs 9 A convenient way to visualize the effect of FWL weights and the results of the proposed network is to compare the respective outputs of the models: one with infinite word-length precision and the other with finite word-length precision.
Andrew D. Back and Ah Chung Tsoi
2 1.5 1 0.5 0 -0.5 -1 -1.5 -2
e(t)
d(t), y(t)
180
0
500
1000 t
1500
2000
2 1.5 1 0.5 0 -0.5 -1 -1.5 -2 1000
(a)
1250
1500 t
1750
2000
(b)
Figure 5: RNN(q) model results for experiment 1 showing (a) output and (b) error. In this example, it is clear that there is a very large error relative to the output signal. The error persists much more strongly than in the linear model due to the presence of limit cycles.
are not. Although we do not have exactly the same model as in experiment 1.2, this experiment serves to gain an indication of the effect of modeling an RNN formed from an underlying linearized LDLF system. The system model is an RNN(ν) structure with weights described by the linearized transfer function, H(ν) =
β0 + β1 ν −1 + β2 ν −2 , 1 + α1 ν −1 + α2 ν −2
β0 = 1.000000, β1 = 0.500000, β2 = 0.062500, α1 = -1.965576, and α2 = 0.977523. An RNN(ν) model was tested using exactly the same structure, but with the weights perturbed using a 16-bit word length. The particular ν-operator tested was a gamma operator (equation 2.9) with c = 0.8. The results for this experiment, shown in Figure 6, clearly show that the model approximates the system with a much smaller error than before. 5.1.4 Experiment 2.1: Wide-band LDLF Linear System, Linear Model. Recently there has been some interest in modeling systems of different time scales. Hence, we consider here the task of modeling a wide-band system that includes an LDLF pole pair. The system is a linear model described by H(q) =
1 + a1
q−1
1.0 , + a2 + a3 q−3 + a4 q−4 q−2
(5.1)
with weights a1 = −2.0770354271, a2 = 2.1843614578, a3 = −2.0710115433, a4 = 0.9781407714, and poles at: r1 = 0.99, θ = ±5◦ , r2 = 0.999, θ = ±87◦ . A linear model was tested using exactly the same structure, but with the weights perturbed as before using a 16-bit word length. An impulse input
A Low-Sensitivity Recurrent Neural Network
181
1.5
0.2 0.1
0.5
ev(t)
d(t), y(t)
1
0 -0.5
0 -0.1
-1 -1.5 1500
-0.2 1500 1600
1700
1800
1900
1600
2000
1700
1800
1900
2000
t
(a)
(b)
Figure 6: RNN(ν) model results for experiment 1 showing (a) output and (b) error. Here it is clear that the error is more than 10 times smaller than that for the RNN(q) model (the scale for the error in this case is one-tenth that used for the RNN(q) model).
1.5e-05
0.006 0.004 0.002 0 -0.002 -0.004 -0.006
1e-05
el(t)
d(t), y(t)
5e-06 0
-5e-06 -1e-05 -1.5e-05 0
100
200
300 t
(a)
400
500
0
500
1000 t
1500
2000
(b)
Figure 7: Linear shift operator model results for experiment 2.1 showing (a) output and (b) error.
signal was used. The output and error plots resulting from this experiment are observed in Figures 7a and b, respectively. 5.1.5 Experiment 2.2: Wide-Band RNN(q) LDLF System, RNN(q) Model. The problem considered is an extension of that considered in experiment 2.1. As in experiment 1.2, the system being modeled is an RNN obtained from a linearized RNN model having the same transfer function as in the previous experiment. The model used is an RNN(q) model identical to the system but with the weights perturbed using 16-bit finite word length. An impulse input signal is used to test the model. In this case, the RNN(q) model has a limit cycle operation. The small
2 1.5 1 0.5 0 -0.5 -1 -1.5 -2 1500
e(t)
Andrew D. Back and Ah Chung Tsoi
d(t), y(t)
182
1600
1700
1800
1900
2000
2 1.5 1 0.5 0 -0.5 -1 -1.5 -2 1000
t
(a)
1250
1500 t
1750
2000
(b)
Figure 8: RNN(q) model results for experiment 2.2 showing (a) output and (b) error. The perturbed RNN(q) model has a significant error, brought about by a change in value of the weights in only the fifth or sixth decimal place.
perturbation in the coefficients causes a surprisingly large change in the model performance, as indicated in Figure 8. The limit cycle of the model is at a different frequency from that of the model, resulting in a significant error. 5.1.6 Experiment 2.3: Wide-Band RNN LDLF(ν) System, RNN(ν) Model. This experiment follows the same approach as used in experiment 1.3, where we transform the model in equation 5.1 to the transfer function H(ν) =
β0 + β1 ν −1 + β2 ν −2 + β3 ν −3 + β4 ν −4 , 1 + α1 ν −1 + α2 ν −2 + α3 ν −3 + α4 ν −4
where β0 = 1.000000, β1 = 1.000000, β2 = 0.375000, β3 = 9.781250, β4 = 0.003906, α1 = −1.596294, α2 = 1.840844, α3 = −2.762717, and α4 = 1.553459. The weights in this transfer function are used in the RNN(ν) model. An RNN(ν) model was tested as in the previous experiment using a 16bit word length with the same ν-operator and input signal. Once again, the results, observed in Figure 9, indicate that the RNN(ν) model does not have the same level of difficulty that the RNN models using the conventional shift operator have. We conclude that the ν-operator appears to be quite successful in overcoming the problems associated with modeling LDLF systems using an RNN(q) model. In our experiments, it was observed that, as expected, the closer the gamma operator was to the shift operator (c = 1), the worse the error performance was (see Figure 10). 5.2 Discussion. The experiments presented here serve to verify the problem of parameter sensitivity in RNNs. An important point is that it is possible for even a simple recurrent network, (which can form part of a larger network) to be subject to parameter sensitivity problems. As observed in
A Low-Sensitivity Recurrent Neural Network
1.5
183
0.025 0.01
0.5
ev(t)
d(t), y(t)
1
0 -0.5 -1 -1.5 1500
0
-0.01 -0.025 1500 1600
1700
1800
1900
1600
2000
1700
1800
1900
2000
t
(a)
(b)
NMSE
Figure 9: RNN(ν) model results for experiment 2.3 showing (a) output and (b) error. The RNN(ν) model provides a much more accurate model, which is robust under the changes in weights.
1 0.1 0.01 0.001 0.0001 1e-05 1e-06 1e-07 0
0.25
0.5 c
0.75
1.0
Figure 10: Normalized mean square error versus the gamma operator c parameter for the RNN(ν) model. The relationship, although nonlinear, shows a clear trend of improvement as the gamma operator translates and dilates the region of high sensitivity in the q-plane to a region of lower sensitivity in the γ -plane.
the examples given here, the problem of high sensitivity is clearly present in RNNs. A further interesting observation is that the problem can be much greater in an RNN than in linear models due to the presence of limit cycles.10 Although the linear model has errors that will die away, in the RNN, limit cycle behavior is encountered, and the effect of the errors propagates throughout the entire structure, persisting very strongly. The robustness of the new RNN(ν) model in comparison to the standard RNN(q) model is clear. The original RNN(q) model has a significantly higher sensitivity than the RNN(ν) model. The high sensitivity in the RNN(q) 10 To see a similar effect in a linear model, the poles would need to lie exactly on the unit circle.
184
Andrew D. Back and Ah Chung Tsoi
model causes significant error when a change in the weights occurs in only the fifth or sixth decimal place. The RNN(ν) model, on the other hand, appears to be much more robust, and this may be a significant advantage in practical applications. Parameter sensitivity in RNNs is also relevant to the issue of noise occurring in such models. Noise can occur in a number of ways: measurement noise; noise in the inputs, outputs, or hidden outputs; quantization noise; analog-digital converter noise; and so on. If some perturbation occurs in a model or learning algorithm due to noise, then it is desirable that it has a negligible effect. For a highly sensitive model, however, the problem of noise may be more serious than for a low-sensitivity model. In a high-sensitivity model, the same amount of noise that leads to perturbation of a weight can result in a much greater disturbance to the model than would occur in a less sensitive model. Hence, this means that it could take longer for the model to “recover” and result in larger errors during this transient period. If there is continual noise in a high-sensitivity model, convergence will be more difficult. A low-sensitivity model, on the other hand, will be less sensitive to this type of noise-induced disturbance. 6 Conclusions RNNs based on the shift operator have been shown to have problems when modeling systems with a high sensitivity in a linear systems sense. We have introduced an RNN architecture based on alternative discrete-time operators, which substantially lowers the parameter sensitivity of the network and hence overcomes the difficulties associated with the networks based on the shift operator. The lower-sensitivity properties have been established by an approximation analysis of the networks. We have indicated some avenues for further detailed study of these properties. Experiments have been used to verify the advantages of the proposed architecture. Appendix: A Learning Algorithm for the RNN(ν) Model A gradient-descent-based algorithm for updating the weights in an RNN has been derived previously in Williams and Zipser (1989) and Robinson (1989). These methods are based on minimizing an instantaneous cost function using the output error. Other algorithms have been derived, including those based on second-order methods, typically employing an extended Kalman-filter-based approach (Puskorius & Feldkamp, 1992). In this article, we give the derivation for a ν-operator-based RNN, which is based on minimizing an instantaneous cost function. We now indicate the weight update equations. The weights are adjusted using a simple gradient method: wij (t + 1) = wij (t) + 1wij (t)
(A.1)
A Low-Sensitivity Recurrent Neural Network
185
ci (t + 1) = ci (t) + 1ci (t) 1wij = −η =η
(A.2)
∂ J(t) ∂wij
X
en (k)
n∈U
∂yn (t) ∂wij
(A.3)
∂ J(t) ∂ci X ∂yn (t) =η en (k) ∂ci n∈U
1ci = −η
(A.4)
where w and c are the network weights and operator weights, respectively (the same operators are used throughout the network). The cost function J(t) is defined for the output units as J(k) =
1X 2 e (k) 2 n n
(A.5)
en (k) = dn (k) − yn (k)
n∈U
(A.6)
Using a standard algorithmic trick (Williams and Zipser, 1989) to compute the derivative ∂yk /∂wij , new dynamical systems are defined as pnij (t) =
∂yn (t) ∂wij
(A.7)
rni (t) =
∂yn (t) ∂ci
(A.8)
pnij (0) = 0
(A.9)
rni (0) = α
(A.10)
Hence "
X
0
pnij (t) = f (xˆ n (t))
rni (t) = f (xˆ n (t))
wnl ν
wnl ν
−1
plij (t) + δin zj (t)
l
" 0
# −1
X
(A.11)
# rli (t)
(A.12)
l
The learning algorithm for the RNN(ν) model is given by the equations in this appendix. Note that α is some positive definite constant, required in order to allow adaptation of ci .
186
Andrew D. Back and Ah Chung Tsoi
Acknowledgments The first author acknowledges financial support from the Australian Research Council and the Frontier Research Program (RIKEN), Japan. The second author acknowledges partial support from the Australian Research Council. We also thank the anonymous reviewers for their helpful comments. References Agarwal, R., & Burrus, C. (1975). New recursive digital filter structures having very low sensitivity and roundoff noise. IEEE Trans. Circuits, Syst., CAS22(12), 921–927. Alippi, C., Piuri, V., & Sami, M. (1995). Sensitivity to errors in artificial neural networks: A behavioral approach. IEEE Trans. Circuits, Syst. I: Fundamental Theory and Applications, 42(6), 358–361. Astrom, K., Hagander, P., & Sternby, J. (1984). Zeros of sampled systems. Automatica, 20(1), 31–38. Back, A. D, & Tsoi, A.-C. (1994). A comparison of discrete-time operator models for nonlinear system identification. In D. S. T. G. Tesauro & T. K. Leen (Eds.), Advances in neural information processing systems 7 (pp. 883–890). Cambridge, MA: MIT Press. Back, A. D., & Tsoi, A.-C. (1995). Constrained pole-zero filters as discrete-time operators for system approximation. In E. M. F. Girosi, J. Makhoul, & E. Wilson (Eds.), Proc. of the 1995 IEEE Workshop Neural Networks for Signal Processing 5 (NNSP95) (pp. 191–200). New York: IEEE Press. Back, A. D, Wan, E., Lawrence, S., & Tsoi, A.-C. (1994). A unifying view of some training algorithms for multilayer perceptrons with FIR filter synapses. In J. H. J. Vlontzos & E. Wilson (Eds.), Proc. of the 1994 IEEE Workshop Neural Networks for Signal Processing 4 (NNSP94) (pp. 146–154). New York: IEEE Press. Choi, J., & Choi, C.-H. (1992). Sensitivity analysis of multilayer perceptrons with differentiable activation functions. IEEE Trans. Neural Networks, 3, 101–107. Dahlquist, G., & Bjorck, A. (1974). Numerical methods. Englewood Cliffs, NJ: Prentice Hall. de Vries, B., & Principe, J. (1992). The Gamma model—a new neural model for temporal processing. Neural Networks, 5(4), 565–576. Dundar, ¨ G., & Rose, K. (1995). The effects of quantization on multilayer perceptrons. IEEE Trans. Neural Networks, 6(6), 1446–1451. Fan, H., & Li, Q. (1993). A δ-operator recursive gradient algorithm for adaptive signal processing. In Proc. IEEE Int. Conf. Acoust. Speech, Signal Proc. (Vol. III, pp. 492–495). New York: IEEE Press. Gevers, M., & Li, G. (1993). Parameterizations in control, estimation and filtering problems: Accuracy aspects. London: Springer-Verlag. Gold, B., & Radar, C. (1969). Digital processing of signals. New York: McGraw-Hill. Goodwin, G., Middleton, R., & Poor, H. (1992). High-speed digital signal pro-
A Low-Sensitivity Recurrent Neural Network
187
cessing and control. Proc. IEEE, 80(2), 240–259. Heuberger, P., Bosgra, O., & Van den Hof, P. (1995). A generalized orthonormal basis for linear dynamical systems. IEEE Trans. Automat. Control, AC-40(3), 451–465. Kerlirzin, P. & R´efr´egier, P. (1995). Theoretical investigation of the robustness of multilayer perceptrons: Analysis of the linear case and extension to nonlinear networks. IEEE Trans. Neural Networks, 6, 560–571. Mantey, P. (1968). Eigenvalue sensitivity and state-variable selection. IEEE Trans. Automat. Control, AC-13(3), 263–269. Mhaskar, H. (1996). Neural networks and approximation theory. Neural Networks, 9(4), 721–722. Middleton, R., & Goodwin, G. (1990). Digital control and estimation. Englewood Cliffs, NJ: Prentice-Hall. Minai, A., & Williams, R. (1994). Perturbation response in feedforward networks. Neural Networks, 7(5), 783–796. Oh, S.-H., & Lee, Y. (1995). Sensitivity analysis of single hidden-layer neural networks with threshold functions. IEEE Trans. Neural Networks, 6, 1005–1007. Orlandi, G., & Martinelli, G. (1984). Low-sensitivity recursive digital filters obtained via the delay replacement. IEEE Trans. Circuits, Syst., CAS-31, 654–657. Orzechowski, N., Kumara, S., & Das, C. (1993). Performance of multilayer neural networks in binary-to-binary mappings under weight errors. In Proc. ICNN93, San Francisco (pp. 1684–1689). New York: IEEE Press. Palaniswami, M. (1989). A new discrete-time operator for digital estimation and control (Tech. Rep. No. 1). University of Melbourne, Department of Electrical Engineering. Principe, J., de Vries, B., & de Oliveira, P. G. (1993). The Gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Trans. Signal Processing, 41, 649–656. Puskorius, G. V., & Feldkamp, L. A. (1992). Model reference adaptive control with recurrent networks trained by the dynamic DEKF algorithm. In Proceedings of the 1992 International Joint Conference on Neural Networks, Baltimore, Maryland. Rabiner, L., & Gold, B. (1975). Theory and application of digital signal processing. Englewood Cliffs, NJ: Prentice Hall. Roberts, R., & Mullis, C. (1987). Digital signal processing. Reading, MA: AddisonWesley. Robinson, A. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Cambridge University. Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity analysis of feedforward neural networks to weight errors. IEEE Trans. Neural Networks, 1, 71–90. Vidyasagar, M. (1978). Nonlinear systems analysis. Englewood Cliffs, NJ: Prentice Hall. Von Lehman, A., Paek, E., Liao, P., Marrakchi, A., & Patel, J. (1989). Influence of interconnection weight discretization and noise in an optoelectronic neural network. Optics Letters, 14, 928–930. Wahlberg, B. (1991). System identification using Laguerre models. IEEE Trans. Automat. Control, 36, 551–562.
188
Andrew D. Back and Ah Chung Tsoi
Wang, K., & Michel, A. (1994). Robustness and perturbation analysis of a class of artificial neural networks. Neural Networks, 7(2), 251–259. Williams, R., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. Williamson, D. (1988). Delay replacement in direct form structures. IEEE Trans. Acoust., Speech, Signal Processing, 34(4), 453–460. Wray, J., & Green, G. (1994). Calculation of the Volterra kernels of non-linear dynamic sytems using an artificial neural network. Biological Cybernetics, 71, 187–195. Xie, Y., & Jabri, M. (1992). Analysis of the effects of quantization in multilayer neural networks using a statistical model. IEEE Trans. Neural Networks, 3, 334–338. Received May 28, 1996; accepted March 7, 1997.
Communicated by Geoffrey Goodhill
Fixed-Point Attractor Analysis for a Class of Neurodynamics Jianfeng Feng David Brown Biomathematics Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K.
Nearly all models in neural networks start from the assumption that the input-output characteristic is a sigmoidal function. On parameter space, we present a systematic and feasible method for analyzing the whole spectrum of attractors—all-saturated, all-but-one-saturated, all-but-twosaturated, and so on—of a neurodynamical system with a saturated sigmoidal function as its input-output characteristic. We present an argument that claims, under a mild condition, that only all-saturated or allbut-one-saturated attractors are observable for the neurodynamics. For any given all-saturated configuration ξ (all-but-one-saturated configuration η) the article shows how to construct an exact parameter region R(ξ ) ¯ ¯ (R(η)) such that if and only if the parameters fall within R(ξ ) (R(η)), then ξ (η) is an attractor (a fixed point) of the dynamics. The parameter region for an all-saturated fixed-point attractor is independent of the specific choice of a saturated sigmoidal function, whereas for an all-but-one-saturated fixed point, it is sensitive to the input-output characteristic. Based on a similar idea, the role of weight normalization realized by a saturated sigmoidal function in competitive learning is discussed. A necessary and sufficient condition is provided to distinguish two kinds of competitive learning: stable competitive learning with the weight vectors representing extremes of input space and being fixed-point attractors, and unstable competitive learning. We apply our results to Linsker’s model and (using extreme value theory in statistics) the Hopfield model and obtain some novel results on these two models.
1 Introduction Neural networks provide a systematic approach to massively parallel computation, as well as a possibly better understanding of brain function. Interestingly, to date, sigmoidal functions have been utilized in the vast majority of neural networks as the input-output characteristic (Sejnowski, 1995), on either the state space or the weight space. The sigmoidal function is nearly saturated outside a region by suitably adjusting some parameters of the function. In this article, we first report our recent work on neurodynamics Neural Computation 10, 189–213 (1998)
c 1997 Massachusetts Institute of Technology °
190
Jianfeng Feng and David Brown
defined by yi (t+1) = f (yi (t)+
N X
(aij +k2 )rj yj (t)+k1 ), i = 1, . . . , N, t = 1, . . . , (1.1)
j=1
where y(t) = (yi (t), i = 1, . . . , N) ∈ RN , A = (aij , i, j = 1, . . . , N) an N × N matrix representing interaction between units (either weights or states), (k1 , k2 ) are two key parameters of the dynamics, N can be thought of as either the number of neurons or the number of synaptic efficacies connected to the ith neuron, R = (ri δij , i, j = 1, . . . , N) plays the role of normalization with ri > 0, i = 1, . . . , N, and f is a saturated sigmoidal function which is continuous and defined by1 ymax f (x) = f (x) ymin
if x > ymax f (x) is strictly increasing for x ∈ [ymin , ymax ] . if x < ymin
(1.2)
As has been amply shown before (Feng 1995, 1997; Feng, Pan, & Roychowdhury, 1996; Feng & Tirozzi 1995b, 1997b) and as further demonstrated here, the dynamical system (see equation 1.1) reflects the main features of dynamics using sigmoidal functions. We introduce the Nemytskij operator (Feng & Hadeler, 1996) F: RN → N R acting coordinate-wise (F(y))j = f (yj ), y = (y1 , . . . , yN ), j = 1, . . . , N.
(1.3)
Dynamics 1.1 reads y(t + 1) = F(y(t) + (A + K2 )R(y(t))0 + k1 )
(1.4)
for k1 = (k1 , . . . , k1 ), (·)0 representing the transpose of a vector and 1 ··· 1 . ··· K2 = k2 1 ··· 1 Dynamics 1.4 is characterized by the following three properties: P1: There is a linear operation given by the matrix (A + K2 )R. P2: There is a simple nonlinear function, saturated outside [ymin , ymax ]N , making dynamics 1.4 a nontrivial one. 1 Some biological implications of dynamics (see equation 1.1) can be found in sections 4 and 5, as well as in Feng (1995), Feng and Hadeler (1996), Feng et al. (1996), and Feng and Tirozzi (1997a, 1997b).
Fixed-Point Attractor Analysis
191
P3: There is monotonicity inside the region [ymin , ymax ]N with respect to a partial ordering of the underlying space. A variety of methods (Hui & Zak, 1992; MacKay & Miller, 1990) have been developed in recent years for exploring different aspects of the properties of dynamics 1.4 with some further assumptions on f or A. In Feng (1995), Feng et al. (1996), and Feng and Tirozzi (1995b, 1997b) the behavior of a neurodynamical system with a limiter function as its input-output characteristic is analyzed. Nevertheless, there are two severe restrictions that prevent a wide application of the approach presented in the latter set of articles. 1. Limiter functions have been applied in a few models, and because of this, linear analysis can be carried out and informative results obtained, as in MacKay and Miller (1990) and Miller and MacKay (1994). But nearly all models of neural networks start from the assumption that the firing rate is a sigmoidal function of the summed inputs rather than a linear function, although both allow single neurons to make linear discriminations in the space of input features. 2. All results obtained in Feng (1995), Feng et al. (1996), and Feng and Tirozzi (1995b, 1997b) are based on an assumption that we exclusively consider the set of all saturated attractors and tacitly suppose that this is a generic case. This assumption is partly confirmed by Linsker (1986) for the limiter function, using both numerical simulation and theoretical proof. But he also points out the possibility of the emergence of all-but-one-saturated attractors, even in the limiter function case. When we consider the (more general) sigmoidal function case (see equation 1.4), we are not certain, at least at a first glance, what form of attractor occurs in general. With the aim of providing a systematic and feasible tool for grasping some informative properties of dynamics 1.4, here we generalize the saturated fixed point attractor analysis on parameter space with limiter functions developed in Feng (1995), Feng et al. (1996), and Feng and Tirozzi (1995b, 1997b) to fixed-point attractor analysis with saturated sigmoidal functions. By this we mean the following: • We perform an analysis on parameter space for the whole set of fixedpoint attractors rather than the set of saturated fixed-point attractors. We provide an argument to claim that we are only likely to observe the set of all-saturated or all-but-one-saturated attractors for dynamics 1.4. In other words, the generic outcome of dynamics 1.4 is an all-saturated or all-but-one-saturated attractor. • We derive a necessary and sufficient condition to test whether a given all-saturated (all-but-one-saturated) state is an attractor (fixed point) or not for any given set of system parameters. This result enables us to study Linsker’s model in a more general setting and consider the set
192
Jianfeng Feng and David Brown
of all-saturated attractors, as well as the set of all-but-one-saturated fixed points. Using extreme value theory in statistics, we give an exact parameter region for the threshold of the Hopfield model within which a stored pattern is an attractor. The first property P1 of dynamics 1.4 is the linearity of the operation (A + K2 )R, which includes many learning rules in neural networks. For a learning rule of this kind, Miller and MacKay (1994) carried out a detailed discussion directed at elucidating the effect of different constraints— subtractive and divisive. They show that divisive enforcement causes the weight pattern to tend to the principal eigenvector of the synaptic development operator(matrix AR), whereas subtractive enforcement causes almost all weights to reach their minimum or maximum values. Our results on dynamics 1.4 partly serve as a complement of their results: under an exactly given condition, we assert that only all-saturated or all-but-one-saturated attractors are possible outcomes of the dynamics, similar to the situation of subtractive enforcement. Competitive learning, however, involves choosing a winner violating P1, and thus does not succumb to the analysis above. Goodhill and Barrow (1994) have shown that for uniformly distributed inputs, divisive enforcement leads to weight vectors becoming evenly distributed through the space, while subtractive enforcement leads to weight vectors tending to the axes of the space. Here, by applying the idea for analyzing dynamics 1.4, we consider fixed-point attractors of competitive learning subject to the constraints of a saturated sigmoidal function. We verify that for saturated sigmoidal function constraints, the outcomes of competitive learning are divided into two essentially different classes. One is stable, which means the weights will finally settle down on extremes of input signals; another is unstable, in which the weights wander around the input space, although with a high probability that a subset of the weights stays at the extremes. In section 2 and section 3, we answer the following questions, respectively: • Under what condition is dynamical system 1.4 globally stable? • Supposing it is stable, what is the form of its final outcome? Section 4 is devoted to applications of results in section 3; Linsker’s model, the Hopfield, and a continuous time model are dealt with. In section 5 we take into account competitive learning. For easy access to material in this article, some technical details, in particular heavily mathematical manipulations, are omitted; they will be partly published in mathematical journals. 2 Global Stability of Dynamics 1.4 The first question we encounter here is the global stability of dynamics 1.4: whether it converges to a fixed-point attractor starting from an arbitrary
Fixed-Point Attractor Analysis
193
state. This is one of the central themes in discussing primary properties of simple dynamics like 1.4 and many results based on Lyapunov functions are available. A difficulty when proving a given function is a Lyapunov function of dynamics 1.4 lies in the fact that the saturated sigmoidal function is not differentiable—at least at two points: x = ymin and x = ymax —and so Taylor series expansion is not valid. This difficulty can be overcome by using Legendre-Fenchel transformation (Feng, 1997), instead of Taylor series expansion. 2.1 Distributed Dynamics. For a given network model, which means we have fixed f, N, A, k1 , K2 , R, it is possible for us to associate substantially different dynamics with it. Two of the most commonly used are synchronous dynamics and asynchronous dynamics. A generalization of them is distributed dynamics (Herz & Marcus, 1993; Marcus & Westervelt, 1989). P For any subset U of {1, . . . , N} we associate a probability p(U) with it and U⊂{1,...,N} p(U) = 1. Subset U is selected with probability p(U), and the distributed dynamics z(t) is defined by ( zi (t + 1) =
PN f ( j=1 (aij + k2 )rj zj (t) + k1 ) zi (t)
if i ∈ U otherwise.
(2.1)
When P(U) = 1 with U = {1, . . . , N} distributed dynamics turns out to be synchronous dynamics, and when U = {1}, . . . {N} distributed dynamics is asynchronous dynamics. There is also a different formulation of dynamics 1.4. The following dynamics, y(t + 1) = (B + K2 )R(F(y(t)))0 + k1 ,
(2.2)
is equivalent to dynamics 1.4; that is, they have the identical trajectories (Feng & Hadeler, 1996). Here B = (bij := aij + δij /ri , i, j = 1, . . . , N). This conclusion implies that all results below have a similar version for dynamics 2.2. The equivalence between dynamics 1.4 and 2.2 will be used in subsection 4.3 again. 2.2 Lyapunov Functions. Under the assumption that the matrix A is symmetric, we are able to define a Lyapunov function for the distributed dynamics defined in the previous subsection. Theorem 1.
Let a¯ ij = aij + k2 . The function
L(z(t)) =
N Z X j=1
zj (t) 0
rj f −1 (y)dy −
N N X 1X aji zj (t)zi (t)rj ri − rj k1 zj (t) 2 j,i=1 j=1
194
Jianfeng Feng and David Brown
is a Lyapunov function (supermartingale) of dynamics 1.4 provided that a¯ ii ≥ 0, i = 1, . . . , N The proof of Theorem 1 is similar to that of Theorem 1 in Feng (1997), although there we consider only asynchronous dynamics. Theorem 1 is thus a generalization of results in Feng (1997) and Herz and Marcus (1993). 3 Fixed-Point Attractors of Dynamics 1.4 The second question we encounter is which kind of attractors will appear provided dynamical system 1.4 converges to a fixed-point attractor. We answer it in this section. 3.1 All-Saturated Fixed-Point Attractors. The following definition and theorem are keys for further development of this article. Without loss of generality, we suppose that ymin = −1 and ymax = 1. As we already pointed out in Feng (1995, 1997), Feng et al. (1996), and Feng and Tirozzi (1995b, 1997b) saturated states in the space {−1, 1}N represent the most common outcome of many learning and retrieval models of neural networks, and so we address the following definition. Definition 1. if k1 +
N X
A fixed-point attractor y ∈ {−1, 1}N is called a saturated attractor
(aij + k2 )rj yj 6= 0 for all i.
(3.1)
j=1
Restriction 3.1 will be relaxed gradually in the following subsection. The case that one unit violates condition 3.1 is dealt with in the next subsection; the case of more than one unit’s not satisfying 3.1 is discussed in subsection 3.3. Motivated by the Hopfield model (see section 4.2) we introduce the following definition: The quantity X X aij rj − aij rj hi (y) :=
Definition 2.
j∈J+ (y)
(3.2)
j∈J− (y)
is called the local field of the ith neuron where J+ (y) = {i, yi = 1}, J− (y) = {i, yi = −1}. We say that there is a local field gap between neurons in J+ (y) and J− (y) if and only if min hi (y) > max hi (y). −
i∈J+ (y)
i∈J (y)
(3.3)
Fixed-Point Attractor Analysis
195
In spite of the fact that dynamics 1.4 is a generalization of what we consider in Feng et al. (1996), the proof of the following theorem is similar to that of Theorem 2 in Feng et al. (1996). However, we sketch a proof of it here (see appendix A) since it is essential for understanding the rest of this article. Theorem 2.
y is a saturated attractor of dynamics 1.4 if and only if
d1 (y) < k1 + k2 c(y) < d2 (y), P
where the slope function c(y) = [
(3.4)
j∈J− (y) rj
−
P
j∈J+ (y) rj ]
and two intercept func-
tions ½ d1 (y) =
maxi∈J+ (y) [−hi (y)] −∞
if J+ (y) 6= φ otherwise
(3.5)
mini∈J− (y) [−hi (y)] ∞
if J− (y) 6= φ otherwise.
(3.6)
and ½ d2 (y) =
In other words, a saturated state y is a saturated fixed-point attractor of dynamics 1.4 if and only if there is a local field gap between neurons in J+ (y) and J− (y). See also Theorem 2 in Feng et al. (1996). These two functions d2 and d1 were introduced in 1993 (see Feng et al. 1996 and references in it), but their physical meaning, extremes of local fields, is clear only after we apply Theorem 2 to the Hopfield model in section 4.2. 3.2 All-But-One-Saturated Attractors. Now we consider the set of allbut-one-saturated attractors. Without loss of generality we assume that y1 ∈ (−1, 1) is the only unsaturated state and yi ∈ {−1, 1} with k1 +
N X
(aij + k2 )rj yj 6= 0
(3.7)
j=1
for i 6= 1. Since yi , i 6= 1, are saturated, fulfilling condition 3.7, our arguments of the previous subsection hold, which imply that d¯1 (y) < k1 + c¯(y)k2 < d¯2 (y),
(3.8)
196
Jianfeng Feng and David Brown
for c¯(y) ¯ d2 (y) d¯1 (y)
=
X j∈J− (y)
= =
rj −
min
i∈J− (y)
rj + y1 r1
j∈J+ (y)
X
aij rj −
j∈J− (y)
max +
i∈J (y)
X
X
X
aij rj + ai1 y1 r1
j∈J+ (y)
aij rj −
j∈J− (y)
X
(3.9)
aij rj + ai1 y1 r1 .
j∈J+ (y)
¯ and d’s: the Note that there is a slight difference between the definition of d’s maximum and minimum for d2 and d1 is taken over a set of N elements, but for d¯2 and d¯1 , it is over a set of N − 1 elements. For y1 we have the following identity, N X (a1j + k2 )rj yj + k1 , y1 = f y1 + j=1
or equivalently d(1) (y) = k1 + c¯(y)k2 = f −1 (y1 ) − y1 −
X
a1j rj +
j∈J+ (y)
X
a1j rj − a11 y1 r1 .
(3.10)
j∈J− (y)
Hence the parameter region of (k1 , k2 ) in which y, an all-but-one-saturated state, is a fixed point of dynamics 1.4 is not empty if and only if d¯1 (y) < d(1) (y) < d¯2 (y).
(3.11)
Under condition 3.11, the parameter region for y to be a fixed point of dynamics 1.4 is the line (see Figure 1) given by {(k1 , k2 ): k1 + c¯(y)k2 = d(1) (y)}.
Theorem 3. Under condition 3.7, an all-but-one-saturated configuration y is a fixed point of dynamics 1.4 if and only if (k1 , k2 ) is in the set {(k1 , k2 ): d¯1 (y) < k1 + c¯(y)k2 = d(1) (y) < d¯2 (y)}.
(3.12)
Fixed-Point Attractor Analysis
197
k2
d1(y)
d(1)(y) d2(y)
k1
Figure 1: The parameter region in which y, an all-but-one-saturated configuration, is a fixed point of dynamics 1.4 is the line(dark line) inside the band (filled region).
Remark 1. For an all-saturated configuration y except one unit, say y1 , which violates restriction 3.1, k1 +
N X
(a1j + k1 )rj yj = 0,
(3.13)
j=1
we have a similar conclusion as Theorem 3, namely, the parameter region in which y is a fixed point is line 3.12 inside a band. 3.3 Other Forms of Attractors. For concreteness of expression, we assume that y1 , y2 ∈ (−1, 1), the only two unsaturated states, and yi ∈ {−1, 1} with the property X (aij + k2 )rj yj 6= 0, (3.14) k1 + j
for i 6= 1, 2. After proceeding as above for all-but-one-saturated configurations, we readily see that a necessary and sufficient condition for y to be a
198
Jianfeng Feng and David Brown
fixed point of dynamics 1.4 is d˜2 (y) > k1 + k2 c˜(y) > d˜1 (y) and
(2) d (y) := k1 + k2 c˜(y) 1 d(2) (y) 2
where c˜(y) ˜ d2 (y) d˜1 (y)
:= k1 + k2 c˜(y)
(3.15) X
=
a1j rj −
j∈J− (y)
−a12 y2 r2 + − y1 X X a2j rj − a1j rj − a21 y1 r1
=
−a22 y2 r2 + X j∈J− (y)
= =
rj −
min
i∈J− (y)
max +
i∈J (y)
X
j∈J+ (y) f −1 (y2 )
X X
j∈J− (y)
(3.16)
− y2 ,
rj + y1 r1 + y2 r2
j∈J+ (y)
aij rj −
j∈J− (y)
a1j rj − a11 y1 r1
j∈J+ (y) f −1 (y1 )
j∈J− (y)
=
X
X
aij rj + ai1 y1 r1 + ai2 y2 r2
j∈J+ (y)
aij rj −
X
(3.17)
aij rj + ai1 y1 r1 + ai2 y2 r2 .
j∈J+ (y)
An interesting new phenomenon occurs: the two lines corresponding to the two unsaturated states defined by equation 3.16 are parallel, which indicates that as long as (2) d(2) 1 (y) 6= d2 (y),
(3.18)
then the parameter region in which y is a fixed point of dynamics 1.4 is empty (see Figure 2). The fulfillment of equation 3.18 is a generic situation essentially depending on the property of the matrix A. When f (x) = x, x ∈ R, this conclusion has been confirmed by Linsker (1986) in his numerical simulation and theoretical proof. Theorem 4.
If and only if (k1 , k2 ) is in the following set,
(2) ˜ {(k1 , k2 ); d˜1 (y) < k1 + c˜(y)k2 = d(2) 1 (y) = d2 (y) < d2 (y)},
(3.19)
an all-but-two-saturated configuration y is a fixed point of dynamics 1.4. Remark 2. If for an all-saturated configuration, there are two saturated units not satisfying equation 3.1 or an all-but-one-saturated configuration with one saturated unit violating equation 3.7, we have a similar conclusion as in Theorem 4.
Fixed-Point Attractor Analysis
199
k2 k +k ~c(y)=d(2) (y) 1 2 1
~ d1(y) ~ d2(y)
k1
(2) k1 +k2~c(y)=d 2 (y)
Figure 2: The parameter region in which y, an all-but-two-saturated configuration, is a fixed point of dynamics 1.4 not empty only when the two parallel lines (dark lines) inside the band (filled region) coincide.
We are able to carry out a cascade study, continuing to consider three unsaturated units and so on. Ensuring the existence of a nonempty parameter region in which y is a fixed point of dynamics 1.4 becomes more and more difficult when the number of unsaturated units is larger and larger since it requires all parallel lines corresponding to unsaturated units to intersect. In (2) general, two parallel lines k1 + c˜(y)k2 = d(2) 2 (y) and k1 + c˜(y)k2 = d1 (y) do not coincide. Hence we stop here and believe that the general outcomes of dynamics 1.4 are all-saturated and all-but-one-saturated attractors. In conclusion, for dynamics 1.4 the full spectrum of its outcomes is summarized in Table 1. 4 Applications In this section we apply our theorems in the previous section to the models: Linsker’s model, the Hopfield network, and a continuous time model. 4.1 Linsker’s Model. Linsker’s model (Feng et al., 1996; Linsker, 1986; MacKay & Miller, 1990) resembles the visual system, with an input feed-
200
Jianfeng Feng and David Brown
Table 1: The General Parameter Region. Attractor Type
Parameter Region
y: all-saturated
A band determined by d2 (y) > k1 + c(y)k2 > d1 (y) (Theorem 2), independent of f A line k1 + c¯(y)k2 = d(1) (y) inside a band d¯1 (y) < k1 + c¯(y)k2 < d¯2 (y) (Theorem 3), dependent on f Intersection of two parallel lines inside a band (Theorem 4), dependent on f
y: all-but-one-saturated
y: all-but-two-saturated
Possibility Most possible
Less possible
Hardly possible
ing onto a number of layers corresponding to the layers of the relay from the retina to the visual cortex. The units of the network are linear and are organized into two-dimensional layers indexed L0 (input), L1 , . . . , and so on. For simplicity of notation, we suppose that each layer has N neurons and has periodic boundary conditions (wrapped up). There are feedforward connections between adjacent layers, with each unit receiving inputs decreasing monotonically with the distance from the neurons belonging to the underlying layer. The dynamics between the nth layer and the (n+1)th layer can be written as (Feng et al., 1996), wi (t + 1) = f wi (t) +
N X
(qij + k2 )rj wj (t) + k1 ,
(4.1)
j=1
where qij is a covariance matrix and wi (t) refers to the synaptic connection between a neuron in the (n + 1)th layer and the ith neuron in the nth layer. It is worthwhile to point out that our formulation of dynamics 4.1 is a generalization (with saturated sigmoidal functions) of the one described in Feng et al. (1996), where the limiter function is used. Theorem 5. w = (wi , i = 1, . . . , N) is a saturated attractor of Linsker’s model defined by dynamics 4.1 if and only if d1 (w) < k1 + c(w)k2 < d2 (w).
(4.2)
w is an all-but-one-saturated fixed point of Linsker’s model if and only if d¯1 (w) < k1 + c¯(w)k2 = d(1) (w) < d¯2 (w)
(4.3)
Fixed-Point Attractor Analysis
201
Our purpose of exploring the parameter space of Linsker’s model is now accomplished. Theorem 5 gives rise to a complete picture of all possible parameter regions in which an all-saturated configuration is an attractor of Linsker’s model, and all possible parameter regions in which an all-but-one saturated configuration is a fixed point. These two kinds of attractor are observed in Linsker’s numerical simulations, and here, for the first time, we give a more complete characterization of the parameter region. Furthermore, we emphasize that Theorem 5 is stated in a more general setting than Linsker’s original model—a dynamical system with a saturated sigmoidal function. Yet we point out one fact that is a simple consequence of our approach. In the parameter space (k1 , k2 ), all-saturated attractors of Linsker’s model are insensitive to the special choice of the limiter function in the development of synaptic connections,2 whereas for an all-but-one-saturated attractor, its parameter region depends on the specific choice of the inputoutput characteristic (see the definition of c¯). Remark 3. We have the following relations (see Linsker, 1986; MacKay & Miller, 1990) for an all-saturated attractor w, k1 d2 (w) d1 (w) − c(w) < < −c(w) + , k2 k2 k2
(4.4)
and for an all-but-one-saturated attractor w d(1) (w) k1 = −¯c(w) + . k2 k2
(4.5)
Biologically (Swindale, 1996), examination of individual ganglion neurons shows different types of receptive fields within the same animal. Using small spots of light to probe the properties of ganglion cell receptive fields, Kuffler (1953) found that these receptive fields vary in size across the retina. In the foveal region of the primate retina, where visual acuity is greatest, the receptive fields are small, with centers that are only a few minutes of arc. At the periphery of the retina, where acuity is low, the fields are larger, with centers of 3 or 5 degrees. And not all ganglion cells have a center-surround receptive organization. For example, a few ganglion cells respond to changes in the overall luminance of the visual field and are important in controlling pupillary reflexes. A fundamental assumption of our investigation here is that some common mechanisms underlie the varied receptive fields and that the different sizes or types of observed receptive fields can be obtained by modifying the parameters that characterize the underlying mechanism. Therefore, a biologically reasonable model explaining the self-organization 2 However, the detailed dynamical behaviors of, say, the structure of the attractive basin of an attractor, do depend on the specific choice of f .
202
Jianfeng Feng and David Brown
of the primary visual system should have the potential to generate a diversity of receptive fields by adjusting only a few parameters. Here we provide a mathematical tool for evaluating one necessary aspect of the tenability of a model that uses a sigmoidal function as its input-output characteristic. 4.2 The Hopfield Model. The properties of the Hopfield model, defined by Pp aij = Tij =
µ µ µ=1 ξi ξj
N
, θ = k1 threshold, h = k2 external field (4.6) µ
for ri = 1, i = 1, . . . , N and given patterns ξ µ = (ξi , i = 1, . . . , N) ∈ {−1, 1}N with the dynamics x(t + 1) = F(x(t) + (A + K2 )(x(t))0 + k1 )
(4.7)
where x(t) = (xi (t), i = 1, . . . , N), xi (t), represents the activity of the ith neuron at time t, have made it an interesting candidate for theoretical studies, models of some brain functions, and technical applications in certain areas of computer development or artificial intelligence. In all cases, one of the first questions that comes to mind is the storage capacity of the model (Amit, 1989; Komlos & Paturi, 1988; MacEliece, Posner, Rodemich, & Venkatesh, 1987; Newman, 1988), namely, the quantity of information that can be stored and effectively retrieved from the network. It is of primary interest to know, for example, how the number of patterns varies with the number of neurons. We refer readers to Pastur, Shcherbina, and Tirozzi (1997) for a complete account of the most recent theoretical, rigorous work on it. Here, in terms of our results in section 3, we briefly revisit the model. In terms of Theorem 2 we have:3 Theorem 6.
x is an attractor of dynamics 4.7 if and only if
d1 (x) < θ + hc(x) < d2 (x).
(4.8)
From Theorem 6 we know that the difference d2 (x) − d1 (x) reflects the stability of a saturated attractor x. If it is negative or equal to zero, x will no longer be a saturated attractor of dynamics 4.7. From Definition 1 we know that hi is the local field of the ith neuron in the Hopfield model, hence the existence of a local field gap for a state x between the neurons in J+ (x) and J− (x) is necessary and sufficient for x to be an attractor of the Hopfield network. From this point of view, we address a definition of the critical capacity of the Hopfield model in terms of the intercept functions d1 and d2 . 3 The stored patterns take values in {−1, 1}N , and this provides the justification for us to concentrate on attractors in {−1, 1}N .
Fixed-Point Attractor Analysis
203
Definition 3. The critical number of stored patterns pc for perfect retrieval of the Hopfield network is pc := inf{p, hd2 (ξ µ )i − hd1 (ξ µ )i = 0 for any µ = 1, . . . , p},
(4.9)
where h·i represents the expectation with respect to the distribution P of ξ µ . Aiming to have a better understanding of our results, we first consider Example 1, a simple case: storage of one pattern ξ 6= (1, . . . , 1), (−1, . . . , −1). Example 1. Tij =
ξi ξj , N
i, j = 1, . . . , N
(4.10)
d1 (ξ ) = −1.
(4.11)
and so d2 (ξ ) = 1,
Hence the parameter region in which ξ is a saturated attractor of the Hopfield is 1 > θ + c(ξ )h > −1.
(4.12)
Furthermore we should note here that if w ∈ {−1, 1}N and w 6= ξ , −ξ , (1, . . . , 1), (−1, . . . , −1), then d2 (w) − d1 (w) < d2 (ξ ) − d1 (ξ ),
(4.13)
namely, ξ is the most stable attractor in the sense that the larger the difference between d2 and d1 , the more stable is the attractor. In order to ensure that equation 4.12 makes sense when N → ∞ let us define h = h/N, and now, according to the law of large numbers equation 4.12, becomes 1 > θ + hζ ih > −1.
(4.14)
P Here ζ = ( N i=1 ξi )/N. Therefore, whether a configuration ξ is an attractor of ¯ is in 1 > θ +hζ ih > the Hopfield model is parameter related: as long as (θ, h) −1, ξ is an attractor of the Hopfield model; otherwise ξ is not an attractor
204
Jianfeng Feng and David Brown
of the model. In the general situation where there are many patterns stored in the network, capacity is parameter related, as shown in Theorem 7. Now we turn to the general situation. It may be the case that synapses in a biological model can hold only a few values, perhaps even as low as two (just a sign) or three (a sign and an absence of connection), although the biological constraint is not very well known (Shinomoto, 1987). With this restriction in mind, we consider the capacity of the Hopfield model for perfect retrieval with the synaptic efficacy defined by √ Ã Pp ξ µ ξ µ ! p µ=1 i j s , (4.15) Tij := √ N p µ
where s is a real function defined on R. As usual we suppose that P(ξi = µ ±1) = 1/2 and ξi , i = 1, . . . , N, µ = 1, . . . , p are idependently identically distributed (i.i.d.) random variables. We will exclusively concentrate on the parameter θ (see equation 4.14, hζ i = 0). Theorem 7. Suppose that s ∈ C2 (R). Then we have pc = N(s0 )2 /(2s2 log N) where s0 = hs0 (ζ )i and s2 = hs2 (ζ )i. If p(2 log N)/N → α, then ξ (1) is an attractor of dynamics 4.7 if and only if q q 0 2 0 (4.16) −s + αs ≤ θ ≤ s − αs2 where ζ is a normally distributed random variable with mean zero and variance one. In particular, when s(u) = u for u ∈ R1 (the standard Hopfield model; see equation 4.15; Hopfield, 1982; Feng & Tirozzi, 1995a) equation 4.16 reduces to √ √ (4.17) −1 + α ≤ θ ≤ 1 − α . Theorem 7 tells us that as long as α < (s¯0 )2 /s¯2 , there is a parameter region of θ in which ξ (1) is an attractor of the Hopfield model; otherwise, the parameter region of θ is empty. The exact relation between the threshold and the capacity, to the best of our knowledge, has not been given before Theorem 7. The proof of Theorem 7 (see Feng & Tirozzi, 1997c for a detailed proof) relies on the distribution of d2 and d1 , which is essentially the distributions of extreme values of a sequence of random variables (see the definition of d2 and d1 ). Fortunately, extreme value theory in statistics is well established. For example we know the following fact (see Leadbetter, Lindgren, & Rootz´en, 1983). Suppose that ζi , i = 1, . . . , are i.i.d. normally distributed random variables with mean zero and variance one. Then we have lim
N→∞
max1≤i≤N ζi p = 1 in Probability. 2 log N
(4.18)
Fixed-Point Attractor Analysis
205
A promising feature of our approach is that we can apply extreme value theory to study the capacity of the Hopfield model and so avoid the spin glass approach. The spin glass approach, after many years of research, still lacks rigor, as recently pointed out by Newman and Stein (1996) (a recent review is Pastur et al., 1997). By introducing an error term for the retrieval of a memory, we are able to reconsider, in conjunction with extreme value theory, the capacity of the Hopfield model (Feng & Tirozzi, 1997d).

4.3 Continuous Time Model. Neural network models involving two states, which are more easily treated theoretically, are often considered oversimplified by both biologists and device designers. Biologists prefer to describe relevant neural activity by firing rates observed in continuous time. Device designers sometimes prefer to think in terms of operational amplifiers. In this article we focus on a discrete time description of computational events. Such an approach applies directly to clocked networks of digital processors. However, our results will also shed light on continuous time systems: as long as the time required for an individual neuron to change its state is much shorter than communication delays and the overall network relaxation time, these individual transitions can be considered quasi-instantaneous. To make everything more precise, we check the relation between a continuous model and dynamics 1.4. A commonly used continuous time model reads

C_i dU_i/dt = Σ_{j=1}^N T_ij f(U_j) − U_i/R_i + I_i.   (4.19)

The right-hand side contains the currents contributing to charging the input capacitance C_i of neuron i by the potential dU_i. The first term represents currents induced by the activity of all the other neurons; the second term is the leakage current due to the transmembrane resistance R_i between the interior and the exterior of the neural cell; and the third term represents input currents from sources outside the network. A sketch of a standard circuit realizing equation 4.19 is contained in Graf et al. (1986). By discretizing the equation with a step size h, we obtain

U_i(t+1) = (h/C_i) [ Σ_{j=1}^N T_ij f(U_j(t)) − U_i(t)/R_i + U_i(t) C_i²/h + I_i ].   (4.20)

Setting C_i²/h = R_i, k_1 = I_i, and a_ij + k_2 + δ_ij = T_ij, we get

U_i(t+1) = (C_i/R_i) Σ_{j=1}^N (a_ij + δ_ij + k_2) f(U_j(t)) + k_1.   (4.21)
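As a concrete illustration (ours, with arbitrary network parameters and f = tanh standing in for a generic sigmoidal function), the following Python sketch iterates the Euler discretization that leads to equation 4.20:

import numpy as np

def euler_step(U, T, C, R, I, f, h):
    # forward Euler step for equation 4.19: U <- U + (h/C) [T f(U) - U/R + I]
    return U + (h / C) * (T @ f(U) - U / R + I)

rng = np.random.default_rng(1)
N = 10
T = rng.standard_normal((N, N)) / np.sqrt(N)   # arbitrary synaptic matrix
C = np.ones(N)                                 # input capacitances C_i
R = np.ones(N)                                 # transmembrane resistances R_i
I = np.zeros(N)                                # external input currents I_i
U = rng.standard_normal(N)
for _ in range(1000):
    U = euler_step(U, T, C, R, I, np.tanh, h=0.01)
print(U)                                       # approximate fixed point of equation 4.19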
By applying the equivalence of the two dynamics (1.4 and 2.2), we can rewrite dynamics 4.21 in the form

U_i(t+1) = f( (C_i/R_i) Σ_{j=1}^N (a_ij + δ_ij + k_2) Ū_j(t) + k_1 ),   (4.22)
an exact form of the dynamics discussed in this article.

5 Competitive Learning

We now turn to competitive learning with a constraint realized by a saturated sigmoidal function. Competitive learning has been shown to produce interesting solutions to many unsupervised learning problems (Hertz et al., 1991). In common with other learning procedures that employ a simple Hebbian-type rule, it is necessary in competitive learning to introduce some form of constraint on the weights to prevent them from growing without bound. Divisive enforcement and subtractive enforcement are discussed in detail in Goodhill and Barrow (1994) and Miller and MacKay (1994). Here we consider constraints realized by a saturated sigmoidal function. Surprisingly, this kind of constraint is easy to understand in a special case, by adapting our idea in section 3.

At time t, suppose we have a network with a single layer of output units (for simplicity) connected to a set of inputs ξ(t) = (ξ_i(t), i = 1, …, M) via connections u^(j)(t) = (u_i^(j)(t), i = 1, …, M) ∈ R^M, where j = 1, …, N indexes the N connection vectors. For each presentation of the input ξ(t+1), we choose a winner, namely the unit j with the smallest distance between its connections and the input, and then update u^(j)(t) according to

u^(j)(t+1) = F(u^(j)(t) + η ξ(t+1)),   t = 1, …,   (5.1)

where η > 0 is the learning rate. Incorporating the procedure of selecting the winner into dynamics 5.1, the time evolution of u(t) is governed by

u^(j)(t+1) = F(u^(j)(t) + η Ī^(j)(u(t), ξ(t+1)) · ξ(t+1)),   t = 1, …,   (5.2)

where Ī^(j)(u(t), ξ(t+1)) = I_{Π^(j)(u(t))}(ξ(t+1)), for I the indicator function and Π^(j)(u(t)) the cell of the Voronoi tessellation associated with u(t) given by

Π^(j)(u(t)) := {x ∈ R^M : ||x − u^(j)(t)|| ≤ ||x − u^(i)(t)||, i ≠ j}.
Remark 4. We restrict ourselves to dynamics 5.2 (simple competitive learning) for simplicity of notation. However, it would be possible to retain a more general framework, that is, to consider the following dynamics, Kohonen feature mapping with parameters (k_1, k_2):

u^(j)(t+1) = F( u^(j)(t) + η Σ_i ( Ī^(i)(u(t), ξ(t+1)) · ξ(t+1) + k_2 ) r(i, j) + k_1 ),   t = 1, 2, …,   (5.3)
where r(i, j) is a neighborhood function (see the definitions of r in dynamics 1.4 [Hertz et al., 1991; Feng & Tirozzi, 1997a]) and k_2 = k_2(1, …, 1).

Theorem 8. For the learning rule defined by equation 5.2, u is a fixed-point attractor of dynamics 5.2 if and only if

N = 2^M and u^(j) ∈ {−1, 1}^M, j = 1, …, N.   (5.4)

The final results of competitive learning with any initial states, according to Theorem 8, fall into two classes: stable learning, fulfilling condition 5.4, and unstable learning, violating the condition. Theorem 8 also indicates that when the dimension of the inputs is high, it is hardly possible for competitive learning to be stable. For example, when M = 100, according to Theorem 8, N = 2^100 connections are required for competitive learning (see equation 5.2) to approach a stable state. To illustrate the conclusions above, we present Example 2, with two-dimensional inputs.

Example 2. The input signal ξ ∈ [−1, 1]² is drawn according to a distribution with density function f(x, y) > 0 for (x, y) ∈ [−1, 1]². The learning rule will produce two essentially different results.

1. Stable learning. Suppose that we have four weight vectors u^(i)(t) = (u_1^(i)(t), u_2^(i)(t)), i = 1, …, 4, with −1 ≤ u_k^(i)(0) ≤ 1, k = 1, 2, i = 1, …, 4, as shown in Figure 3. For concreteness, let u^(1) = (1, 1), u^(2) = (−1, 1), u^(3) = (−1, −1), and u^(4) = (1, −1); then Π^(1)(u) = {(x, y): x > 0, y > 0}, Π^(2)(u) = {(x, y): x < 0, y > 0}, Π^(3)(u) = {(x, y): x < 0, y < 0}, and Π^(4)(u) = {(x, y): x > 0, y < 0}. According to Theorem 8, we conclude⁴

u^(i) = lim_{t→∞} u^(i)(t) ∈ {−1, 1}².   (5.5)
⁴ This conclusion is not completely proved here for an arbitrary initial condition u^(j)(0). For simplicity of notation, we suppose that u^(j)(0) ∈ Π^(j)(u); then equation 5.5 is true (see the definition of ordering time below).
Figure 3: An explanation of stable learning. If there are four connections for two-dimensional inputs, then competitive learning is stable. (The two panels show the weight vectors on [−1, 1]² at t = 0 and at t = ∞.)
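The following self-contained sketch (an illustration we add; parameter values are arbitrary) contrasts the two cases of Example 2 numerically:

import numpy as np

def run(N, steps=20000, eta=0.1, seed=3):
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1, 1, size=(N, 2))            # initial weights in [-1, 1]^2
    for _ in range(steps):
        xi = rng.uniform(-1, 1, size=2)            # density f(x, y) > 0 on [-1, 1]^2
        j = np.argmin(np.linalg.norm(u - xi, axis=1))
        u[j] = np.clip(u[j] + eta * xi, -1.0, 1.0) # winner update through saturation
    return u

print(np.round(run(4), 2))   # N = 4 = 2^M: vectors typically end near the corners
print(np.round(run(2), 2))   # N = 2: no stable configuration is reached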
In competitive learning with subtractive constraints, we know that if the learning rate is a constant, or even tends to zero at an appropriate rate, then competitive learning fails to converge to a stable state (Feng & Tirozzi, 1997a). Here, nevertheless, no matter what the learning rate η is, convergence in equation 5.5 follows easily; this is a hard question for competitive learning with divisive or subtractive constraints.

2. Unstable learning. Now we consider cases where the number of weights is not equal to four. Let us take into account the cases N = 2 and N = 5 only.

• N = 2. The system behavior is no longer stable at all, and the situation is more complex. See Figure 4 for an explanation. We anticipate proving that the Markov process defined by dynamics 5.3 is an ergodic process,⁵ which implies that there is a unique invariant measure µ with µ(A) > 0 for any Borel set A ⊂ R⁴ with Lebesgue measure larger than zero. Roughly speaking, the synaptic connections will take any values with a positive probability, except when u^(1) = u^(2); but the highest probability under µ is attained in any neighborhood containing the extreme points. It would be interesting in the future, at least theoretically, to derive an exact form of the invariant measure (Bouton & Pagès, 1993, 1994; Cottrell & Fort, 1986).

⁵ The freedom of choice of the sigmoidal function makes this assertion possible. For example, we can suppose that f is a contractive mapping, and then results developed in Loskot and Rudnicki (1995) can be applied here.
Figure 4: If there are only two connections for two-dimensional inputs, the learning rule is not stable (indicated by arrows). (The two panels show the configuration at t = 0 and at t = ∞.)
Figure 5: If there are five connections for two-dimensional inputs, the learning rule is not stable (indicated by arrow), and the most probable outcome is shown in this figure. Four weight vectors will tend to represent the four extremes of the inputs. (The two panels show the configuration at t = 0 and at t = ∞.)
• N = 5. The most probable outcome is shown in Figure 5. The final result will not be stable and depends on the choice of learning rate.

Due to the constraint of a saturated sigmoidal function,
stable competitive learning is simpler than learning with a subtractive or divisive constraint. We would expect that this simplicity will facilitate our understanding of the competitive learning rule in a general sense, comparable to what we learn from the simple perceptron learning rule. More precisely, for instance, we can:

1. Understand the effect of overfitting in this algorithm.

2. Estimate learning curves with inputs subject to different distributions.

3. Get an estimate for the ordering time, a stopping time defined (in Example 2) by τ := inf{t: u^(i)(t) ∈ Π^(i)(u), i = 1, …, 4}, in the case of stable learning.

4. Obtain an exact expression for the invariant measure, understanding the metastable states of the learning (Erwin, Obermayer, & Schulten, 1992a, 1992b).

6 Conclusions

We have studied the dynamics of neural network models with saturated sigmoidal functions as their input-output characteristics. A complete picture of the parameter space for all possible outcomes of dynamics 1.4 has been obtained. Under a stated condition we have shown that the possible outcomes of dynamics 1.4 are all-saturated or all-but-one-saturated fixed-point attractors. An exact parameter region is given for all-saturated attractors and all-but-one-saturated fixed points. This study enables us to get a complete picture of the parameter space for Linsker's model, with a saturated sigmoidal function as its input-output characteristic, to revisit the Hopfield model and associate its capacity with the value of the threshold parameter, and to open up a new possibility of rigorously analyzing the Hopfield model.

Competitive learning, feasible for numerical realization and with some very successful applications in both data processing and biological modeling (Goodhill, 1993), is a theoretically difficult topic. In spite of many years of work, we know much less about it than about the Hopfield model (for a recent review of both topics, see Feng & Tirozzi, 1997c; Pastur et al., 1997). Our analysis here may shed more light on the properties of competitive learning: on its invariant measure, ordering time, and metastable states.

In a single theoretical framework, we have managed to treat diverse models in neural networks. The significance of this unified treatment lies in the fact that, in addition to some novel discoveries made in revisiting these models, we have exposed some common mechanisms behind them (for example,
we have understood the physical meaning of d_2 and d_1 from the study of the Hopfield model), which will provide useful guidance in designing and understanding new models, for both learning and retrieving. In Feng and Hadeler (1996), we carried out a detailed investigation, for a dynamical system possessing property P3 listed in section 1, of the uniqueness and existence of fixed-point attractors. In Feng (1995) we took up the case in which property P1 is violated: there is a nonlinear operation given by the matrix (A + K_2)R (like the competitive learning considered in this research), which marks out important new directions in which our approach may grow.

Appendix: Proof of Theorem 2

For a saturated state to be a fixed point, one must have g_i(k_1, k_2, y) = Σ_{j=1}^N (a_ij + k_2) r_j y_j + k_1 ≥ 0 if y_i = 1, and g_i(k_1, k_2, y) ≤ 0 if y_i = −1, i = 1, …, N. Excluding the case g_i(k_1, k_2, y) = 0, a fixed point is guaranteed to be stable by the continuity of g_i. Then a state y ∈ {−1, 1}^N is an attractor of dynamics 1.4 if and only if y_i g_i(k_1, k_2, y) > 0, i = 1, …, N. Hence we have

y_i · ( Σ_{j=1}^N (a_ij + k_2) r_j y_j + k_1 ) > 0,   ∀ i = 1, …, N.   (A.1)
So if i ∈ J_+(y), the above inequality reads

Σ_{j=1}^N (a_ij + k_2) r_j y_j + k_1 > 0,   (A.2)
or equivalently,

k_1 + c(y) k_2 > Σ_{j∈J_+(y)} a_ij r_j − Σ_{j∈J_−(y)} a_ij r_j.   (A.3)
Since the left-hand side of this inequality is independent of i, taking the maximum over i ∈ J_+(y) on both sides yields

k_1 + c(y) k_2 > d_1(y).   (A.4)

Repeating the same argument for i ∈ J_−(y), we arrive at

k_1 + c(y) k_2 < d_2(y).   (A.5)
Acknowledgments

We are grateful to anonymous referees for their useful comments on the manuscript.

References

Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Bouton, C., & Pagès, G. (1993). Self-organization and convergence of the one-dimensional Kohonen algorithm with non uniformly distributed stimuli. Stoch. Proc. Appl., 47, 249–274.
Bouton, C., & Pagès, G. (1994). Convergence in distribution of the one-dimensional Kohonen algorithms when the stimuli are not uniform. Adv. Appl. Prob., 26, 80–103.
Cottrell, M., & Fort, J. C. (1986). Etude d'un algorithme d'auto-organisation. Ann. Inst. H. Poincaré, 23, 1–20.
Erwin, E., Obermayer, K., & Schulten, K. (1992a). Self-organizing maps: Stationary states, metastability and convergence rate. Biol. Cybern., 67, 35–45.
Erwin, E., Obermayer, K., & Schulten, K. (1992b). Self-organizing maps: Ordering, convergence properties and energy functions. Biol. Cybern., 67, 47–55.
Feng, J. (1995). Establishment of topological maps—a model study. Neural Processing Letters, 2, 1–4.
Feng, J. (1997). Lyapunov functions for neural nets with nondifferentiable input-output characteristics. Neural Computation, 9, 45–51.
Feng, J., & Hadeler, K. P. (1996). Qualitative behavior of some simple networks. J. of Phys. A: Math. Gen., 29, 5019–5033.
Feng, J., Pan, H., & Roychowdhury, V. P. (1996). On neurodynamics with limiter function and Linsker's developmental model. Neural Computation, 8, 1003–1019.
Feng, J., & Tirozzi, B. (1995a). The SLLN for the free-energy of the Hopfield and spin glass model. Helvetica Physica Acta, 68, 365–379.
Feng, J., & Tirozzi, B. (1995b). An application of the saturated attractor analysis to three typical models. Lecture Notes in Computer Science, 930, 353–360.
Feng, J., & Tirozzi, B. (1997a). Convergence theorems for Kohonen feature mapping with VLRPs. Computers and Mathematics with Applications, 33, 45–63.
Feng, J., & Tirozzi, B. (1997b). A discrete version of the dynamic link network. Neurocomputing, 14, 91–106.
Feng, J., & Tirozzi, B. (1997c). Convergence of learning processes, stability of attractors and critical capacity of neural networks. In A. Bovier (Ed.), Proceedings of WIAS-Workshop on Mathematics of Spin Systems with Random Interactions. Berlin: Springer-Verlag.
Feng, J., & Tirozzi, B. (1997d). Capacity of the Hopfield model. J. of Phys. A: Math. Gen., 30, 3383–3391.
Goodhill, G. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern., 69, 109–118.
Goodhill, G., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Computation, 6, 255–269.
Graf, H. P., Jackel, L. D., Howard, R. E., Straughn, B., Denker, J. S., Hubbard, W., Tennant, D. M., & Schwartz, D. (1986). VLSI implementation of a neural network memory with several hundreds of neurons. AIP Conf. Proc., 151, 182–186.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Herz, A. V. M., & Marcus, C. M. (1993). Distributed dynamics in neural networks. Phys. Rev. E, 47, 2155–2161.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558.
Hui, S., & Zak, S. H. (1992). Dynamical analysis of the Brain-State-in-a-Box (BSB) neural models. IEEE Transactions on Neural Networks, 3, 86–94.
Komlós, J., & Paturi, R. (1988). Convergence results in an autoassociative memory model. Neural Networks, 1, 239–250.
Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. J. Neurophysiol., 16, 37–68.
Leadbetter, M. R., Lindgren, G., & Rootzén, H. (1983). Extremes and related properties of random sequences and processes. New York: Springer-Verlag.
Linsker, R. (1986). From basic network principle to neural architecture (series). Proc. Natl. Acad. Sci. USA, 83, 7508–7512, 8390–8394, 8779–8783.
Loskot, K., & Rudnicki, R. (1995). Limit theorems for stochastically perturbed dynamical systems. J. Appl. Probab., 32, 459–469.
MacKay, D., & Miller, K. (1990). Analysis of Linsker's application of Hebbian rules to linear networks. Network, 1, 257–297.
McEliece, R. J., Posner, E. C., Rodemich, E. R., & Venkatesh, S. S. (1987). The capacity of the Hopfield associative memory. IEEE Trans. Inform. Theory, 33, 461–482.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated map networks. Phys. Rev. A, 40, 501–504.
Miller, K. D., & MacKay, D. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Newman, C. (1988). Memory capacity and neural network models: Rigorous lower bounds. Neural Networks, 1, 223–238.
Newman, C., & Stein, D. L. (1996). Non-mean-field behavior of realistic spin glasses. Physical Review Letters, 76, 515–518.
Pastur, L., Shcherbina, M., & Tirozzi, B. (1997). Replica symmetric solutions. In A. Bovier (Ed.), Proceedings of WIAS-Workshop on Mathematics of Spin Systems with Random Interactions. New York: Springer-Verlag.
Sejnowski, T. J. (1995). Time for a new neural code? Nature, 376, 21–22.
Shinomoto, S. (1987). A cognitive associative memory. Biol. Cybern., 57, 197–212.
Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network: Computation in Neural Systems, 7, 161–247.

Received October 28, 1996; accepted April 3, 1997.
Communicated by Helge Ritter
GTM: The Generative Topographic Mapping
Christopher M. Bishop, Markus Svensén, Christopher K. I. Williams
Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.
Latent variable models represent the probability density of data in a space of several dimensions in terms of a smaller number of latent, or hidden, variables. A familiar example is factor analysis, which is based on a linear transformation between the latent space and the data space. In this article, we introduce a form of nonlinear latent variable model called the generative topographic mapping, for which the parameters of the model can be determined using the expectation-maximization algorithm. GTM provides a principled alternative to the widely used self-organizing map (SOM) of Kohonen (1982) and overcomes most of the significant limitations of the SOM. We demonstrate the performance of the GTM algorithm on a toy problem and on simulated data from flow diagnostics for a multiphase oil pipeline.

1 Introduction

Many data sets exhibit significant correlations between the variables. One way to capture such structure is to model the distribution of the data in terms of latent, or hidden, variables. A familiar example of this approach is factor analysis, which is based on a linear transformation from latent space to data space. In this article, we show how the latent variable framework can be extended to allow nonlinear transformations while remaining computationally tractable. This leads to the GTM (generative topographic mapping) algorithm, which is based on a constrained mixture of gaussians whose parameters can be optimized using the EM (expectation-maximization) algorithm.

One of the motivations for this work is to provide a principled alternative to the widely used self-organizing map (SOM) algorithm (Kohonen, 1982) in which a set of unlabeled data vectors t_n (n = 1, …, N) in a D-dimensional data space is summarized in terms of a set of reference vectors having a spatial organization corresponding to a (generally) two-dimensional sheet. Although this algorithm has achieved many successes in practical applications, it also suffers from some significant deficiencies, many of which are highlighted in Kohonen (1995). They include the absence of a cost function,
Neural Computation 10, 215–234 (1998)  © 1997 Massachusetts Institute of Technology
the lack of a theoretical basis for choosing learning rate parameter schedules and neighborhood parameters to ensure topographic ordering, the absence of any general proofs of convergence, and the fact that the model does not define a probability density. These problems can all be traced to the heuristic origins of the SOM algorithm.¹ We show that the GTM algorithm overcomes most of the limitations of the SOM while introducing no significant disadvantages.

An important application of latent variable models is to data visualization. Many of the models used in visualization are regarded as defining a projection from the D-dimensional data space onto a two-dimensional visualization space. We shall see that, by contrast, the GTM model is defined in terms of a mapping from the latent space into the data space. For the purposes of data visualization, the mapping is then inverted using Bayes' theorem, giving rise to a posterior distribution in latent space.

2 Latent Variables

The goal of a latent variable model is to find a representation for the distribution p(t) of data in a D-dimensional space t = (t_1, …, t_D) in terms of a number L of latent variables x = (x_1, …, x_L). This is achieved by first considering a function y(x; W), which maps points x in the latent space into corresponding points y(x; W) in the data space. The mapping is governed by a set of parameters W and could consist, for example, of a feedforward neural network, in which case W would represent the weights and biases. We are interested in the situation in which the dimensionality L of the latent variable space is less than the dimensionality D of the data space, since we wish to capture the fact that the data set itself has an intrinsic dimensionality that is less than D. The transformation y(x; W) then maps the latent variable space into an L-dimensional non-Euclidean manifold S embedded within the data space.² This is illustrated schematically for the case of L = 2 and D = 3 in Figure 1.

If we define a probability distribution p(x) on the latent variable space, this will induce a corresponding distribution p(y|W) in the data space. We shall refer to p(x) as the prior distribution of x, for reasons that will become clear shortly. Since L < D, the distribution in t-space would be confined to the L-dimensional manifold and hence would be singular. Since in reality the data will only approximately live on a lower-dimensional manifold, it is appropriate to include a noise model for the t vector. We choose the distribution of t, for given x and W, to be a radially symmetric gaussian

¹ Biological metaphor is sometimes invoked when motivating the SOM procedure. It should be stressed that our goal here is not neurobiological modeling, but rather the development of effective algorithms for data analysis, for which biological realism need not be considered.

² We assume that the matrix of partial derivatives ∂y_i/∂x_k has full column rank.
Figure 1: The nonlinear function y(x; W) defines a manifold S embedded in data space given by the image of the latent variable space under the mapping x → y.
centered on y(x; W) having variance β⁻¹, so that

p(t|x, W, β) = (β/2π)^{D/2} exp{ −(β/2) ||y(x; W) − t||² }.   (2.1)
Note that other models for p(t|x) might also be appropriate, such as a Bernoulli for binary variables (with a sigmoid transformation of y) or a multinomial for mutually exclusive classes (with a softmax, or normalized exponential, transformation of y [Bishop, 1995]), or even combinations of these. The distribution in t-space, for a given value of W, is then obtained by integration over the x-distribution,

p(t|W, β) = ∫ p(t|x, W, β) p(x) dx.   (2.2)

Given a data set D = (t_1, …, t_N) of N data points, we can determine the parameter matrix W, and the inverse variance β, using maximum likelihood. In practice it is convenient to maximize the log likelihood, given by
L(W, β) = ln Π_{n=1}^N p(t_n|W, β).   (2.3)
Once we have specified the prior distribution p(x) and the functional form of the mapping y(x; W), we can in principle determine W and β by maximizing L(W, β).
Figure 2: In order to formulate a latent variable model similar in spirit to the SOM, we consider a prior distribution p(x) consisting of a superposition of delta functions, located at the nodes of a regular grid in latent space. Each node x_i is mapped to a corresponding point y(x_i; W) in data space and forms the center of a corresponding gaussian distribution.
However, the integral over x in equation 2.2 will, in general, be analytically intractable. If we choose y(x; W) to be a linear function of x, and we choose p(x) to be gaussian, then the integral becomes a convolution of two gaussians, which is itself a gaussian. For a noise distribution p(t|x) that is gaussian with a diagonal covariance matrix, we obtain the standard factor analysis model. In the case of the radially symmetric gaussian given by equation 2.1, the model is closely related to principal component analysis, since the maximum likelihood solution for W has columns given by the scaled principal eigenvectors (Tipping & Bishop, 1997). Here we wish to extend this formalism to nonlinear functions y(x; W), and in particular to develop a model similar in spirit to the SOM algorithm. We therefore consider a specific form for p(x) given by a sum of delta functions centered on the nodes of a regular grid in latent space,

p(x) = (1/K) Σ_{i=1}^K δ(x − x_i),   (2.4)
in which case the integral in equation 2.2 can again be performed analytically. Each point x_i is then mapped to a corresponding point y(x_i; W) in data space, which forms the center of a gaussian density function, as illustrated in Figure 2. From equations 2.2 and 2.4, we see that the distribution function
in data space then takes the form

p(t|W, β) = (1/K) Σ_{i=1}^K p(t|x_i, W, β),   (2.5)

and the log likelihood function becomes

L(W, β) = Σ_{n=1}^N ln { (1/K) Σ_{i=1}^K p(t_n|x_i, W, β) }.   (2.6)
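For concreteness, the log likelihood of equation 2.6, with the noise model of equation 2.1, can be evaluated as in the following Python sketch (ours; the log-sum-exp rearrangement is a standard device for numerical stability, not part of the model):

import numpy as np

def gtm_log_likelihood(T, Y, beta):
    # T: N x D data; Y: K x D centers y(x_i; W); returns L(W, beta) of equation 2.6
    N, D = T.shape
    K = Y.shape[0]
    d2 = ((T[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)           # N x K distances
    log_p = 0.5 * D * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * d2  # ln p(t_n | x_i)
    m = log_p.max(axis=1, keepdims=True)          # log-sum-exp guards against underflow
    return float((m[:, 0] + np.log(np.exp(log_p - m).sum(axis=1)) - np.log(K)).sum())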
For the particular noise model p(t|x, W, β) given by equation 2.1, the distribution p(t|W, β) corresponds to a constrained gaussian mixture model (Hinton, Williams, & Revow, 1992), since the centers of the gaussians, given by y(x_i; W), cannot move independently but are related through the function y(x; W). Note that, provided the mapping function y(x; W) is smooth and continuous, the projected points y(x_i; W) will necessarily have a topographic ordering in the sense that any two points x_A and x_B that are close in latent space will map to points y(x_A; W) and y(x_B; W), which are close in data space.

2.1 The EM Algorithm. If we now choose a particular parameterized form for y(x; W), which is a differentiable function of W (for example, a feedforward network with sigmoidal hidden units), then we can use standard techniques for nonlinear optimization, such as conjugate gradients or quasi-Newton methods, to find a weight matrix W*, and an inverse variance β*, which maximize L(W, β). However, our model consists of a mixture distribution, which suggests that we might seek an EM algorithm (Dempster, Laird, & Rubin, 1977; Bishop, 1995). By making a suitable choice of model y(x; W) we will see that the M-step corresponds to the solution of a set of linear equations. In particular we shall choose y(x; W) to be given by a generalized linear regression model of the form

y(x; W) = Wφ(x),   (2.7)
where the elements of φ(x) consist of M fixed basis functions φ_j(x), and W is a D × M matrix. Generalized linear regression models possess the same universal approximation capabilities as multilayer adaptive networks, provided the basis functions φ_j(x) are chosen appropriately. The usual limitation of such models, however, is that the number of basis functions must typically grow exponentially with the dimensionality L of the input space (Bishop, 1995). In our context, this is not a significant problem since the dimensionality is governed by the number of latent variables, which will typically be small. In fact, for data visualization applications, we generally use L = 2.
The maximization of L(W, β) given by equation 2.6 can be regarded as a missing-data problem in which the identity i of the component that generated each data point t_n is unknown. We can formulate the EM algorithm for this model as follows. First, suppose that, at some point in the algorithm, the current weight matrix is given by W_old, and the current inverse noise variance is given by β_old. In the E-step we use W_old and β_old to evaluate the posterior probabilities, or responsibilities, of each gaussian component i for every data point t_n, using Bayes' theorem in the form

R_in(W_old, β_old) = p(x_i|t_n, W_old, β_old)   (2.8)
                  = p(t_n|x_i, W_old, β_old) / Σ_{i′=1}^K p(t_n|x_i′, W_old, β_old).   (2.9)
We now consider the expectation of the complete-data log likelihood in the form

⟨L_comp(W, β)⟩ = Σ_{n=1}^N Σ_{i=1}^K R_in(W_old, β_old) ln p(t_n|x_i, W, β).   (2.10)
Maximizing equation 2.10 with respect to W, and using equations 2.1 and 2.7, we obtain

Σ_{n=1}^N Σ_{i=1}^K R_in(W_old, β_old) { W_new φ(x_i) − t_n } φ^T(x_i) = 0.   (2.11)
This can conveniently be written in matrix notation in the form
Φ^T G_old Φ W_new^T = Φ^T R_old T,   (2.12)
where Φ is a K × M matrix with elements Φ_ij = φ_j(x_i), T is an N × D matrix with elements t_nk, R is a K × N matrix with elements R_in, and G is a K × K diagonal matrix with elements

G_ii = Σ_{n=1}^N R_in(W, β).   (2.13)
We can now solve equation 2.12 for W_new using standard matrix techniques, based on singular value decomposition to allow for possible ill conditioning. Note that the matrix Φ is constant throughout the algorithm and so needs only to be evaluated once at the start. Similarly, maximizing equation 2.10 with respect to β, we obtain the following reestimation formula:

1/β_new = (1/ND) Σ_{n=1}^N Σ_{i=1}^K R_in(W_old, β_old) ||W_new φ(x_i) − t_n||².   (2.14)
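The M-step therefore amounts to a linear solve followed by a variance update, as in the following sketch (ours; np.linalg.solve could be replaced by an SVD-based solver to allow for the ill conditioning mentioned above, and the lam argument anticipates the regularized form of equation 2.16 below):

import numpy as np

def m_step(T, Phi, R, lam=0.0, beta_old=1.0):
    # T: N x D data; Phi: K x M with Phi[i, j] = phi_j(x_i); R: K x N responsibilities
    N, D = T.shape
    M = Phi.shape[1]
    G = np.diag(R.sum(axis=1))                          # diagonal G of equation 2.13
    A = Phi.T @ G @ Phi + (lam / beta_old) * np.eye(M)  # lam = 0 recovers equation 2.12
    W_new = np.linalg.solve(A, Phi.T @ R @ T).T         # D x M weight matrix
    Y = Phi @ W_new.T                                   # updated centers y(x_i; W_new)
    d2 = ((Y[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
    beta_new = N * D / (R * d2).sum()                   # inverse of equation 2.14
    return W_new, beta_new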
The EM algorithm alternates between the E-step, corresponding to the evaluation of the posterior probabilities in equation 2.9, and the M-step, given by the solution of equations 2.12 and 2.14. Jensen's inequality can be used to show that at each iteration of the algorithm, the objective function will increase unless it is already at a (local) maximum, as discussed, for example, in Bishop (1995). Typically the EM algorithm gives satisfactory convergence after a few tens of cycles, particularly since we are primarily interested in convergence of the distribution, and this is often achieved much more rapidly than convergence of the parameters themselves.

If desired, a regularization term can be added to the objective function to control the mapping y(x; W). This can be interpreted as a MAP (maximum a posteriori) estimator corresponding to a choice of prior over the weights W. In the case of a radially symmetric gaussian prior of the form

p(W|λ) = (λ/2π)^{MD/2} exp{ −(λ/2) Σ_{j=1}^D Σ_{k=1}^M w_jk² },   (2.15)
where λ is the regularization coefficient, this leads to a modification of the M-step (equation 2.12) to give

(Φ^T G_old Φ + (λ/β) I) W_new^T = Φ^T R_old T,   (2.16)

where I is the identity matrix.
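Sampling from the prior of equation 2.15 gives a direct picture of how the basis functions control the smoothness of the manifold (compare Figure 3); a minimal sketch (ours, with arbitrary grid sizes and a two-dimensional latent space mapped into D = 3 dimensions):

import numpy as np

def sample_manifold(lam, sigma, grid=20, centers=4, D=3, seed=4):
    rng = np.random.default_rng(seed)
    g = np.linspace(-1.0, 1.0, grid)
    x = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)        # latent grid points
    c = np.linspace(-1.0, 1.0, centers)
    mu = np.stack(np.meshgrid(c, c), axis=-1).reshape(-1, 2)       # basis centers
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))                         # gaussian basis
    W = rng.normal(0.0, 1.0 / np.sqrt(lam), size=(D, Phi.shape[1]))  # W ~ equation 2.15
    return Phi @ W.T                                               # points y(x_i; W)

Y_smooth = sample_manifold(lam=1.0, sigma=1.0)   # wide basis functions: smooth sheet
Y_rough = sample_manifold(lam=1.0, sigma=0.3)    # narrow basis functions: rough sheet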
2.2 Data Visualization. One application for GTM is in data visualization, in which Bayes' theorem is used to invert the transformation from latent space to data space. For the particular choice of prior distribution given by equation 2.4, the posterior distribution is again a sum of delta functions centered at the lattice points, with coefficients given by the responsibilities R_in. These coefficients can be used to provide a visualization of the posterior responsibility map for individual data points in the two-dimensional latent space. If it is desired to visualize a set of data points, then a complete posterior distribution for each data point may provide too much information, and it is often convenient to summarize the posterior by its mean, given for each data point t_n by

⟨x|t_n, W*, β*⟩ = ∫ p(x|t_n, W*, β*) x dx   (2.17)
              = Σ_{i=1}^K R_in x_i.   (2.18)
Keep in mind, however, that the posterior distribution can be multimodal, in which case the posterior mean can give a very misleading summary of
the true distribution. An alternative approach is therefore to evaluate the mode of the distribution, given by

i_max = arg max_i R_in.   (2.19)
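Both summaries are cheap to compute once the responsibilities are available; a minimal sketch (ours):

import numpy as np

def posterior_summaries(R, X):
    # R: K x N responsibilities; X: K x L latent sample points x_i
    mean = R.T @ X                  # equation 2.18: one posterior mean per data point
    mode = X[np.argmax(R, axis=0)]  # equation 2.19: latent point with largest R_in
    return mean, mode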
In practice, it is often convenient to plot both the mean and the mode for each data point, because significant differences between them can be indicative of a multimodal distribution.

2.3 Choice of Model Parameters. The problem of density estimation from a finite data set is fundamentally ill posed, since there exist infinitely many distributions that could have given rise to the observed data. An algorithm for density modeling therefore requires some form of "prior knowledge" in addition to the data set. The assumption that the distribution can be described in terms of a reduced number of latent variables is itself part of this prior. In the GTM algorithm, the prior distribution over mapping functions y(x; W) is governed by the prior over weights W, given, for example, by equation 2.15, as well as by the basis functions. We typically choose the basis functions φ_j(x) to be radially symmetric gaussians whose centers are distributed on a uniform grid in x-space, with a common width parameter σ, whose value, along with the number and spacing of the basis functions, determines the smoothness of the manifold. Examples of surfaces generated by sampling the prior are shown in Figure 3.

In addition to the basis functions φ_j(x), it is also necessary to select the latent space sample points {x_i}. Note that if there are too few sample points in relation to the number of basis functions, then the gaussian mixture centers in data space become relatively independent, and the desired smoothness properties can be lost. Having a large number of sample points, however, causes no difficulty beyond increased computational cost. In particular, there is no overfitting if the number of sample points is increased, since the number of degrees of freedom in the model is controlled by the mapping function y(x; W). One way to view the role of the latent space samples {x_i} is as a Monte Carlo approximation to the integral over x in equation 2.2 (MacKay, 1995; Bishop, Svensén, & Williams, 1996). The choice of the number K and location of the sample points x_i in latent space is not critical, and we typically choose gaussian basis functions and set K so that, in the case of a two-dimensional latent space, O(100) sample points lie within 2σ of the center of each basis function. Note that we have considered the basis function parameters (widths and locations) to be fixed, with a gaussian prior on the weight matrix W. In principle, priors over the basis function parameters could also be introduced, and these could again be treated by maximum a posteriori (MAP) estimation or by Bayesian integration.

We initialize the parameters W so that the GTM model initially approximates principal component analysis (PCA).
Figure 3: Examples of manifolds generated by sampling from the prior distribution over W given by equation 2.15, showing the effect of the choice of basis functions on the smoothness of the manifold. Here the basis functions are gaussian with width σ = 4s in the left-hand plot (where s is the spacing of the basis function centers) and σ = 2s in the right-hand plot. Different values of λ simply affect the linear scaling of the embedded manifold.
To do this, we first evaluate the data covariance matrix and obtain the first and second principal eigenvectors, and then we determine W by minimizing the error function

E = (1/2) Σ_i ||W φ(x_i) − U x_i||²,   (2.20)
where the columns of U are given by the eigenvectors. This represents the sum-of-squares error between the projections of the latent points into data space by the GTM model and the corresponding projections obtained from PCA. The value of β⁻¹ is initialized to be the larger of either the (L + 1)th eigenvalue from PCA (representing the variance of the data away from the PCA plane) or the square of half of the grid spacing of the PCA-projected latent points in data space. Finally, we note that in a numerical implementation, care must be taken over the evaluation of the responsibilities, since this involves computing the exponentials of the distances between the projected latent points and the data points, which may span a significant range of values.

2.4 Summary of the GTM Algorithm. Although the foregoing discussion has been somewhat detailed, the underlying GTM algorithm itself is straightforward and is summarized here for convenience. GTM consists of a constrained mixture of gaussians in which the model parameters are determined by maximum likelihood using the EM algorithm. It is defined by specifying a set of points {x_i} in latent space, together
Figure 4: Results from a toy problem involving data (◦) generated from a one-dimensional curve embedded in two dimensions, together with the projected latent points (+) and their gaussian noise distributions (filled circles). The initial configuration, determined by principal component analysis, is shown on the left, and the converged configuration, obtained after 15 iterations of EM, is shown on the right.
with a set of basis functions {φ_j(x)}. The adaptive parameters W and β define a constrained mixture of gaussians with centers Wφ(x_i) and a common covariance matrix given by β⁻¹I. After initializing W and β, training involves alternating between the E-step, in which the posterior probabilities are evaluated using equation 2.9, and the M-step, in which W and β are reestimated using equations 2.12 and 2.14, respectively. Evaluation of the log likelihood using equation 2.6 at the end of each cycle can be used to monitor convergence.

3 Experimental Results

We now present results from the application of this algorithm first to a toy problem involving data in two dimensions and then to a more realistic problem involving 12-dimensional data arising from diagnostic measurements of oil flows along multiphase pipelines. In both examples, we choose the basis functions φ_j(x) to be radially symmetric gaussians whose centers are distributed on a uniform grid in x-space, with a common width parameter chosen equal to twice the separation of neighboring basis function centers. Results from a toy problem for the case of a two-dimensional data space and a one-dimensional latent space are shown in Figure 4.
Figure 5: The left plot shows the posterior-mean projection of the oil flow data in the latent space of the GTM model; the plot on the right shows the same data set visualized using principal component analysis. In both plots, crosses, circles, and plus signs represent stratified, annular, and homogeneous multiphase configurations, respectively. Note how the nonlinearity of GTM gives an improved separation of the clusters.
3.1 Oil Flow Data. Our second example arises from the problem of determining the fraction of oil in a multiphase pipeline carrying a mixture of oil, water, and gas (Bishop & James, 1993). Each data point consists of 12 measurements taken from dual-energy gamma densitometers measuring the attenuation of gamma beams passing through the pipe. Synthetically generated data are used that model accurately the attenuation processes in the pipe, as well as the presence of noise (arising from photon statistics). The three phases in the pipe (oil, water, and gas) can belong to one of three different geometrical configurations, corresponding to laminar, homogeneous, and annular flows, and the data set consists of 1000 points drawn with equal probability from the three configurations. We take the latent variable space to be two-dimensional, since our goal is data visualization. Figure 5 shows the oil data visualized in the latent variable space in which, for each data point, we have plotted the posterior mean vector. Each point has then been labeled according to its multiphase configuration. For comparison, Figure 5 also shows the corresponding results obtained using PCA.
4 Relation to the Self-Organizing Map

Since one motivation for GTM is to provide a principled alternative to the SOM, it is useful to consider the precise relationship between GTM and SOM. Focusing on the batch versions of both algorithms helps to make the relationship particularly clear.

The batch version of the SOM algorithm (Kohonen, 1995) can be described as follows. A set of K reference vectors z_i is defined in the data space, in which each vector is associated with a node on a regular lattice in a (typically) two-dimensional feature map (analogous to the latent space of GTM). The algorithm begins by initializing the reference vectors (for example, by setting them to random values, setting them equal to a random subset of the data points, or using PCA). Each cycle of the algorithm then proceeds as follows. For every data vector t_n, the corresponding "winning node" j(n) is identified, corresponding to the reference vector z_j having the smallest Euclidean distance ||z_j − t_n||² to t_n. The reference vectors are then updated by setting them equal to weighted averages of the data points, given by

z_i = Σ_n h_{ij(n)} t_n / Σ_n h_{ij(n)},   (4.1)
in which h_ij is a neighborhood function associated with the ith node. This is generally chosen to be a unimodal function of the feature map coordinates centered on the winning node, for example, a gaussian. The steps of identifying the winning nodes and updating the reference vectors are repeated iteratively. A key ingredient in the algorithm is that the width of the neighborhood function h_ij starts with a relatively large value and is gradually reduced after each iteration.

4.1 Kernel versus Linear Regression. As pointed out by Mulier and Cherkassky (1995), the value of the neighborhood function h_ij(n) depends only on the identity of the winning node j and not on the value of the corresponding data vector t_n. We can therefore perform partial sums over the groups G_j of data vectors assigned to each node j, and hence rewrite equation 4.1 in the form

z_i = Σ_j K_ij m_j,   (4.2)
in which m_j is the mean of the vectors in group G_j and is given by

m_j = (1/N_j) Σ_{n∈G_j} t_n,   (4.3)
where N_j is the number of data vectors in group G_j. The result (equation 4.2) is analogous to the Nadaraya-Watson kernel regression formula (Nadaraya, 1964; Watson, 1964), with the kernel functions given by

K_ij = h_ij N_j / Σ_{j′} h_ij′ N_j′.   (4.4)
Thus the batch SOM algorithm replaces the reference vectors at each cycle with a convex combination of the node means m_j, with coefficients determined by the neighborhood function. Note that the kernel coefficients satisfy Σ_j K_ij = 1 for every i.

In the GTM algorithm, the centers y(x_i; W) of the gaussian components can be regarded as analogous to the reference vectors z_i of the SOM. We can evaluate y(x_i; W) by solving the M-step equation (2.12) to find W and then using y(x_i; W) = Wφ(x_i). If we define the weighted means of the data vectors by

µ_i = Σ_n R_in t_n / Σ_n R_in,   (4.5)
then we obtain

y(x_i; W) = Σ_j F_ij µ_j,   (4.6)
where we have introduced the effective kernel F_ij given by

F_ij = φ^T(x_i) (Φ^T G Φ)^{−1} φ(x_j) G_jj.   (4.7)
Note that the effective kernel satisfies Σ_j F_ij = 1. To see this, we first use equation 4.7 to show that Σ_j F_ij φ_l(x_j) = φ_l(x_i). Then if one of the basis functions l corresponds to a bias, so that φ_l(x) = const., the result follows. The solution for y(x_i; W) given by equations 4.6 and 4.7 can be interpreted as a weighted least-squares regression (Mardia, Kent, & Bibby, 1979) in which the target vectors are the µ_i, and the weighting coefficients are given by G_jj. Figure 6 shows an example of the effective kernel for GTM corresponding to the oil flow problem discussed in section 3.

From equations 4.2 and 4.6 we see that both GTM and SOM can be regarded as forms of kernel smoothers. However, there are two key differences. The first is that in SOM, the vectors that are smoothed, defined by equation 4.3, correspond to hard assignments of data points to nodes, whereas the corresponding vectors in GTM, given by equation 4.5, involve soft assignments, weighted by the posterior probabilities.
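To make the contrast concrete, the following sketch (our illustration; array shapes follow section 2) computes the hard node means of equation 4.3 and the soft weighted means of equation 4.5 side by side:

import numpy as np

def som_node_means(T, Z):
    # hard assignments (equation 4.3): each t_n contributes only to its winning node
    win = np.argmin(((T[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2), axis=1)
    m = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        members = T[win == i]
        if len(members):
            m[i] = members.mean(axis=0)
    return m

def gtm_weighted_means(T, R):
    # soft assignments (equation 4.5): every t_n contributes, weighted by R_in
    return (R @ T) / R.sum(axis=1, keepdims=True)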
Figure 6: Example of the effective kernel Fij plotted as a function of the node j for a given node i, for the oil flow data set after three iterations of EM. This kernel function is analogous to the (normalized) neighborhood function in the SOM algorithm.
This is analogous to the distinction between K-means clustering (hard assignments) and fitting a standard gaussian mixture model using EM (soft assignments). The second key difference is that the kernel function in SOM is made to shrink during the course of the algorithm in an arbitrary, handcrafted manner. In GTM, the posterior probability distribution in latent space, for a given data point, forms a localized bubble, and the radius of this bubble shrinks automatically during training, as shown in Figure 7. This responsibility bubble governs the extent to which individual data points contribute toward the vectors µ_i in equation 4.5 and hence toward the updating of the gaussian centers y(x_i; W) via equation 4.6.

4.2 Comparison of GTM with SOM. The most significant difference between the GTM and SOM algorithms is that GTM defines an explicit probability density given by the mixture distribution in equation 2.5. As a consequence there is a well-defined objective function given by the log likelihood (see equation 2.6), and convergence to a (local) maximum of the objective function is guaranteed by the use of the EM algorithm (Dempster et al., 1977). This also provides a direct means to compare different choices of model parameters and even to compare a GTM solution with another density model by evaluating the likelihood of a test set under the generative distributions of the respective models. For the SOM algorithm, however, there is no probability density and no well-defined objective function that is being minimized by the training process.
Figure 7: Examples of the posterior probabilities (responsibilities) R_in of the latent space points at an early stage (left), intermediate stage (center), and late stage (right) during the convergence of the GTM algorithm. These have been evaluated for a single data point from the training set in the oil flow problem discussed in section 3 and are plotted using a nonlinear scaling of the form p(x|t_n)^0.1 to highlight the variation over the latent space. Notice how the responsibility bubble, which governs the updating of the weight matrix, and hence the updating of the data-space vectors y(x_i; W), shrinks automatically during the learning process.
Indeed it has been proved (Erwin, Obermayer, & Schulten, 1992) that such an objective function cannot exist for the SOM. A further limitation of the SOM, highlighted in Kohonen (1995, p. 234), is that the conditions under which so-called self-organization of the SOM occurs have not been quantified, and so in practice it is necessary to confirm empirically that the trained model does indeed have the desired spatial ordering. In contrast, the neighborhood-preserving nature of the GTM mapping is an automatic consequence of the choice of a continuous function y(x; W). Similarly, the smoothness properties of the SOM are determined indirectly by the choice of neighborhood function and by the way in which it is changed during the course of the algorithm, and they are therefore difficult to control. Thus, prior knowledge about the form of the map cannot easily be specified. The prior distribution for GTM, however, can be controlled directly, and properties such as smoothness are governed explicitly by basis function parameters, as illustrated in Figure 3.

Finally, we consider the relative computational costs of the GTM and SOM algorithms. For problems involving data in high-dimensional spaces, the dominant computational cost of GTM arises from the evaluation of the Euclidean distances from every data point to every gaussian center y(x_i; W). Since exactly the same calculations must be done for SOM (involving the distances of data points from the reference vectors µ_i),
we expect one iteration of either algorithm to take approximately the same time. An empirical comparison of the computational cost of GTM and SOM was obtained by running each algorithm on the oil flow data until convergence (defined as no discernible change in the appearance of the visualization map). The GTM algorithm took 1058 sec (40 iterations), while the batch SOM took 1011 sec (25 iterations) using a gaussian neighborhood function. With a simple top-hat neighborhood function, in which each reference vector is updated at each iteration using only data points associated with nearby reference vectors, the CPU time for the SOM algorithm is reduced to 305 sec (25 iterations). One potential advantage of GTM in practical applications arises from a reduction in the number of experimental training runs needed, since both convergence and topographic ordering are guaranteed.

5 Relation to Other Algorithms

Several algorithms in the published literature have close links with GTM. Here we review briefly the most significant of these.

The elastic net algorithm of Durbin and Willshaw (1987) can be viewed as a gaussian mixture density model, fitted by penalized maximum likelihood. The penalty term encourages the centers of gaussians corresponding to neighboring points along the (typically one-dimensional) chain to be close in data space. It differs from GTM in that it does not define a continuous data space manifold. Also, the training algorithm generally involves a handcrafted annealing of the weight penalty coefficient.

There are also similarities between GTM and principal curves and principal surfaces (Hastie & Stuetzle, 1989; LeBlanc & Tibshirani, 1994), which again involve a two-stage algorithm consisting of projection followed by smoothing, although these are not generative models. It is interesting to note that Hastie and Stuetzle (1989) propose reducing the spatial width of the smoothing function during learning, in a manner analogous to the shrinking of the neighborhood function in the SOM. A modified form of the principal curves algorithm (Tibshirani, 1992) introduces a generative distribution based on a mixture of gaussians, with a well-defined likelihood function, and is trained by the EM algorithm. However, the number of gaussian components is equal to the number of data points, and smoothing is imposed by penalizing the likelihood function with the addition of a derivative-based regularization term.

The technique of parameterized self-organizing maps (PSOMs) involves first fitting a standard SOM model to a data set and then finding a manifold in data space that interpolates the reference vectors (Ritter, 1993). Although this defines a continuous manifold, the interpolating surface does not form part of the training algorithm, and the basic problems in using SOM, discussed in section 4.2, remain.

The SOM has also been used for vector quantization. In this context it has been shown how a reformulation of the vector quantization problem
(Luttrell, 1990; Buhmann & Kühnel, 1993; Luttrell, 1994; Luttrell, 1995) can avoid many of the problems with the SOM procedure discussed earlier.

Finally, the density network model of MacKay (1995) involves transforming a simple distribution in latent space to a complex distribution in data space by propagation through a nonlinear network. A discrete distribution in latent space is again used, which is interpreted as an approximate Monte Carlo integration over the latent variables needed to define the data space distribution. GTM can be seen as a particular instance of this framework in which the sampling of latent space is regular rather than stochastic, a specific form of nonlinearity is used, and the model parameters are adapted using EM.

6 Discussion

In this article, we have introduced a form of nonlinear latent variable model that can be trained efficiently using the EM algorithm. Viewed as a topographic mapping algorithm, it has the key property that it defines a probability density model.

As an example of the significance of having a probability density, consider the important practical problem of dealing with missing values in the data set (in which some components of the data vectors t_n are unobserved). If the missing values are missing at random (Little & Rubin, 1987), then the likelihood function is obtained by integrating out the unobserved values. For the GTM model, the integrations can be performed analytically, leading to a simple modification of the EM algorithm.

A further consequence of having a probabilistic approach is that it is straightforward to consider a mixture of GTM models. In this case, the overall density can be written as

p(t) = Σ_r P(r) p(t|r),   (6.1)
where p(t|r) represents the rth model, with its own set of independent parameters, and the P(r) are mixing coefficients satisfying 0 ≤ P(r) ≤ 1 and Σ_r P(r) = 1. Again, it is straightforward to extend the EM algorithm to maximize the corresponding likelihood function.

The GTM algorithm can be extended in other ways, for instance, by allowing independent mixing coefficients π_i (prior probabilities) for each of the gaussian components, which again can be estimated by a straightforward extension of the EM algorithm. Instead of being independent parameters, the π_i can be determined as smooth functions of the latent variables using a normalized exponential applied to a generalized linear regression model, although in this case the M-step of the EM algorithm would involve nonlinear optimization. Similarly, the inverse noise variance β can be generalized to a function of x. An important property of GTM is the existence of a smooth
manifold in data space, which allows the local magnification factor between latent and data space to be evaluated as a function of the latent space coordinates using the techniques of differential geometry (Bishop, Svensén, & Williams, in press). Finally, since there is a well-defined likelihood function, it is straightforward in principle to introduce priors over the model parameters (as discussed in section 2.1) and to use Bayesian techniques in place of maximum likelihood.

Throughout this article, we have focused on the batch version of the GTM algorithm in which all of the training data are used together to update the model parameters. In some applications, it will be more convenient to consider sequential adaptation in which data points are presented one at a time. Since we are minimizing a differentiable cost function, given by equation 2.6, a sequential algorithm can be obtained by appealing to the Robbins-Monro procedure (Robbins & Monro, 1951; Bishop, 1995) to find a zero of the objective function gradient. Alternatively, a sequential form of the EM algorithm can be used (Titterington, Smith, & Makov, 1985).

A Web site for GTM is provided at http://www.ncrg.aston.ac.uk/GTM/, which includes postscript files of relevant papers, a software implementation in Matlab (a C implementation is under development), and example data sets used in the development of the GTM algorithm.

Acknowledgments

This work was supported by EPSRC grant GR/K51808: Neural Networks for Visualization of High-Dimensional Data. We thank Geoffrey Hinton, Iain Strachan, and Michael Tipping for useful discussions. Markus Svensén thanks the staff of the SANS group in Stockholm for their hospitality during part of this project.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Bishop, C. M., & James, G. D. (1993). Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research, A327, 580–593.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (1996). A fast EM algorithm for latent variable density models. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 465–471). Cambridge, MA: MIT Press.
Bishop, C. M., Svensén, M., & Williams, C. K. I. (in press). Magnification factors for the GTM algorithm. In Proceedings of the Fifth IEE International Conference on Artificial Neural Networks (pp. 64–69). Cambridge, U.K.: IEE.
Buhmann, J., & Kühnel, K. (1993). Vector quantization with complexity costs. IEEE Transactions on Information Theory, 39(4), 1133–1145.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B 39(1), 1–38.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem. Nature, 326, 689–691.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biological Cybernetics, 67, 47–55.
Hastie, T., & Stuetzle, W. (1989). Principal curves. Journal of the American Statistical Association, 84(406), 502–516.
Hinton, G. E., Williams, C. K. I., & Revow, M. D. (1992). Adaptive elastic models for hand-printed character recognition. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 512–519). San Mateo, CA: Morgan Kaufmann.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
LeBlanc, M., & Tibshirani, R. (1994). Adaptive principal surfaces. Journal of the American Statistical Association, 89(425), 53–64.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: John Wiley.
Luttrell, S. P. (1990). Derivation of a class of training algorithms. IEEE Transactions on Neural Networks, 1(2), 229–232.
Luttrell, S. P. (1994). A Bayesian analysis of self-organizing maps. Neural Computation, 6(5), 767–794.
Luttrell, S. P. (1995). Using self-organizing maps to classify radar range profiles. In Proc. 5th IEE Conf. on Artificial Neural Networks (pp. 335–340).
MacKay, D. J. C. (1995). Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research, A 354(1), 73–80.
Mardia, K., Kent, J., & Bibby, M. (1979). Multivariate analysis. New York: Academic Press.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7(6), 1165–1177.
Nadaraya, É. A. (1964). On estimating regression. Theory of Probability and Its Applications, 9(1), 141–142.
Ritter, H. (1993). Parameterized self-organizing maps. In Proceedings ICANN'93 International Conference on Artificial Neural Networks, Amsterdam (pp. 568–575). Berlin: Springer-Verlag.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400–407.
Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing, 2, 183–190.
Tipping, M. E., & Bishop, C. M. (1997). Probabilistic principal component analysis (Tech. Rep. NCRG/97/010). Birmingham, U.K.: Neural Computing Research Group, Dept. of Computer Science & Applied Mathematics, Aston University.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26, 359–372.

Received April 17, 1996; accepted April 3, 1997.
Communicated by Yossi Erlich
Identification Criteria and Lower Bounds for Perceptron-Like Learning Rules Michael Schmitt Institute for Theoretical Computer Science, Technische Universität Graz, Graz, Austria
Perceptron-like learning rules are known to require exponentially many correction steps in order to identify boolean threshold functions exactly. We introduce criteria that are weaker than exact identification and investigate whether learning becomes significantly faster if exact identification is replaced by one of these criteria: probably approximately correct (PAC) identification, order identification, and sign identification. PAC identification is based on the learning paradigm introduced by Valiant (1984) and is known to be easier than exact identification. Order identification uses the fact that each threshold function induces an ordering relation on the input variables that can be represented by weights of linear size. Sign identification is based on a property of threshold functions known as unateness and requires only weights of constant size. We show that Perceptron-like learning rules cannot satisfy these criteria when the number of correction steps is to be bounded by a polynomial. We also present an exponential lower bound for order identification with the learning rules introduced by Littlestone (1988). Our results show that efficiency imposes severe restrictions on what can be learned with local learning rules. 1 Introduction Perceptron-like learning rules are among the best-investigated learning algorithms for neural networks. Although they have been known for quite a long time, they have not lost much of their attraction. This may be due to several features that make Perceptron learning appear advantageous—at least when training single-neuron architectures: Perceptron-like rules are easy to implement because the weight updates are based on simple operations (addition or subtraction of a constant value). The rules are local to the extent that the only information required at each weight is the input at this weight and the sign of the error. Further, Perceptron-like rules satisfy the definition of online learning in a very strong sense: The corrections are based on single, misclassified examples, and learning may start from any arbitrary initial weight vector. The latter property makes Perceptron learning tolerate even temporary classification errors or follow changing targets. Finally, the famous Perceptron convergence theorem states that any set of Neural Computation 10, 235–250 (1998)
© 1997 Massachusetts Institute of Technology
training examples that has a solution vector will be classified correctly after a finite number of correction steps. These advantages, however, are contrasted by results concerning the efficiency of learning. Complexity considerations make Perceptron learning appear less favorable. The number of correction steps is known to grow in the worst case exponentially in the number of weights. On the other hand, there are training procedures based on algorithms for linear programming that accomplish learning in polynomial time. Recent results by Blum, Frieze, Kannan, and Vempala (in press) also show that Perceptron-like rules can be used to learn even noisy threshold functions efficiently, provided that the training examples are preprocessed appropriately. Hence, in order to prevent learning from being slow, one has two possibilities so far: to perform a transformation on the examples before applying Perceptron-like rules, or abandon these rules completely and employ the more advanced training procedures from linear programming. As far as we know, there is no single efficient training procedure that includes all the positive characteristics shared by Perceptron-like rules. The use of simpler learning procedures seems possible if one weakens the demands on the learning process. In this article, we explore this possibility by investigating three criteria that do not require that learning generates weights that exactly represent the target function. One is rather content with a result that is close to it. The first criterion, called probably approximately correct (PAC) identification, is based on Valiant’s PAC learning paradigm. Exact identifiability in polynomial time implies PAC identifiability, but the converse is not true. It is known that there are classes of functions that can be PAC identified in polynomial time but require exponential time for exact identification. Therefore, it is a good starting point to investigate if the PAC criterion can be met by Perceptron-like rules. In PAC learning, one assumes that the training examples are drawn according to a fixed but unknown probability distribution. After completion of the learning process, the result is tested on a new example drawn according to the same distribution. It is required that for all distributions with high probability (with regard to the training examples) the probability of an error (with regard to the test example) is low. We will show in section 3 that Perceptron-like rules fail to meet this criterion in polynomial time by exhibiting small sets of training examples that require large weights. With order and sign identification, we avoid the need to use large weights. Order identification requires only the representation of the relationship between the magnitudes of the weights. Each boolean threshold function induces an ordering on the input variables that corresponds to the ordering of the absolute values of its weights. The learning process is successful if it yields a function with the same ordering. Thus, learning is not forced to produce large weights because such an ordering can be represented by weights with an absolute value not larger than the number of weights. Nevertheless,
in section 4 we present exponential lower bounds for order identification with Perceptron-like learning rules. Finally, the criterion of sign identification is based on a monotonicity property of boolean threshold functions. The learning process is required to produce a weight vector with the same signs as the weight vector of the target function. In order for the signs to be unique, however, we have to restrict the possible targets to nondegenerate functions—those that do not allow zero weights. Sign identification can be considered a further weakening compared to order identification because sign identity can be represented by constant weights, say {−1, 1}, whereas order identification may require as many different values as there are weights. However, it turns out that the criterion is not weak enough. In section 5 we show that Perceptron-like learning rules cannot sign identify nondegenerate threshold functions in polynomial time. In section 6 we investigate the learning rules introduced by Littlestone (1988). In contrast to the rules considered so far, they are based on a multiplicative update of the weights, but they have some features in common with Perceptron-like rules. In particular, exponential lower bounds for exact identification are known for Littlestone’s rules as well. We show that they also do not satisfy the criterion of order identification. The basic concepts of this article are introduced in the following section. In it we also review some results concerning McCulloch-Pitts neurons and Perceptron-like learning rules and give references to related work. 2 Preliminaries A McCulloch-Pitts neuron with n inputs is characterized by a vector of n + 1 real numbers consisting of weights w1 , . . . , wn and threshold t (McCulloch & Pitts, 1943). On binary inputs, it computes a boolean threshold function f : {0, 1}n → {0, 1} defined as f (x1 , . . . , xn ) = 1
iff w1 x1 + · · · + wn xn ≥ t.
It is well known that any boolean threshold function can be represented by integer weights and threshold (see, e.g., Muroga, 1971; Minsky & Papert, 1988; Parberry, 1994). In order to investigate the complexity of learning with Perceptron-like rules, we introduce the weight complexity of a threshold function.

Definition 1. The weight complexity of a boolean threshold function is defined as the smallest natural number such that the function can be represented by a weight vector with integer weights and threshold such that each of their absolute values is not larger than this number.

The weight complexity of boolean threshold functions where the truth values are represented as {0, 1} is known to be at most $2^{-(n+1)}(n+1)^{(n+3)/2} + 1/2$ (Schmitt, 1994). For bipolar inputs {−1, 1}, the more succinct bound $2^{-n}(n+1)^{(n+1)/2}$ has been shown (see also Håstad, 1994; Parberry, 1994). Several functions of weight complexity at least $2^{\Omega(n)}$ have been constructed (Muroga, 1971; Minsky & Papert, 1988; Parberry, 1994). By a more involved proof, Håstad (1994) has defined a function that requires weights at least as large as $2^{\Omega(n \log n)}$ (see also Parberry, 1994). Further, it has been shown that a boolean threshold function has weight complexity at least $1.4^n$ on the average (Hampson & Volper, 1986).

To simplify notation, we shall assume from now on that all threshold functions have threshold 0. This can be done without loss of generality by introducing a weight $w_{n+1}$ with value −t and constant input 1.

A Perceptron-like learning rule proposes how to modify a weight vector w that represents a hypothesis function g when given an example ⟨x, f(x)⟩ of a target function f with f(x) ≠ g(x). Let w′ be the vector after the correction step. Then the rule is written as

$$w' = \begin{cases} w + c \cdot x & \text{if } g(x) = 0, \\ w - c \cdot x & \text{if } g(x) = 1. \end{cases}$$

Perceptrons trained by this method were originally known as α-Perceptrons (Rosenblatt, 1958; for a historical perspective, see Nagy, 1991). The parameter c is an arbitrary scalar value greater than zero called the correction increment. The initial weight vector may be chosen arbitrarily. The most popular variant of the Perceptron rule uses w = (0, . . . , 0) as an initialization and c = 1 as a correction increment.

The Perceptron convergence theorem guarantees that any set of training examples taken from a boolean threshold function leads after finitely many correction steps to a weight vector that produces no further errors on the training set. A considerable number of proofs with subtle differences have been published (Novikoff, 1962; Rosenblatt, 1962; Lewis, 1966; Minsky & Papert, 1988; Parberry, 1994). Maass (1994) has derived an explicit upper bound in terms of the number of weights: at most $(n+1)^2 \cdot 2^{(n+1)\log(n+1)}$ corrections may occur. Concerning lower bounds, one immediately concludes that the number of corrections cannot be smaller than the weight complexity. Moreover, from a result of Lewis (1966), it can be derived that for target functions with large weight complexity, the number of correction steps grows almost quadratically with the weight complexity. More precisely, Lewis has shown that for any target function, at least $|\tilde w|^2/(n+1)$ corrections are necessary, where $\tilde w$ is an integer vector representing this function such that $|\tilde w|$ is minimal. Now $|\tilde w|$, the Euclidean length of $\tilde w$, is at least as large as the weight complexity of the function. Thus, given a target function with n inputs and weight complexity ω, at least $\omega^2/(n+1)$ Perceptron correction steps are necessary to identify this function exactly.

The upper and lower bounds already noted are valid for any correction increment c > 0, but they hold only if one chooses the zero vector as initialization.
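For illustration, the correction step can be written in a few lines of Python. The following sketch is not from the original; it assumes the threshold has been folded into an extra weight with constant input 1, as described above, and the function name is an arbitrary choice.

```python
import numpy as np

def perceptron_step(w, x, target, c=1.0):
    """One Perceptron-like correction step with correction increment c.

    The threshold is assumed folded into an extra weight with constant
    input 1, so the hypothesis is g(x) = 1 iff w . x >= 0.
    """
    g = 1 if np.dot(w, x) >= 0 else 0
    if g == target:
        return w                              # no error, no correction
    # g(x) = 0 but f(x) = 1: add c*x;  g(x) = 1 but f(x) = 0: subtract c*x
    x = np.asarray(x, dtype=float)
    return w + c * x if target == 1 else w - c * x

# The popular variant: start from the zero vector with c = 1.
w = np.zeros(4)
w = perceptron_step(w, [1, 0, 1, 1], target=1)
```
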
However, starting from any other fixed-weight vector cannot improve the worst-case behavior of the learning process. Because we are interested in the coarse distinction between polynomial and exponential behavior, we assume throughout the article that Perceptron learning starts from the zero vector.

3 PAC Identification

It is obvious from the lower bounds on the weight complexity that for Perceptron-like learning rules, exact identification is too strong a requirement to be accomplished in polynomial time. As a first step in the search for a weaker criterion that can be met by Perceptron-like rules, we define the so-called PAC identification inspired by the learning paradigm of Valiant (1984). In this model, the training examples are chosen from a target function independently according to an unknown probability distribution. The learning process must with probability at least 1 − δ generate a hypothesis that differs from the target by at most ε, where the error is determined according to the same distribution. A class of target functions is said to be PAC learnable in polynomial time if the number of required examples is bounded by a polynomial in the relevant parameters (in the case of boolean threshold functions in n, 1/ε, and 1/δ) and if the learning algorithm runs in polynomial time (in the case of Perceptron rules, at most polynomially in n, 1/ε, and 1/δ many correction steps).

PAC learning is known to be easier than exact identification. Angluin (1988) has shown that any method for exact identification can be turned into a method for PAC identification under preservation of polynomial time. She has also presented an example of a target class that can be PAC identified in polynomial time but needs exponentially many examples for exact identification in the worst case. On the other hand, it is known that boolean threshold functions are polynomially PAC learnable by methods of linear programming (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989). Therefore, it is worth considering if the algorithms for linear programming, which are quite complicated, can be replaced by the much simpler Perceptron-like rules such that PAC identification can still be achieved in polynomial time.

Definition 2. Let a learning rule for the modification of weight vectors be given. The learning rule is said to PAC identify the class of boolean threshold functions in polynomial time if there exist polynomials p(n, r, s) and q(n, r, s) such that for all n ≥ 1, 0 < ε, δ < 1, and for all probability distributions P on {0, 1}ⁿ the following holds: If p(n, 1/ε, 1/δ) examples are drawn from an arbitrary boolean threshold function f independently according to P, then with probability at least 1 − δ the repeated application of the learning rule yields after at most q(n, 1/ε, 1/δ) corrections the weight vector of a function g such that P[{x: f(x) ≠ g(x)}] ≤ ε.
In the definition we use the polynomial p to bound the number of examples and the polynomial q to bound the number of computation steps. For boolean threshold functions, it is known that a polynomial number of examples is sufficient for PAC learning (Blumer et al., 1989). Thus the existence of p is guaranteed. Consequently, in order to show that PAC identification cannot be achieved by a given algorithm, one has to prove that the number of computation steps cannot be bounded by a polynomial. The main result of this section states that the number of correction steps for Perceptron rules does not decrease substantially if exact identification is replaced by PAC identification. The proof is based on the construction of a specific function that belongs to the class of so-called nested functions introduced by Anthony, Brightwell, and Shawe-Taylor (1995). Definition 3. (Anthony et al., 1995) The class of nested boolean functions is defined recursively over the number n of variables as follows: 1. For n = 1, the functions x and ¬x are nested. 2. For n > 1, f : {0, 1}n → {0, 1} is nested iff there exists a nested function g: {0, 1}n−1 → {0, 1} such that f can be written in the form g ∗ un , where g depends only on the variables x1 , . . . , xn−1 , ∗ ∈ {∧, ∨}, and un ∈ {xn , ¬xn }. Anthony et al. (1995) have shown that any nested boolean function is also a threshold function. Nested functions play an important role in the investigation of the specifiability of boolean threshold functions by examples. Definition 4. (Anthony et al., 1995) Let F be a class of boolean functions with n inputs and let f ∈ F. The specification number of f in F is the cardinality of the smallest set A ⊆ {0, 1}n such that any function g ∈ F that agrees with f on A is equal to f . A major result of Anthony et al. (1995) states that any nested function f : {0, 1}n → {0, 1} has specification number n + 1 in the class of boolean threshold functions with n inputs. We now introduce a particular nested function that has the Fibonacci numbers as weights. Lemma 1. Let n be odd and f : {0, 1}n → {0, 1} be the function represented by the vector (a1 , . . . , an , −an+1 ) where ai denotes the ith Fibonacci number ai = ai−1 + ai−2 ,
$a_1 = a_2 = 1. \qquad (3.1)$

Then f is nested and can be written in the form

$(\cdots(((x_1 \vee x_2) \wedge x_3) \vee x_4) \cdots) \wedge x_n.$
Proof. The proof is by induction on k where n = 2k + 1. The case n = 1 is obvious. Let g: {0, 1}ⁿ⁻² → {0, 1} be the function represented by the vector $(a_1, \ldots, a_{n-2}, -a_{n-1})$. The lemma is proved if we can show that f can be written as $(g \vee x_{n-1}) \wedge x_n$. The Fibonacci series satisfies $a_{n-1} + a_n = a_{n+1}$ and $\sum_{i=1}^{n-1} a_i = a_{n+1} - 1$. Hence, $f(x_1, \ldots, x_n) = 1$

iff $a_1 x_1 + \cdots + a_n x_n \ge a_{n+1}$
iff $a_1 x_1 + \cdots + a_{n-1} x_{n-1} \ge a_{n-1}$ and $x_n = 1$
iff ($a_1 x_1 + \cdots + a_{n-2} x_{n-2} \ge a_{n-1}$ or $x_{n-1} = 1$) and $x_n = 1$
iff ($g(x_1, \ldots, x_{n-2}) = 1$ or $x_{n-1} = 1$) and $x_n = 1$.

Thus, f can be written as claimed and is nested.

As noted by Hampson and Volper (1986), this function has weight complexity at least $\Omega(1.6^n)$, more precisely, $\Omega(\phi^n)$, where φ is the golden ratio $\phi = (\sqrt{5}+1)/2$ (see also Parberry, 1994). A nested function with large weight complexity has also been constructed by Anthony and Shawe-Taylor (1993). They established the slightly weaker lower bound $(\sqrt{3})^{n/2-1}$.

We are now ready for the main result of this section.

Theorem 1. Perceptron-like learning rules cannot PAC identify the class of boolean threshold functions in polynomial time.

Proof. We use a method due to Pitt and Valiant (1988). We choose the distribution and the values for ε and δ such that the algorithm is forced to produce hypotheses that agree with all examples. Let f: {0, 1}ⁿ → {0, 1} be the nested function defined in lemma 1, which has specification number n + 1 according to Anthony et al. (1995). Let A ⊆ {0, 1}ⁿ be the smallest specifying set for f, which implies that A has cardinality n + 1. Assume that a Perceptron-like learning rule PAC identifies the class of boolean threshold functions in polynomial time, and let p and q be polynomials according to definition 2. We define the distribution P on {0, 1}ⁿ by P(x) = 1/(n + 1) if x ∈ A, and 0 otherwise. Further we let ε = 1/(n + 2) and δ = 1/2. We randomly draw p(n, n + 2, 2) examples and apply q(n, n + 2, 2) correction steps. Because P(x) > ε for all x ∈ A, the resulting threshold function g must agree with f on A with probability at least 1 − δ. Thus, because f is nested, g must agree with f completely with the same probability. However, the largest weight produced can have absolute value at most q(n, n + 2, 2). This contradicts the fact that f has weight complexity at least $\Omega(\phi^n)$.
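Lemma 1 is also easy to check numerically for small n. The following Python sketch is an illustration added here, not part of the original; it compares the Fibonacci-weighted threshold function with the nested formula over all inputs.

```python
from itertools import product

def fib_weights(n):
    """Fibonacci numbers a_1, ..., a_{n+1} with a_1 = a_2 = 1 (equation 3.1)."""
    a = [1, 1]
    while len(a) < n + 1:
        a.append(a[-1] + a[-2])
    return a

def f_threshold(x, a):
    """f(x) = 1 iff a_1 x_1 + ... + a_n x_n >= a_{n+1}."""
    n = len(x)
    return int(sum(a[i] * x[i] for i in range(n)) >= a[n])

def f_nested(x):
    """(...(((x_1 v x_2) ^ x_3) v x_4)...) ^ x_n for odd n."""
    v = x[0]
    for i in range(1, len(x)):
        v = (v or x[i]) if i % 2 == 1 else (v and x[i])
    return int(bool(v))

for n in (3, 5, 7):
    a = fib_weights(n)
    assert all(f_threshold(x, a) == f_nested(x)
               for x in product((0, 1), repeat=n))
```
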
4 Order Identification

Theorem 1 shows that not only exact but also PAC identification requires the generation of large weights. Therefore, to be able to learn with Perceptron-like rules in polynomial time, one has to look for identification criteria that avoid this obstacle. In this and the following section, we introduce and analyze two criteria based on equivalence relations on boolean threshold functions. The learning task is to identify the class of the target function by producing a representative of this class. The relation considered in this section is based on an ordering of the input variables and can be represented by weights of size at most n.

Definition 5. Let β: {x_i, x_j} → {0, 1} be a truth assignment of the variables x_i, x_j and f: {0, 1}ⁿ → {0, 1} be a threshold function. Let f_β: {0, 1}ⁿ⁻² → {0, 1} be the function obtained from f by fixing variable x_i to β(x_i) and variable x_j to β(x_j). The complementary assignment $\bar\beta$: {x_i, x_j} → {0, 1} is defined as $\bar\beta(x) = 1 - \beta(x)$. Now let β(x_i) = 1 and β(x_j) = 0. We say that

$$x_i \overset{f}{\succsim} x_j$$

is satisfied if and only if $f_\beta \ge f_{\bar\beta}$ holds (i.e., $f_\beta(x) \ge f_{\bar\beta}(x)$ for all x). The relations

$$x_i \overset{f}{\succ} x_j, \qquad x_i \overset{f}{\approx} x_j$$

are defined correspondingly. Finally, we let $x_i \overset{f}{\approx} x_i$ always.

For any threshold function, the relation defined above provides a total ordering of the variables. This is stated in the following lemma by Muroga (1971). It also establishes a relationship between the ordering of variables and the ordering of weights. We quote it without proof.

Lemma 2. (Muroga, 1971, Theorem 5.1.4) Let $(w_1, \ldots, w_n, -t)$ be a weight vector for the threshold function f: {0, 1}ⁿ → {0, 1}.

(i) The variables of f are totally ordered with respect to $\overset{f}{\succsim}$.

(ii) If $x_i \overset{f}{\succ} x_j$, then $w_i > w_j$.

(iii) $x_i \overset{f}{\approx} x_j$ if and only if $(w_i + w_j)/2$ can be used in place of both $w_i$ and $w_j$ to give weights for f.

(iv) Conversely, if $w_i > w_j$ then $x_i \overset{f}{\succsim} x_j$, and if $w_i = w_j$ then $x_i \overset{f}{\approx} x_j$.
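For small n, the relation of definition 5 can be evaluated by brute force. The sketch below is an added illustration, not part of the original; the function name and the example weights are arbitrary choices.

```python
from itertools import product

def succsim(f, n, i, j):
    """Brute-force test of x_i >=~^f x_j (definition 5): for every assignment
    of the remaining variables, f with (x_i, x_j) = (1, 0) must dominate f
    with the complementary assignment (x_i, x_j) = (0, 1)."""
    others = [k for k in range(n) if k not in (i, j)]
    for rest in product((0, 1), repeat=n - 2):
        x = [0] * n
        for k, v in zip(others, rest):
            x[k] = v
        x_beta, x_bar = list(x), list(x)
        x_beta[i], x_beta[j] = 1, 0        # assignment beta
        x_bar[i], x_bar[j] = 0, 1          # complementary assignment
        if f(x_beta) < f(x_bar):
            return False
    return True

# Example: f(x) = 1 iff 3 x_0 + x_1 + x_2 >= 3; then x_0 dominates x_1.
f = lambda x: int(3 * x[0] + x[1] + x[2] >= 3)
assert succsim(f, 3, 0, 1) and not succsim(f, 3, 1, 0)
```
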
Now we define two functions to be equivalent if the relationships among their weights are identical. This implies that each equivalence class has a representative with small weights. Then we use the equivalence relation to define the criterion of order identification.

Definition 6. Let f, g: {0, 1}ⁿ → {0, 1} be threshold functions. Then f and g have the same order on the variables if for all 1 ≤ i, j ≤ n

$$x_i \overset{f}{\succ} x_j \quad \text{iff} \quad x_i \overset{g}{\succ} x_j.$$
Corollary 1. For each boolean threshold function f, there exists a boolean threshold function g that has the same order on the variables as f and weight complexity at most n.

Definition 7. A learning rule for the modification of weight vectors is said to order identify the class of boolean threshold functions in polynomial time if there exist polynomials p(n) and q(n) such that for all n ≥ 1 and all threshold functions f: {0, 1}ⁿ → {0, 1} the following holds: On any sequence of training examples for f of length at most p(n), the repeated application of the learning rule yields after at most q(n) corrections the weight vector of a function g that has the same order on the variables as f.

The main result of this section shows that order identification requires exponentially many correction steps.

Theorem 2. Perceptron-like learning rules cannot order identify the class of boolean threshold functions in polynomial time.

Proof. Let f: {0, 1}ⁿ⁺² → {0, 1} be a threshold function represented by the weight vector $(1, 2, a_1, a_2, \ldots, a_n, -a_{n+1})$ where n is odd and $a_i$ denotes the ith Fibonacci number as defined in equation 3.1. Thus f is the function from the proof of theorem 1 extended by two additional inputs. First, we show that

$$x_1 \overset{f}{\prec} x_2 \qquad (4.1)$$
holds. Let β: {x_1, x_2} → {0, 1} be defined by β(x_1) = 1, β(x_2) = 0. Then $f_\beta$ is represented by $w = (a_1, a_2, \ldots, a_n, 1 - a_{n+1})$ and $f_{\bar\beta}$ by $\bar w = (a_1, a_2, \ldots, a_n, 2 - a_{n+1})$. Obviously we have $f_\beta \le f_{\bar\beta}$. To show $f_\beta \ne f_{\bar\beta}$, we consider the element x ∈ {0, 1}ⁿ where $x_i = 1$ iff i ∈ {2, . . . , n − 1}. We then have $w_1 x_1 + \cdots + w_n x_n = \bar w_1 x_1 + \cdots + \bar w_n x_n$ and

$$\bar w_1 x_1 + \cdots + \bar w_n x_n = \sum_{i=2}^{n-1} a_i = \left( \sum_{i=1}^{n-1} a_i \right) - 1 = a_{n+1} - 2.$$

Thus $f_\beta(x) = 0$ and $f_{\bar\beta}(x) = 1$ are shown. For the last equality, we have used the property $a_{n+1} = 1 + \sum_{i=1}^{n-1} a_i$ of the Fibonacci series.

Given a polynomial q, assume that a Perceptron-like learning rule order identifies f in at most q(n + 2) correction steps. We choose the set A of cardinality n + 1 from the proof of theorem 1 and append to each element a new first and second component, both containing 0. Adding inputs to a threshold function does not decrease its weight complexity. Hence, the construction of f implies that it has weight complexity at least $\Omega(\phi^n)$. Therefore, we can choose for each sufficiently large n a sequence of training examples from A such that the rule makes q(n + 2) corrections. Starting with the zero vector, the resulting weight vector $(w'_1, \ldots, w'_{n+3})$ satisfies $w'_1 = w'_2 = 0$ because the first two weights are not modified when using examples from A. According to lemma 2(iii), the represented function g satisfies $x_1 \overset{g}{\approx} x_2$. Together with equation 4.1, this implies that f and g have different orders of the variables.

5 Sign Identification

As a means to investigate the power of Perceptron-like learning rules, we finally introduce sign identification. It is weaker than order identification in the sense that the learning process is required just to produce a weight vector that agrees with the target function on the signs of the weights. Thus the result can be represented by weights of constant size (e.g., {−1, 1}). For this criterion to make sense, however, we have to restrict the class of target functions to those that cannot be represented by vectors containing 0s as weights. We call such a function, according to Parberry (1994), nondegenerate. The threshold is excluded from this restriction. A threshold function that has 0 as a weight does not depend on this input; hence the weight can also be an arbitrary small positive or an arbitrary small negative real number. But then the sign of the weight is not unique. The uniqueness of the signs of nondegenerate threshold functions is guaranteed by a property known as unateness.

Definition 8. A boolean function is said to be positive in the variable x_i if it has a representation in disjunctive normal form where x_i does not occur negated. It is called negative in x_i if it has a disjunctive normal form where x_i occurs only negated. A function f is unate if for each 1 ≤ i ≤ n, f is positive or negative in x_i.
Paull and McCluskey (1960) have shown that all threshold functions are unate. For nondegenerate threshold functions, they have also established a relationship between the property of being positive (resp. negative) in x_i and the sign of weight w_i. We quote the result, which can also be found in McNaughton (1961), without a proof.

Lemma 3. (Paull & McCluskey, 1960, theorem 1) Every boolean threshold function is unate. Let f: {0, 1}ⁿ → {0, 1} be nondegenerate and $(w_1, \ldots, w_{n+1})$ a weight vector representing f. Then we have for 1 ≤ i ≤ n: f is positive in x_i iff $w_i > 0$, and f is negative in x_i iff $w_i < 0$.

Unateness implies that the signs of the weights of a nondegenerate threshold function are unique. For sign identification, it is sufficient to achieve agreement of signs.

Definition 9. A learning rule for the modification of weight vectors is said to sign identify the class of nondegenerate boolean threshold functions in polynomial time if there exist polynomials p(n) and q(n) such that for all n ≥ 1 and all threshold functions f: {0, 1}ⁿ → {0, 1} the following holds: On any sequence of training examples for f of length at most p(n), the repeated application of the learning rule yields after at most q(n) corrections the weight vector $(u_1, \ldots, u_{n+1})$ of a function g such that there exists a weight vector $(w_1, \ldots, w_{n+1})$ for f that satisfies

$$u_i \cdot w_i > 0 \quad \text{for } i = 1, \ldots, n. \qquad (5.1)$$
If equation 5.1 holds for a weight vector $(w_1, \ldots, w_{n+1})$, then it holds for all weight vectors representing f due to lemma 3.

Theorem 3. Perceptron-like learning rules cannot sign identify the class of nondegenerate boolean threshold functions in polynomial time.

Proof. As in the proof of theorem 2, we use the principle of leaving certain weights untouched as long as possible. Let f: {0, 1}ⁿ⁺¹ → {0, 1} be a threshold function represented by the weight vector $w = (1, a_1, a_2, \ldots, a_n, -a_{n+1})$ where n is odd and $a_i$ denotes the ith Fibonacci number as in equation 3.1. Thus, f is the function from the proof of theorem 1 with one additional input. First, we show that f is nondegenerate. Let x ∈ {0, 1}ⁿ⁺¹ denote the element where $x_i = 1$ iff i ≤ n, and for j = 1, . . . , n let $x^{(j)}$ denote the element that differs from x exactly in the jth component. Then we have

$$w_1 x_1 + \cdots + w_{n+1} x_{n+1} = 1 + \sum_{i=1}^{n-1} a_i = a_{n+1}.$$
Therefore f(x) = 1. On the other hand, for j = 1, . . . , n,

$$w_1 x_1^{(j)} + \cdots + w_{n+1} x_{n+1}^{(j)} < a_{n+1},$$

which implies $f(x^{(j)}) = 0$. Thus f depends on inputs $x_1, \ldots, x_n$. Further, consider the element y ∈ {0, 1}ⁿ⁺¹ where $y_i = 1$ iff n ≤ i ≤ n + 1, and let z denote the element that differs from y exactly in the component n + 1. We obtain f(y) = 1 and f(z) = 0 in a similar way. Thus f also depends on $x_{n+1}$ and is therefore nondegenerate.

Assume that a Perceptron-like learning rule sign identifies f in at most q(n + 1) correction steps where q is a polynomial. Starting from the zero vector, we give ⟨(1, 0, . . . , 0), 0⟩ as the first example for f. This leads to a correction step resulting in the weight vector (−c, 0, . . . , 0, −c) where c is the correction increment. (Recall that we have constant input 1 for component n + 2 representing the negative of the threshold.) Then we take the set A from the proof of theorem 1 and append to each element 0 as the first component. Because f has weight complexity at least $\Omega(\phi^n)$, we can choose for each sufficiently large n a sequence of training examples from A such that the rule continuing with (−c, 0, . . . , 0, −c) makes q(n + 1) corrections. The resulting weight vector $(u_1, \ldots, u_{n+2})$ obviously satisfies $u_1 = -c$ because the first weight does not change when the extended examples from A are used. Thus we have $u_1 \cdot w_1 = -c < 0$, contradicting equation 5.1.

Although we have restricted sign identification to nondegenerate functions, the learning process was started with the zero vector, which represents the function that is 1 everywhere. The proof can easily be adapted to the case where an initial vector is chosen that represents a nondegenerate function.

6 Identification with Littlestone's Rules

Littlestone (1988) has proposed two learning rules that are quite similar to Perceptron-like rules. Instead of changing the weights additively, they use a multiplicative update. These so-called Winnow algorithms share with Perceptron-like rules the property that the weights are updated only if an error has occurred and, if so, that only weights that received a nonzero input are changed. If the output of the current hypothesis is supposed to be 1, then $w_i$ is changed to $\alpha w_i$ for some fixed constant α > 1. If the output is supposed to be 0, then the two Winnow rules work differently. Winnow1 changes $w_i$ to 0, whereas Winnow2 changes it to $w_i/\alpha$. Thus, the Winnow algorithms are local rules and fit into the model of online learning in the same way as Perceptron-like rules.

The complexity of learning with Winnow rules has been investigated by Maass and Turán (1994). For the more general class of so-called k-bounded learning algorithms, comprising the Winnow rules for a sufficiently large
but fixed k, they have established a lower bound of $2^{\Omega(n/k)}$ for exact identification. For the Winnow rules in particular, this implies that there are target functions and sequences of training examples where they make $2^{\Omega(n)}$ corrections.

Winnow rules leave weights with input 0 unchanged. We can now make use of this fact as we did in the proof of the lower bound for order identification with Perceptron-like learning rules. We extend the training examples by two components containing 0. We also append two inputs to the initial hypothesis and the target function such that their order with respect to these variables is different. Reasoning in the same way as we did in the proof of theorem 2, we obtain a lower bound for order identification with Littlestone's rules.

Theorem 4. Winnow rules cannot order identify the class of boolean threshold functions with nonnegative weights in polynomial time.

Winnow rules do not change the signs of the weights. Therefore, sign identification is out of the question for these rules. However, Littlestone (1988) has presented a transformation that allows learning boolean threshold functions with varying signs. This transformation maps a threshold function with n inputs to a threshold function with 2n inputs and only nonnegative weights in such a way that each weight is replaced by two weights. One of them is used in case of a positive sign, the other in case of a negative sign. Learning a function with negative weights is then reduced to learning this extended function. One can then define the criterion of sign identification also for Winnow rules. The learning task would be to identify the sign of a weight by making the corresponding weight larger than its counterpart. Thus, weights of constant size are sufficient. Using the result of Maass and Turán (1994) and the constructions presented here, one obtains an exponential lower bound also for this kind of sign identification.

7 Concluding Remarks

We have taken a closer look at the capabilities of Perceptron-like learning rules. We have lowered the demand of exact identification, introducing three weaker criteria: PAC identification is a probabilistic criterion that requires finding with high probability hypotheses with small error. Order and sign identification are criteria related to partial knowledge about the representation of the target function. We have shown how these criteria can be used in an online learning task and investigated how many correction steps are required. It turned out that none of the criteria can be met by Perceptron-like learning rules in polynomial time. This raises the question of what Perceptron-like rules can accomplish in polynomial time at all. We mention some results.
Maass and Turán (1994) have considered subclasses of boolean threshold functions that have weights bounded by a polynomial p(n). They have shown that Perceptron-like rules can identify such a class in at most $O(n^2 p^2(n))$ correction steps. Furthermore, they concluded that this bound is optimal for all k-bounded rules—that is, rules where each weight can be modified using one of k possible instructions and where the weight change operations are commutative in each weight. However, as noted in section 2, the average weight complexity of a boolean threshold function is at least $1.4^n$ (Hampson & Volper, 1986). Thus, the polynomial bound on the weights involves a considerable loss of the computational power of single neurons.

The lower bound for PAC identification is based on the distribution independence assumption. Bartlett and Williamson (1991) have shown that for the subclass of so-called λ-reasonable distributions, half-spaces on the surface of the unit hypersphere can be PAC learned in time polynomial in n, 1/ε, 1/δ, and 1/λ. It is not known if this result also holds for boolean threshold functions. Nor is it known whether Littlestone's rules can PAC identify in polynomial time.

Kivinen and Warmuth (1995) have compared Perceptron and Winnow rules when learning k-literal disjunctions. They have constructed training sequences where Perceptron rules make (n − k + 1)/2 corrections, whereas Winnow2 manages with O(k log n) corrections. It would be interesting to know if there is also a class of threshold functions where Perceptron rules perform considerably better than Winnow rules.

A further question concerns the separability of the identification criteria. All criteria considered here are met by polynomial-time algorithms for linear programming, some of which can even be implemented as online learning rules (Maass & Turán, 1994). Of course, these algorithms achieve exact identification, which is the strongest criterion one can imagine. Maybe there are learning rules somewhat more complicated than Perceptron or Winnow rules but simpler than algorithms for linear programming, which may not be able to identify threshold functions exactly but meet one of the weaker criteria.

The latter issue seems to be closely related to the question of locality or distributedness of learning rules. Although there is no clear definition of local, there seems to be a wide agreement that learning rules can be lined up according to their degree of locality.¹ As Maass and Turán (1994) have argued, Perceptron and Winnow rules belong more to the local ones, whereas rules based on linear programming are considered more global. It appears from the results of this article that there is a trade-off between locality and efficiency of learning rules. Giving up locality may increase efficiency; on the
¹ Palm (1991) further distinguishes local two-term rules, like the Hebb rule, from local three-term rules, like the Perceptron rule.
other hand, all known rules that learn fast lack the simplicity and lightness of local learning rules.

Acknowledgment

I thank an anonymous referee for correcting a mistake and for making stimulating comments.

References

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2, 319–342.
Anthony, M., Brightwell, G., & Shawe-Taylor, J. (1995). On specifying boolean functions by labelled examples. Discrete Applied Mathematics, 61, 1–25.
Anthony, M., & Shawe-Taylor, J. (1993). Using the Perceptron algorithm to find consistent hypotheses. Combinatorics, Probability and Computing, 2, 385–387.
Bartlett, P. L., & Williamson, R. C. (1991). Investigating the distribution assumptions in the PAC learning model. In L. G. Valiant & M. K. Warmuth (Eds.), Proceedings of the Fourth Annual Workshop on Computational Learning Theory (pp. 24–32). San Mateo, CA: Morgan Kaufmann.
Blum, A., Frieze, A., Kannan, R., & Vempala, S. (in press). A polynomial-time algorithm for learning noisy linear threshold functions. In Proceedings of the 37th Annual Symposium on Foundations of Computer Science.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36, 929–965.
Hampson, S. E., & Volper, D. J. (1986). Linear function neurons: Structure and training. Biological Cybernetics, 53, 203–217.
Håstad, J. (1994). On the size of weights for threshold gates. SIAM Journal on Discrete Mathematics, 7, 484–492.
Kivinen, J., & Warmuth, M. K. (1995). The Perceptron algorithm vs. Winnow: Linear vs. logarithmic mistake bounds when few input variables are relevant. In Proceedings of the Eighth Annual Conference on Computational Learning Theory (pp. 289–296). New York: ACM Press.
Lewis II, P. M. (1966). A lower bound on the number of corrections required for convergence of the single threshold gate adaptive procedure. IEEE Transactions on Electronic Computers, 15, 933–935.
Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318.
Maass, W. (1994). Perspectives of current research about the complexity of learning on neural nets. In V. Roychowdhury, K.-Y. Siu, & A. Orlitsky (Eds.), Theoretical advances in neural computation and learning (pp. 295–336). Boston: Kluwer.
Maass, W., & Turán, G. (1994). How fast can a threshold gate learn? In S. J. Hanson, G. Drastal, & R. Rivest (Eds.), Computational learning theory and natural learning systems: Constraints and prospects (pp. 381–414). Cambridge, MA: MIT Press.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.
McNaughton, R. (1961). Unate truth functions. IRE Transactions on Electronic Computers, 10, 1–6.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons: An introduction to computational geometry (Exp. ed.). Cambridge, MA: MIT Press.
Muroga, S. (1971). Threshold logic and its applications. New York: Wiley.
Nagy, G. (1991). Neural networks—then and now. IEEE Transactions on Neural Networks, 2, 316–318.
Novikoff, A. (1962). On convergence proofs for Perceptrons. In Symposium on Mathematical Theory of Automata (pp. 615–622). Brooklyn, NY: Polytechnic Institute of Brooklyn.
Palm, G. (1991). Memory capacities of local rules for synaptic modification: A comparative review. Concepts in Neuroscience, 2, 97–128.
Parberry, I. (1994). Circuit complexity and neural networks. Cambridge, MA: MIT Press.
Paull, M. C., & McCluskey, Jr., E. J. (1960). Boolean functions realizable with single threshold devices. Proceedings of the IRE, 48, 1335–1337.
Pitt, L., & Valiant, L. G. (1988). Computational limitations on learning from examples. Journal of the Association for Computing Machinery, 35, 965–984.
Rosenblatt, F. (1958). The Perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Washington, DC: Spartan Books.
Schmitt, M. (1994). On the size of weights for McCulloch-Pitts neurons. In E. R. Caianiello (Ed.), Proceedings of the Sixth Italian Workshop on Neural Nets WIRN VIETRI-93 (pp. 241–246). Singapore: World Scientific.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.

Received March 29, 1996; accepted March 14, 1997.
ARTICLE
Communicated by Steven Nowlan and Erkki Oja
Natural Gradient Works Efficiently in Learning Shun-ichi Amari RIKEN Frontier Research Program, Saitama 351-01, Japan
When a parameter space has a certain underlying structure, the ordinary gradient of a function does not represent its steepest direction, but the natural gradient does. Information geometry is used for calculating the natural gradients in the parameter space of perceptrons, the space of matrices (for blind source separation), and the space of linear dynamical systems (for blind source deconvolution). The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters. This suggests that the plateau phenomenon, which appears in the backpropagation learning algorithm of multilayer perceptrons, might disappear or might not be so serious when the natural gradient is used. An adaptive method of updating the learning rate is proposed and analyzed.

1 Introduction

The stochastic gradient method (Widrow, 1963; Amari, 1967; Tsypkin, 1973; Rumelhart, Hinton, & Williams, 1986) is a popular learning method in the general nonlinear optimization framework. The parameter space is not Euclidean but has a Riemannian metric structure in many cases. In these cases, the ordinary gradient does not give the steepest direction of a target function; rather, the steepest direction is given by the natural (or contravariant) gradient. The Riemannian metric structures are introduced by means of information geometry (Amari, 1985; Murray and Rice, 1993; Amari, 1997a; Amari, Kurata, & Nagaoka, 1992). This article gives the natural gradients explicitly in the case of the space of perceptrons for neural learning, the space of matrices for blind source separation, and the space of linear dynamical systems for blind multichannel source deconvolution. This is an extended version of an earlier article (Amari, 1996), including new results.

How good is natural gradient learning compared to conventional gradient learning? The asymptotic behavior of online natural gradient learning is studied for this purpose. Training examples can be used only once in online learning when they appear. Therefore, the asymptotic performance of online learning cannot be better than the optimal batch procedure where all the examples can be reused again and again. However, we prove that natural gradient online learning gives the Fisher-efficient estimator in the sense
Neural Computation 10, 251–276 (1998)
© 1998 Massachusetts Institute of Technology
of asymptotic statistics when the loss function is differentiable, so that it is asymptotically equivalent to the optimal batch procedure (see also Amari, 1995; Opper, 1996). When the loss function is nondifferentiable, the accuracy of asymptotic online learning is worse than batch learning by a factor of 2 (see, for example, Van den Broeck & Reimann, 1996). It was shown in Amari et al. (1992) that the dynamic behavior of natural gradient in the Boltzmann machine is excellent. It is not easy to calculate the natural gradient explicitly in multilayer perceptrons. However, a preliminary analysis (Yang & Amari, 1997), by using a simple model, shows that the performance of natural gradient learning is remarkably good, and it is sometimes free from being trapped in plateaus, which give rise to slow convergence of the backpropagation learning method (Saad & Solla, 1995). This suggests that the Riemannian structure might eliminate such plateaus or might make them not so serious. Online learning is flexible, because it can track slow fluctuations of the target. Such online dynamics were first analyzed in Amari (1967) and then by many researchers recently. Sompolinsky, Barkai, and Seung (1995), and Barkai, Seung, and Sompolinsky (1995) proposed an adaptive method of adjusting the learning rate (see also Amari, 1967). We generalize their idea and evaluate its performance based on the Riemannian metric of errors. The article is organized as follows. The natural gradient is defined in section 2. Section 3 formulates the natural gradient in various problems of stochastic descent learning. Section 4 gives the statistical analysis of efficiency of online learning, and section 5 is devoted to the problem of adaptive changes in the learning rate. Calculations of the Riemannian metric and explicit forms of the natural gradients are given in sections 6, 7, and 8. 2 Natural Gradient Let S = {w ∈ Rn } be a parameter space on which a function L(w) is defined. When S is a Euclidean space with an orthonormal coordinate system w, the squared length of a small incremental vector dw connecting w and w + dw is given by |dw|2 =
$\sum_{i=1}^{n} (dw_i)^2,$
where dwi are the components of dw. However, when the coordinate system is nonorthonormal, the squared length is given by the quadratic form |dw|2 =
$\sum_{i,j} g_{ij}(w)\, dw_i\, dw_j. \qquad (2.1)$
When S is a curved manifold, there are no orthonormal linear coordinates, and the length of dw is always written as in equation 2.1. Such a space is
a Riemannian space. We show in later sections that parameter spaces of neural networks have the Riemannian character. The n × n matrix $G = (g_{ij})$ is called the Riemannian metric tensor, and it depends in general on w. It reduces to

$$g_{ij}(w) = \delta_{ij} = \begin{cases} 1, & i = j, \\ 0, & i \ne j \end{cases}$$

in the Euclidean orthonormal case, so that G is the unit matrix I in this case. The steepest descent direction of a function L(w) at w is defined by the vector dw that minimizes L(w + dw) where |dw| has a fixed length, that is, under the constraint

$$|dw|^2 = \varepsilon^2 \qquad (2.2)$$

for a sufficiently small constant ε.

Theorem 1. The steepest descent direction of L(w) in a Riemannian space is given by

$$-\tilde\nabla L(w) = -G^{-1}(w)\, \nabla L(w) \qquad (2.3)$$
where $G^{-1} = (g^{ij})$ is the inverse of the metric $G = (g_{ij})$ and ∇L is the conventional gradient,

$$\nabla L(w) = \left( \frac{\partial}{\partial w_1} L(w), \ldots, \frac{\partial}{\partial w_n} L(w) \right)^T,$$

the superscript T denoting the transposition.

Proof. We put $dw = \varepsilon a$, and search for the a that minimizes $L(w + dw) = L(w) + \varepsilon \nabla L(w)^T a$ under the constraint

$$|a|^2 = \sum_{i,j} g_{ij} a_i a_j = 1.$$

By the Lagrangean method, we have

$$\frac{\partial}{\partial a_i} \left\{ \nabla L(w)^T a - \lambda a^T G a \right\} = 0.$$
This gives $\nabla L(w) = 2\lambda G a$ or

$$a = \frac{1}{2\lambda} G^{-1} \nabla L(w),$$
where λ is determined from the constraint. We call

$$\tilde\nabla L(w) = G^{-1} \nabla L(w)$$

the natural gradient of L in the Riemannian space. Thus, $-\tilde\nabla L$ represents the steepest descent direction of L. (If we use the tensorial notation, this is nothing but the contravariant form of −∇L.) When the space is Euclidean and the coordinate system is orthonormal, we have

$$\tilde\nabla L = \nabla L. \qquad (2.4)$$

This suggests the natural gradient descent algorithm of the form

$$w_{t+1} = w_t - \eta_t \tilde\nabla L(w_t), \qquad (2.5)$$
where $\eta_t$ is the learning rate that determines the step size.

3 Natural Gradient Learning

Let us consider an information source that generates a sequence of independent random variables $z_1, z_2, \ldots, z_t, \ldots$, subject to the same probability distribution q(z). The random signals $z_t$ are processed by a processor (like a neural network) that has a set of adjustable parameters w. Let l(z, w) be a loss function when signal z is processed by the processor whose parameter is w. Then the risk function or the average loss is

$$L(w) = E[l(z, w)], \qquad (3.1)$$
where E denotes the expectation with respect to z. Learning is a procedure to search for the optimal w* that minimizes L(w). The stochastic gradient descent learning method can be formulated in general as

$$w_{t+1} = w_t - \eta_t C(w_t)\, \nabla l(z_t, w_t), \qquad (3.2)$$
where $\eta_t$ is a learning rate that may depend on t and C(w) is a suitably chosen positive definite matrix (see Amari, 1967). In the natural gradient online learning method, it is proposed to put C(w) equal to $G^{-1}(w)$ when the Riemannian structure is defined. We give a number of examples to be studied in more detail.

3.1 Statistical Estimation of Probability Density Function. In the case of statistical estimation, we assume a statistical model {p(z, w)}, and the problem is to obtain the probability distribution $p(z, \hat w)$ that approximates the unknown density function q(z) in the best way—that is, to estimate the true w or to obtain the optimal approximation w from the observed data. A typical loss function is

$$l(z, w) = -\log p(z, w). \qquad (3.3)$$
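For illustration, one online step of equation 3.2 with $C(w) = G^{-1}(w)$ can be sketched in Python as follows; grad_l and metric_G are hypothetical user-supplied callables, and solving the linear system avoids forming the inverse metric explicitly.

```python
import numpy as np

def natural_gradient_step(w, z, grad_l, metric_G, eta=0.01):
    """One step of w_{t+1} = w_t - eta_t G(w_t)^{-1} grad l(z_t, w_t).

    grad_l(z, w): gradient of the loss for example z at parameter w.
    metric_G(w):  Riemannian metric (e.g., Fisher information) at w.
    """
    g = grad_l(z, w)
    d = np.linalg.solve(metric_G(w), g)   # solve G d = g instead of inverting G
    return w - eta * d
```
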
The expected loss is then given by

$$L(w) = -E[\log p(z, w)] = E_q\left[ \log \frac{q(z)}{p(z, w)} \right] + H_Z,$$

where $H_Z$ is the entropy of q(z) not depending on w. Hence, minimizing L is equivalent to minimizing the Kullback-Leibler divergence

$$D[q(z) : p(z, w)] = \int q(z) \log \frac{q(z)}{p(z, w)}\, dz \qquad (3.4)$$
of two probability distributions q(z) and p(z, w). When the true distribution q(z) is written as q(z) = p(z, w*), this is equivalent to obtaining the maximum likelihood estimator ŵ. The Riemannian structure of the parameter space of a statistical model is defined by the Fisher information (Rao, 1945; Amari, 1985)

$$g_{ij}(w) = E\left[ \frac{\partial \log p(x, w)}{\partial w_i}\, \frac{\partial \log p(x, w)}{\partial w_j} \right] \qquad (3.5)$$
in the component form. This is the only invariant metric to be given to the statistical model (Chentsov, 1972; Campbell, 1985; Amari, 1985). The learning equation (see equation 3.2) gives a sequential estimator $\hat w_t$.

3.2 Multilayer Neural Network. Let us consider a multilayer feedforward neural network specified by a vector parameter $w = (w_1, \ldots, w_n)^T \in \mathbf{R}^n$. The parameter w is composed of modifiable connection weights and thresholds. When input x is applied, the network processes it and calculates the outputs f(x, w). The input x is subject to an unknown probability distribution q(x). Let us consider a teacher network that, by receiving x, generates the corresponding output y subject to a conditional probability distribution q(y | x). The task is to obtain the optimal w* from examples such that the student network approximates the behavior of the teacher. Let us denote by l(x, w) a loss when input signal x is processed by a network having parameter w. A typical loss is given by

$$l(x, y, w) = \frac{1}{2} |y - f(x, w)|^2, \qquad (3.6)$$
y = f (x, w) + n,
(3.7)
where n is a multivariate gaussian noise with zero mean and unit covariance matrix I. By putting z = (x, y ), which is an input-output pair, the model specifies the probability density of z as ½ ¾ 1 p(z , w) = cq(x) exp − |y − f (x, w)|2 , 2
(3.8)
where c is a normalizing constant and the loss function (see equation 3.6) is rewritten as l(z , w) = const + log q(x) − log p(z , w).
(3.9)
Given a sequence of examples (x1 , y 1 ), . . . , (xt , y t ), . . ., the natural gradient online learning algorithm is written as ˜ xt , y t , wt ). wt+1 = wt − ηt ∇l(
(3.10)
Information geometry (Amari, 1985) shows that the Riemannian structure is given to the parameter space of multilayer networks by the Fisher information matrix, · gij (w) = E
¸ ∂ log p(x, y ; w) ∂p(x, y ; w) . ∂wi ∂wj
(3.11)
We will show how to calculate G = (gij ) and its inverse in a later section. 3.3 Blind Separation of Sources. Let us consider m signal sources that produce m independent signals si (t), i = 1, . . . , m, at discrete times t = 1, 2, . . .. We assume that si (t) are independent at different times and that the
Natural Gradient Works Efficiently in Learning
257
expectations of si are 0. Let r(s) be the joint probability density function of s. Then it is written in the product form r(s) =
m Y
r1 (s1 ).
(3.12)
i=1
Consider the case where we cannot have direct access to the source signals s(t) but we can observe their m instantaneous mixtures x(t),
x(t) = As(t)
(3.13)
or xi (t) =
m X
Aij sj (t),
j=1
where A = (Aij ) is an m×m nonsingular mixing matrix that does not depend on t, and x = (x1 , . . . , xm )T is the observed mixtures. Blind source separation is the problem of recovering the original signals s(t), t = 1, 2, . . . from the observed signals x(t), t = 1, 2, . . . (Jutten & H´erault, 1991). If we know A, this is trivial, because we have
s(t) = A−1 x(t). The “blind” implies that we do not know the mixing matrix A and the probability distribution densities ri (si ). A typical algorithm to solve the problem is to transform x(t) into
y (t) = Wt x(t),
(3.14)
where Wt is an estimate of A−1 . It is modified by the following learning equation: Wt+1 = Wt − ηt F(xt , Wt ).
(3.15)
Here, F(x, W) is a special matrix function satisfying E[F(x, W)] = 0
(3.16)
for any density functions r(s) in equation 3.12 when W = A−1 . For Wt of equation 3.15 to converge to A−1 , equation 3.16 is necessary but not sufficient, because the stability of the equilibrium is not considered here. Let K(W) be an operator that maps a matrix to a matrix. Then ˜ x, W) = K(W)F(x, W) F(
258
Shun-ichi Amari
satisfies equation 3.16 when F does. The equilibrium of F and F˜ is the same, but their stability can be different. However, the natural gradient does not alter the stability of an equilibrium, because G−1 is positive-definite. Let l(x, W) be a loss function whose expectation L(W) = E[l(x, W)] is the target function minimized at W = A−1 . A typical function F is obtained by the gradient of l with respect to W, F(x, W) = ∇l(x, W).
(3.17)
Such an F is also obtained by heuristic arguments. Amari and Cardoso (in press) gave the complete family of F satisfying equation 3.16 and elucidated the statistical efficiency of related algorithms. From the statistical point of view, the problem is to estimate W = A−1 from observed data x(1), . . . , x(t). However, the probability density function of x is written as pX (x; W, r) = |W|r(W x),
(3.18)
which is specified not only by W to be estimated but also by an unknown function r of the form 3.12. Such a statistical model is said to be semiparametric and is a difficult problem to solve (Bickel, Klassen, Ritov, & Wellner, 1993), because it includes an unknown function of infinite degrees of freedom. However, we can apply the information-geometrical theory of estimating functions (Amari & Kawanabe, 1997) to this problem. When F is given by the gradient of a loss function (see equation 3.17), where ∇ is the gradient ∂/∂W with respect to a matrix, the natural gradient is given by ˜ = G−1 ◦ ∇l. ∇l
(3.19)
Here, G is an operator transforming a matrix to a matrix so that it is an m2 ×m2 matrix. G is the metric given to the space Gl(m) of all the nonsingular m × m matrices. We give its explicit form in a later section based on the Lie group structure. The inverse of G is also given explicitly. Another important problem is the stability of the equilibrium of the learning dynamics. This has recently been solved by using the Riemannian structure (Amari, Chen, & Chichocki, in press; see also Cardoso & Laheld, 1996). The superefficiency of some algorithms has been also proved in Amari (1997b) under certain conditions. 3.4 Blind Source Deconvolution. When the original signals s(t) are mixed not only instantaneously but also with past signals as well, the prob-
Natural Gradient Works Efficiently in Learning
259
lem is called blind source deconvolution or equalization. By introducing the time delay operator z−1 , z−1 s(t) = s(t − 1),
(3.20)
we have a mixing matrix filter A denoted by
A(z) =
∞ X
Ak z−k ,
(3.21)
k=0
where Ak are m × m matrices. The observed mixtures are X x(t) = A(z)s(t) = Ak s(t − k).
(3.22)
k
To recover the original independent sources, we use the finite impulse response model
W(z) = Σ_{k=0}^d Wk z−k   (3.23)

of degree d. The original signals are recovered by

y(t) = Wt(z)x(t),   (3.24)

where Wt(z) is adaptively modified by

Wt+1(z) = Wt(z) − ηt ∇l{xt, xt−1, . . . ; Wt(z)}.   (3.25)
Here, l(xt, xt−1, . . . ; W) is a loss function that includes some past signals. We can summarize the past signals into a current state variable in the online learning algorithm. Such a loss function is obtained by the maximum entropy method (Bell & Sejnowski, 1995), independent component analysis (Comon, 1994), or the statistical likelihood method. In order to obtain the natural gradient learning algorithm

Wt+1(z) = Wt(z) − ηt ∇̃l(xt, xt−1, . . . ; Wt),

we need to define the Riemannian metric in the space of all the matrix filters (multiterminal linear systems). Such a study was initiated by Amari (1987). It is possible to define G and to obtain G−1 explicitly (see section 8). A preliminary investigation into the performance of the natural gradient learning algorithm has been undertaken by Douglas, Cichocki, and Amari (1996) and Amari et al. (1997).
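As a concrete illustration of the filtering model, the following minimal NumPy sketch applies a matrix filter of the form 3.21 to a multichannel signal, as in equations 3.22 and 3.24. The function and variable names are illustrative, and the Laplacian source stands in for any nongaussian density.

import numpy as np

def fir_filter(taps, x):
    """Apply a matrix FIR filter W(z) = sum_k W_k z^{-k} to a signal.

    taps: array of shape (d+1, m, m) holding the matrices W_0, ..., W_d.
    x:    array of shape (T, m); x[t] is the m-channel sample at time t.
    Returns y with y[t] = sum_k W_k x[t - k] (zero padding for t < k).
    """
    T, m = x.shape
    y = np.zeros((T, m))
    for k, Wk in enumerate(taps):
        # W_k acts on the signal delayed by k steps.
        y[k:] += x[:T - k] @ Wk.T
    return y

# Illustrative use: mix two sources with an order-2 matrix filter A(z),
# then pass the mixtures through a candidate FIR filter W(z) of degree 4.
rng = np.random.default_rng(0)
s = rng.laplace(size=(1000, 2))      # nongaussian source signals s(t)
A = rng.normal(size=(3, 2, 2))       # A_0, A_1, A_2 of the mixing filter
x = fir_filter(A, s)                 # observed mixtures, as in eq. 3.22
W = rng.normal(size=(5, 2, 2))       # a (so far untrained) W(z), d = 4
y = fir_filter(W, x)                 # recovered signals, as in eq. 3.24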
4 Natural Gradient Gives Fisher-Efficient Online Learning Algorithms

This section studies the accuracy of natural gradient learning from the statistical point of view. A statistical estimator that gives asymptotically the best result is said to be Fisher efficient. We prove that natural gradient learning attains Fisher efficiency. Let us consider multilayer perceptrons as an example. We study the case of a realizable teacher; that is, the behavior of the teacher is given by q(y | x) = p(y | x, w∗). Let DT = {(x1, y1), . . . , (xT, yT)} be T independent input-output examples generated by the teacher network having parameter w∗. Then, minimizing the log loss l(x, y; w) = − log p(x, y; w) over the training data DT yields the estimator ŵT that minimizes the training error

Ltrain(w) = (1/T) Σ_{t=1}^T l(xt, yt; w).   (4.1)
This is equivalent to maximizing the likelihood Π_{t=1}^T p(xt, yt; w). Hence, ŵT is the maximum likelihood estimator. The Cramér-Rao theorem states that the expected squared error of an unbiased estimator satisfies

E[(ŵT − w∗)(ŵT − w∗)ᵀ] ≥ (1/T) G−1,   (4.2)

where the inequality holds in the sense of positive definiteness of matrices. An estimator is said to be efficient or Fisher efficient when it satisfies equation 4.2 with equality for large T. The maximum likelihood estimator is Fisher efficient, implying that it is the best estimator, attaining the Cramér-Rao bound asymptotically:

lim_{T→∞} T E[(ŵT − w∗)(ŵT − w∗)ᵀ] = G−1,   (4.3)

where G−1 is the inverse of the Fisher information matrix G = (gij) defined by equation 3.11. In online learning, examples (x1, y1), (x2, y2), . . . are given one at a time. Let w̃t be the online estimator at time t. At the next time, t + 1, the estimator w̃t is modified to give a new estimator w̃t+1 based on the current observation (xt, yt). The old observations (x1, y1), . . . , (xt−1, yt−1) cannot be reused to obtain w̃t+1, so the learning rule is written as

w̃t+1 = m(xt, yt, w̃t).
The process {w̃t} is Markovian. Whatever learning rule m is chosen, the behavior of the estimator w̃t is never better than that of the optimal batch estimator ŵt because of this restriction. The gradient online learning rule

w̃t+1 = w̃t − ηt C ∂l(xt, yt; w̃t)/∂w
was proposed, where C is a positive-definite matrix, and its dynamical behavior was studied by Amari (1967) when the learning constant ηt = η is fixed. Heskes and Kappen (1991) obtained similar results, which ignited research into online learning. When ηt satisfies some condition, say, ηt = c/t for a positive constant c, stochastic approximation guarantees that w̃t is a consistent estimator converging to w∗. However, it is not Fisher efficient in general. There arises the question of whether there exists a learning rule that gives an efficient estimator. If it exists, the asymptotic behavior of online learning is equivalent to that of the best batch estimation method. This article answers the question affirmatively, by giving an efficient online learning rule (see Amari, 1995; see also Opper, 1996). Let us consider the natural gradient learning rule,

w̃t+1 = w̃t − (1/t) ∇̃l(xt, yt, w̃t).   (4.4)

Theorem 2. Under the learning rule (see equation 4.4), the natural gradient online estimator w̃t is Fisher efficient.

Proof.
Let us denote the covariance matrix of the estimator w̃t+1 by

Ṽt+1 = E[(w̃t+1 − w∗)(w̃t+1 − w∗)ᵀ].   (4.5)

This is the expectation of the squared error. We expand

∂l(xt, yt; w̃t)/∂w = ∂l(xt, yt; w∗)/∂w + (∂²l(xt, yt; w∗)/∂w∂w)(w̃t − w∗) + O(|w̃t − w∗|²).

By subtracting w∗ from both sides of equation 4.4 and taking the expectation of the square of both sides, we have

Ṽt+1 = Ṽt − (2/t) Ṽt + (1/t²) G−1 + O(1/t³),   (4.6)

where we used

E[∂l(xt, yt; w∗)/∂w] = 0,   (4.7)
E[∂²l(xt, yt; w∗)/∂w∂w] = G(w∗),
G(w̃t) = G(w∗) + O(1/t),   (4.8)

because w̃t converges to w∗ as guaranteed by stochastic approximation under certain conditions (see Kushner & Clark, 1978). The solution of equation 4.6 is written asymptotically as

Ṽt = (1/t) G−1 + O(1/t²),

proving the theorem.

The theory can be extended to the unrealizable teacher case, where

K(w) = E[∂²l(x, y; w)/∂w∂w]   (4.9)

should be used instead of G(w) in order to obtain the same efficient result as the optimal batch procedure. This is locally equivalent to the Newton-Raphson method. The results can be stated in terms of the generalization error instead of the covariance of the estimator, and we can obtain more universal results (see Amari, 1993; Amari & Murata, 1993).

Remark. In the cases of blind source separation and deconvolution, the models are semiparametric, including the unknown function r (see equation 3.18). In such cases, the Cramér-Rao bound does not necessarily hold, and therefore Theorem 2 does not hold in these cases. It does hold when we can estimate the true r of the source probability density functions and use it to define the loss function l(x, W); otherwise equation 4.8 does not hold. The stability of the true solution is not necessarily guaranteed either. Amari, Chen, and Cichocki (in press) have analyzed this situation and proposed a universal method of attaining the stability of the equilibrium solution.

5 Adaptive Learning Constant

The dynamical behavior of the learning rule (see equation 3.2) was studied in Amari (1967) when ηt is a small constant η. In this case, wt fluctuates around the (local) optimal value w∗ for large t. The expected value and variance of wt were studied, and the trade-off between convergence speed and accuracy of convergence was demonstrated. When the current wt is far from the optimal w∗, it is desirable to use a relatively large η to accelerate the convergence. When it is close to w∗, a
small η is preferred in order to eliminate fluctuations. The idea of an adaptive change of η was discussed in Amari (1967) and was called “learning of learning rules.” Sompolinsky et al. (1995) (see also Barkai et al., 1995) proposed a rule for adaptively changing ηt that is applicable to the pattern classification problem, where the expected loss L(w) is not differentiable at w∗. This article generalizes their idea to the more general case where L(w) is differentiable and analyzes its behavior by using the Riemannian structure. We propose the following learning scheme:

wt+1 = wt − ηt ∇̃l(xt, yt; wt),   (5.1)
ηt+1 = ηt exp{α[βl(xt, yt; wt) − ηt]},   (5.2)

where α and β are constants. We also assume that the training data are generated by a realizable deterministic teacher and that L(w∗) = 0 holds at the optimal value. (See Murata, Müller, Ziehe, and Amari (1996) for a more general case.) For the sake of simplicity, we analyze the dynamical behavior of learning by using the continuous version of the algorithm,

dwt/dt = −ηt G−1(wt) ∂l(xt, yt; wt)/∂w,   (5.3)
dηt/dt = αηt [βl(xt, yt; wt) − ηt].   (5.4)
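In discrete time, the scheme of equations 5.1 and 5.2 is a two-line update. The following sketch is a minimal rendering of it, assuming the caller supplies the per-example loss and its natural gradient; the function names and the values of α and β are illustrative placeholders.

import numpy as np

def adaptive_step(w, eta, x, y, nat_grad, loss, alpha=0.1, beta=1.0):
    """One step of the adaptive-learning-constant scheme (eqs. 5.1-5.2).

    nat_grad(x, y, w): natural gradient of the per-example loss at w.
    loss(x, y, w):     the per-example loss l(x, y; w) itself.
    alpha, beta:       the constants of eq. 5.2 (illustrative values).
    """
    w_next = w - eta * nat_grad(x, y, w)
    # eta grows when the loss is large relative to eta and shrinks near
    # the optimum, where l is close to L(w*) = 0 for a realizable teacher.
    eta_next = eta * np.exp(alpha * (beta * loss(x, y, w) - eta))
    return w_next, eta_next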
In order to show the dynamical behavior of (wt, ηt), we use the averaged version of equations 5.3 and 5.4 with respect to the current input-output pair (xt, yt). The averaged learning equations (Amari, 1967, 1977) are written as

dwt/dt = −ηt G−1(wt) ⟨∂l(x, y; wt)/∂w⟩,   (5.5)
dηt/dt = αηt {β⟨l(x, y; wt)⟩ − ηt},   (5.6)

where ⟨ ⟩ denotes the average over the current (x, y). We also use the asymptotic evaluations

⟨∂l(x, y; wt)/∂w⟩ = ⟨∂l(x, y; w∗)/∂w⟩ + ⟨∂²l(x, y; w∗)/∂w∂w⟩(wt − w∗) = G∗(wt − w∗),

⟨l(x, y; wt)⟩ = (1/2)(wt − w∗)ᵀ G∗ (wt − w∗),

where G∗ = G(w∗) and we used L(w∗) = 0. We then have

dwt/dt = −ηt (wt − w∗),   (5.7)
dηt/dt = αηt {(β/2)(wt − w∗)ᵀ G∗ (wt − w∗) − ηt}.   (5.8)
Now we introduce the squared error variable,

et = (1/2)(wt − w∗)ᵀ G∗ (wt − w∗),   (5.9)
where et is the Riemannian magnitude of wt − w∗. It is easy to show that

det/dt = −2ηt et,   (5.10)
dηt/dt = αβηt et − αηt².   (5.11)

The behavior of equations 5.10 and 5.11 is interesting. The origin (0, 0) is their attractor. However, the basin of attraction has a boundary of fractal structure. Starting from an adequate initial value, the system has a solution of the form

et = a/t,   ηt = b/t.

The coefficients a and b are determined from

a = 2ab,
b = −αβab + αb²,

which gives

b = 1/2,   a = (1/β)(1/2 − 1/α),   α > 2.

This proves the 1/t convergence rate of the generalization error, that is, the optimal order for any estimator ŵt converging to w∗. The adaptive ηt shows a nice characteristic when the target teacher is slowly fluctuating or changes suddenly.

6 Natural Gradient in the Space of Perceptrons

The Riemannian metric and its inverse are calculated in this section to obtain the natural gradient explicitly. We begin with an analog simple perceptron whose input-output behavior is given by

y = f(w · x) + n,   (6.1)
where n is gaussian noise subject to N(0, σ²) and

f(u) = (1 − e^{−u}) / (1 + e^{−u}).   (6.2)

The conditional probability density of y when x is applied is

p(y | x; w) = (1/(√(2π)σ)) exp{−[y − f(w · x)]² / (2σ²)}.   (6.3)
The distribution q(x) of inputs x is assumed to be the normal distribution N(0, I). The joint distribution of (x, y) is p(y, x; w) = q(x)p(y | x; w). In order to calculate the metric G of equation 3.11 explicitly, let us put

w² = |w|² = Σi wi²,   (6.4)
where |w| is the Euclidean norm. We then have the following theorem.

Theorem 3. The Fisher information metric is

G(w) = w² c1(w) I + {c2(w) − c1(w)} wwᵀ,   (6.5)

where c1(w) and c2(w) are given by

c1(w) = (1/(4√(2π)σ²w²)) ∫ {f²(wε) − 1}² exp(−ε²/2) dε,
c2(w) = (1/(4√(2π)σ²w²)) ∫ {f²(wε) − 1}² ε² exp(−ε²/2) dε.

Proof.
We have

log p(y, x; w) = log q(x) − log(√(2π)σ) − [y − f(w · x)]² / (2σ²).

Hence,

∂ log p(y, x; w)/∂wi = (1/σ²){y − f(w · x)} f′(w · x) xi = (1/σ²) n f′(w · x) xi.
The Fisher information matrix is given by

gij(w) = E[(∂ log p/∂wi)(∂ log p/∂wj)] = (1/σ²) E[{f′(w · x)}² xi xj],

where E[n²] = σ² is taken into account. This can be written, in vector-matrix form, as

G(w) = (1/σ²) E[(f′)² xxᵀ].

In order to show equation 6.5, we calculate the quadratic form rᵀG(w)r for arbitrary r. When r = w,

wᵀGw = (1/σ²) E[{f′(w · x)}² (w · x)²].
Since u = w · x is subject to N(0, w²), we put u = wε, where ε is subject to N(0, 1). Noting that

f′(u) = (1/2){1 − f²(u)},

we have

wᵀG(w)w = (w²/(4√(2π)σ²)) ∫ ε² {f²(wε) − 1}² exp(−ε²/2) dε,
which confirms equation 6.5 when r = w. We next put r = v , where v is an arbitrary unit vector orthogonal to w (in the Euclidean sense). We then have
vᵀG(w)v = (1/(4σ²)) E[{f²(w · x) − 1}² (v · x)²].

Since u = w · x and v = v · x are independent, and v is subject to N(0, 1), we have

vᵀG(w)v = (1/(4σ²)) E[(v · x)²] E[{f²(w · x) − 1}²] = (1/(4√(2π)σ²)) ∫ {f²(wε) − 1}² exp(−ε²/2) dε.
Since G(w) in equation 6.5 is determined by the quadratic forms for n independent vectors (w and the v's), this proves equation 6.5.

To obtain the natural gradient, it is necessary to have an explicit form of G−1. We can calculate G−1(w) explicitly in the perceptron case.

Theorem 4. The inverse of the Fisher information metric is

G−1(w) = (1/(w² c1(w))) I + (1/w⁴)(1/c2(w) − 1/c1(w)) wwᵀ.   (6.6)

This can easily be proved by direct calculation of GG−1. The natural gradient learning equation (3.10) is then given by

wt+1 = wt + ηt {yt − f(wt · xt)} f′(wt · xt) [ (1/(wt² c1(wt))) xt + (1/wt⁴)(1/c2(wt) − 1/c1(wt)) (wt · xt) wt ].   (6.7)
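The update of equation 6.7 is directly implementable once c1(w) and c2(w) are evaluated. The sketch below computes the two integrals by Gauss-Hermite quadrature, a numerical choice made here rather than prescribed by the text, and uses the identity f(u) = tanh(u/2) for equation 6.2.

import numpy as np
from numpy.polynomial.hermite import hermgauss

nodes, weights = hermgauss(80)          # quadrature for weight exp(-t^2)

def f(u):                               # eq. 6.2; equal to tanh(u/2)
    return np.tanh(0.5 * u)

def gauss_int(g):
    """Integral of g(eps) * exp(-eps^2 / 2) d eps, via Gauss-Hermite."""
    return np.sqrt(2.0) * np.sum(weights * g(np.sqrt(2.0) * nodes))

def c1_c2(w, sigma):
    """The coefficients c1(w) and c2(w) of Theorem 3."""
    wn = np.linalg.norm(w)
    pref = 1.0 / (4.0 * np.sqrt(2.0 * np.pi) * sigma**2 * wn**2)
    c1 = pref * gauss_int(lambda e: (f(wn * e)**2 - 1.0)**2)
    c2 = pref * gauss_int(lambda e: (f(wn * e)**2 - 1.0)**2 * e**2)
    return c1, c2

def natural_gradient_step(w, x, y, eta, sigma=1.0):
    """One online update of eq. 6.7 for the analog simple perceptron."""
    wn2 = np.dot(w, w)
    c1, c2 = c1_c2(w, sigma)
    u = np.dot(w, x)
    err = (y - f(u)) * 0.5 * (1.0 - f(u)**2)     # {y - f(u)} f'(u)
    # G^{-1} x, using the explicit inverse metric of eq. 6.6.
    ginv_x = x / (wn2 * c1) + (1.0 / c2 - 1.0 / c1) * u * w / wn2**2
    return w + eta * err * ginv_x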
We now show some other geometrical characteristics of the parameter space of perceptrons. The volume Vn of the manifold of simple perceptrons is measured by

Vn = ∫ √|G(w)| dw,   (6.8)

where |G(w)| is the determinant of G = (gij), which represents the volume density induced by the Riemannian metric. It is interesting to see that the manifold of perceptrons has a finite volume. Bayesian statistics considers that w is randomly chosen subject to a prior distribution π(w). A choice of π(w) is the Jeffreys (noninformative) prior given by

π(w) = (1/Vn) √|G(w)|.   (6.9)

The Jeffreys prior is calculated as follows.

Theorem 5. The Jeffreys prior and the volume of the manifold are given, respectively, by

√|G(w)| = wⁿ √(c2(w){c1(w)}^{n−1}),   (6.10)
Vn = an−1 ∫ √(c2(w){c1(w)}^{n−1}) wⁿ dw,   (6.11)

where an−1 is the area of the unit (n − 1)-sphere.
The Fisher metric G can also be calculated for multilayer perceptrons. Let us consider a multilayer perceptron having m hidden units with sigmoidal activation functions and a linear output unit. The input-output relation is

y = Σi vi f(wi · x) + n,

or the conditional probability is

p(y | x; v, w1, . . . , wm) = c exp[−(1/2){y − Σi vi f(wi · x)}²].   (6.12)

The total parameter w consists of {v, w1, . . . , wm}. Let us calculate the Fisher information matrix G. It consists of m + 1 blocks corresponding to these wi's and v. From

∂ log p(y | x; w)/∂wi = n vi f′(wi · x) x,

we easily obtain the block submatrix corresponding to wi as

E[(∂ log p/∂wi)(∂ log p/∂wi)ᵀ] = (1/σ⁴) E[n²] vi² E[{f′(wi · x)}² xxᵀ] = (1/σ²) vi² E[{f′(wi · x)}² xxᵀ].

This is exactly the same as in the simple perceptron case, except for a factor of vi². For the off-diagonal blocks, we have

E[(∂ log p/∂wi)(∂ log p/∂wj)ᵀ] = (1/σ²) vi vj E[f′(wi · x) f′(wj · x) xxᵀ].

In this case, we have the following form,

Gwiwj = cij I + dii wi wiᵀ + dij wi wjᵀ + dji wj wiᵀ + djj wj wjᵀ,   (6.13)

where the coefficients cij and the dij's are calculated explicitly by similar methods. The v block and the (v, wi) blocks are also calculated similarly. However, the inversion of G is not easy except in simple cases; it requires inversion of a 2(m + 1)-dimensional matrix. Still, this is much better than direct inversion of the original (n + 1)m-dimensional matrix G. Yang and Amari (1997) performed a preliminary study on the performance of the natural gradient learning algorithm for a simple multilayer perceptron. The result shows that natural gradient learning might be free from the plateau phenomenon: once a learning trajectory is trapped in a plateau, it takes a long time to get out of it.
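The block formulas above are easy to check numerically. The following Monte Carlo sketch, with gaussian inputs as assumed in the text, estimates a single wi-wj block of G; the diagonal blocks are obtained by setting wi = wj. All names and the sample size are illustrative.

import numpy as np

def f_prime(u):
    """Derivative of the sigmoid of eq. 6.2: f'(u) = (1 - f(u)^2) / 2."""
    return 0.5 * (1.0 - np.tanh(0.5 * u)**2)

def fisher_block(wi, wj, vi, vj, sigma=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of the (w_i, w_j) block of the Fisher matrix,
    (v_i v_j / sigma^2) E[f'(w_i . x) f'(w_j . x) x x^T],  x ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, wi.size))
    g = f_prime(x @ wi) * f_prime(x @ wj)          # shape (n,)
    return (vi * vj / sigma**2) * ((x * g[:, None]).T @ x) / n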
7 Natural Gradient in the Space of Matrices and Blind Source Separation

We now define a Riemannian structure on the space of all the m × m nonsingular matrices, which forms a Lie group denoted by Gl(m), for the purpose of introducing the natural gradient learning rule to the blind source separation problem. Let dW be a small deviation of a matrix from W to W + dW. The tangent space TW of Gl(m) at W is a linear space spanned by all such small deviations dWij and is called the Lie algebra. We need to introduce an inner product at W by defining the squared norm of dW,

ds² = ⟨dW, dW⟩W = ‖dW‖².

By multiplying by W−1 from the right, W is mapped to WW−1 = I, the unit matrix, and W + dW is mapped to (W + dW)W−1 = I + dX, where

dX = dW W−1.   (7.1)

This shows that a deviation dW at W is equivalent to the deviation dX at I by the correspondence given by multiplication by W−1. The Lie group invariance requires that the metric be kept invariant under this correspondence; that is, the inner product of dW at W is equal to the inner product of dWY at WY for any Y:

⟨dW, dW⟩W = ⟨dWY, dWY⟩WY.   (7.2)
When Y = W−1, WY = I. This principle was used to derive the natural gradient in Amari, Cichocki, and Yang (1996); see also Yang and Amari (1997) for details. Here we give its analysis by using dX. We define the inner product at I by

⟨dX, dX⟩I = Σ_{i,j} (dXij)² = tr(dXᵀdX).   (7.3)
We then have the Riemannian metric structure at W as

⟨dW, dW⟩W = tr{(W−1)ᵀ dWᵀ dW W−1}.   (7.4)
We can write the metric tensor G in component form. It is a quantity having four indices, Gij,kl(W), such that

ds² = Σ Gij,kl(W) dWij dWkl,   Gij,kl(W) = Σ_m δik (W−1)jm (W−1)lm,   (7.5)
where (W−1)jm are the components of W−1. While it may not appear straightforward to obtain the explicit forms of G−1 and the natural gradient ∇̃L, in fact they can be calculated as shown below.
Theorem 6. The natural gradient in the matrix space is given by

∇̃L = (∇L) Wᵀ W.   (7.6)
Proof. The metric is Euclidean at I, so that both G(I) and its inverse, G−1(I), are the identity. Therefore, by mapping dW at W to dX at I, the natural gradient learning rule in terms of dX is written as

dX/dt = −ηt G−1(I) ∂L/∂X = −ηt ∂L/∂X,   (7.7)

where the continuous-time version is used. We have from equation 7.1

dX/dt = (dW/dt) W−1.   (7.8)

The gradient ∂L/∂X is calculated as

∂L/∂X = (∂L(W)/∂W)(∂Wᵀ/∂X) = (∂L/∂W) Wᵀ.
Therefore, the natural gradient learning rule is

dW/dt = −ηt (∂L/∂W) Wᵀ W,

which proves equation 7.6.

The differentials dX = dW W−1 form a basis of the tangent space at W, but this basis is not integrable; that is, we cannot find any matrix function X = X(W) that satisfies equation 7.1. Such a basis is called a nonholonomic basis. It is only locally defined but is convenient for our purpose. Let us calculate the natural gradient explicitly. To this end, we put
l(x, W) = − log det |W| − Σ_{i=1}^n log fi(yi),   (7.9)
where y = W x and fi (yi ) is an adequate probability distribution. The expected loss is L(W) = E[l(x, W)],
which represents the entropy of the output y after a componentwise nonlinear transformation (Nadal & Parga, 1994; Bell & Sejnowski, 1995). The independent component analysis or mutual information criterion also gives a similar loss function (Comon, 1994; Amari et al., 1996; see also Oja & Karhunen, 1995). When fi is the true probability density function of the ith source, l(x, W) is the negative of the log likelihood. The natural gradient of l is calculated as follows. We calculate the differential

dl = l(x, W + dW) − l(x, W) = −d log det |W| − Σ d log fi(yi)

due to the change dW. Then,

d log det |W| = log det |W + dW| − log det |W| = log det |(W + dW)W−1| = log det |I + dX| = tr dX.

Similarly, from dy = dW x,

Σ d log fi(yi) = −ϕ(y)ᵀ dW x = −ϕ(y)ᵀ dX y,

where ϕ(y) is the column vector ϕ(y) = [ϕ1(y1), . . . , ϕm(ym)]ᵀ, with

ϕi(yi) = −(d/dyi) log fi(yi).   (7.10)
This gives ∂L/∂X, and the natural gradient learning equation is

dW/dt = ηt (I − ϕ(y)yᵀ) W.   (7.11)
The efficiency of this equation has been studied from the statistical and information-geometrical points of view (Amari & Kawanabe, 1997; Amari & Cardoso, in press). We further calculate the Hessian by using the natural frame dX,

d²l = yᵀ dXᵀ ϕ̇(y) dX y + ϕ(y)ᵀ dX dX y,   (7.12)
where ϕ̇(y) is the diagonal matrix with diagonal entries dϕi(yi)/dyi. Its expectation can be calculated explicitly (Amari et al., in press). The Hessian decomposes into diagonal elements and two-by-two diagonal blocks (see also Cardoso & Laheld, 1996). Hence, the stability of the above learning rule is easily checked. Thus, in terms of dX, we can solve the two fundamental problems for learning algorithms of blind source separation: their efficiency and their stability (Amari & Cardoso, in press; Amari et al., in press).
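For instantaneous mixtures, equation 7.11 yields a simple online algorithm. The sketch below is a discrete-time version; the choice ϕ(y) = tanh(y), suited to super-gaussian sources, and all parameter values are illustrative rather than prescribed by the text.

import numpy as np

def natural_gradient_ica(x, eta=0.01, seed=0):
    """Online separation by the discrete-time form of eq. 7.11:
    W <- W + eta * (I - phi(y) y^T) W, with phi(y) = tanh(y)."""
    T, m = x.shape
    rng = np.random.default_rng(seed)
    W = np.eye(m) + 0.01 * rng.standard_normal((m, m))
    I = np.eye(m)
    for t in range(T):
        y = W @ x[t]
        W += eta * (I - np.outer(np.tanh(y), y)) @ W
    return W

# Illustrative use on an instantaneous mixture x = As:
rng = np.random.default_rng(1)
s = rng.laplace(size=(20_000, 2))    # independent super-gaussian sources
A = rng.normal(size=(2, 2))          # unknown nonsingular mixing matrix
W = natural_gradient_ica(s @ A.T)    # observations x(t) = A s(t)
print(W @ A)    # approaches a scaled permutation if separation succeeds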
8 Natural Gradient in Systems Space

The problem is how to define the Riemannian structure in the parameter space {W(z)} of systems, where z is the time-shift operator. This was given in Amari (1987) from the point of view of information geometry (Amari, 1985, 1997a; Murray & Rice, 1993). We show here only the ideas (see Douglas et al., 1996; Amari, Douglas, Cichocki, & Yang, 1997, for preliminary studies). In the case of multiterminal deconvolution, a typical loss function l is given by

l = − log det |W0| − Σi ∫ p{yi; W(z)} log fi(yi) dyi,   (8.1)
where p{yi; W(z)} is the marginal distribution of y(t), which is derived from the past sequence of x(t) by the matrix convolution W(z) of equation 3.24. This type of loss function is obtained from maximization of entropy, independent component analysis, or maximum likelihood. The gradient of l is given by

∇m l = −(W0−1)ᵀ δ0m + ϕ(yt) xᵀ(t − m),   (8.2)

where ∇m = ∂/∂Wm, and

∇l = Σ_{m=0}^d (∇m l) z−m.   (8.3)
In order to calculate the natural gradient, we need to define the Riemannian metric G in the manifold of linear systems. The geometrical theory of the manifold of linear systems by Amari (1987) defines the Riemannian metric and a pair of dual affine connections in the space of linear systems. Let

dW(z) = Σ_m dWm z−m   (8.4)

be a small deviation of W(z). We postulate that the inner product ⟨dW(z), dW(z)⟩ is invariant under the operation of any matrix filter Y(z),

⟨dW(z), dW(z)⟩W(z) = ⟨dW(z)Y(z), dW(z)Y(z)⟩WY,   (8.5)
where Y(z) is any system matrix. If we put

Y(z) = {W(z)}−1,

which is a general system not necessarily belonging to the FIR class, then

W(z){W(z)}−1 = I(z),

which is the identity system I(z) = I, not including any z−m terms. The tangent vector dW(z) is mapped to

dX(z) = dW(z){W(z)}−1.   (8.6)
The inner product at I is defined by

⟨dX(z), dX(z)⟩I = Σ_{m,i,j} (dXm,ij)²,   (8.7)
where dXm,ij are the elements of the matrix dXm. The natural gradient

∇̃l = G−1 ◦ ∇l

of the manifold of systems is given as follows.

Theorem 7. The natural gradient of the manifold of systems is given by

∇̃l = ∇l(z) Wᵀ(z−1) W(z),   (8.8)

where the operator z−1 should be applied adequately.

The proof is omitted. It should be remarked that ∇̃l does not belong to the class of FIR systems, nor does it satisfy the causality condition. Hence, in order to obtain an online learning algorithm, we need to introduce a time delay to map it to the space of causal FIR systems. This article shows only the principles involved; details will be published in a separate article by Amari, Douglas, and Cichocki.
9 Conclusions

This article introduces Riemannian structures on the parameter spaces of multilayer perceptrons, blind source separation, and blind source deconvolution by means of information geometry. The natural gradient learning method is then introduced and is shown to be statistically efficient. This implies that optimal online learning is as efficient as optimal batch learning when the Fisher information matrix exists. It is also suggested that natural gradient learning may escape plateaus more easily than conventional stochastic gradient learning.

Acknowledgments

I thank A. Cichocki, A. Back, and H. Yang at the RIKEN Frontier Research Program for their discussions.

References

Amari, S. (1967). Theory of adaptive pattern classifiers. IEEE Trans., EC-16(3), 299–307.
Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185.
Amari, S. (1985). Differential-geometrical methods in statistics. Lecture Notes in Statistics 28. New York: Springer-Verlag.
Amari, S. (1987). Differential geometry of a parametric family of invertible linear systems—Riemannian metric, dual affine connections and divergence. Mathematical Systems Theory, 20, 53–82.
Amari, S. (1993). Universal theorem on learning curves. Neural Networks, 6, 161–166.
Amari, S. (1995). Learning and statistical inference. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 522–526). Cambridge, MA: MIT Press.
Amari, S. (1996). Neural learning in structured parameter spaces—Natural Riemannian gradient. In M. C. Mozer, M. I. Jordan, & Th. Petsche (Eds.), Advances in neural processing systems, 9. Cambridge, MA: MIT Press.
Amari, S. (1997a). Information geometry. Contemporary Mathematics, 203, 81–95.
Amari, S. (1997b). Superefficiency in blind source separation. Unpublished manuscript.
Amari, S., & Cardoso, J. F. (In press). Blind source separation—Semi-parametric statistical approach. IEEE Trans. on Signal Processing.
Amari, S., Chen, T.-P., & Cichocki, A. (In press). Stability analysis of learning algorithms for blind source separation. Neural Networks.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In NIPS'95 (Vol. 8). Cambridge, MA: MIT Press.
Amari, S., Douglas, S. C., Cichocki, A., & Yang, H. H. (1997). Multichannel blind deconvolution and equalization using the natural gradient.
Signal Processing Advances in Wireless Communications Workshop, Paris.
Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semiparametric statistical models. Bernoulli, 3, 29–54.
Amari, S., Kurata, K., & Nagaoka, H. (1992). Information geometry of Boltzmann machines. IEEE Trans. on Neural Networks, 3, 260–271.
Amari, S., & Murata, N. (1993). Statistical theory of learning curves under entropic loss criterion. Neural Computation, 5, 140–153.
Barkai, N., Seung, H. S., & Sompolinsky, H. (1995). Local and global convergence of on-line learning. Phys. Rev. Lett., 75, 1415–1418.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bickel, P. J., Klassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore: Johns Hopkins University Press.
Campbell, L. L. (1985). The relation between information theory and the differential-geometric approach to statistics. Information Sciences, 35, 199–210.
Cardoso, J. F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44, 3017–3030.
Chentsov, N. N. (1972). Statistical decision rules and optimal inference (in Russian). Moscow: Nauka. [Translated into English (1982). Providence, RI: AMS.]
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Douglas, S. C., Cichocki, A., & Amari, S. (1996). Fast convergence filtered regressor algorithms for blind equalization. Electronics Letters, 32, 2114–2115.
Heskes, T., & Kappen, B. (1991). Learning process in neural networks. Physical Review, A44, 2718–2762.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, an adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–31.
Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer-Verlag.
Murata, N., Müller, K. R., Ziehe, A., & Amari, S. (1996). Adaptive on-line learning in changing environments. In M. C. Mozer, M. I. Jordan, & Th. Petsche (Eds.), Advances in neural processing systems, 9. Cambridge, MA: MIT Press.
Murray, M. K., & Rice, J. W. (1993). Differential geometry and statistics. New York: Chapman & Hall.
Nadal, J. P., & Parga, N. (1994). Nonlinear neurons in the low noise limit—A factorial code maximizes information transfer. Network, 5, 561–581.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami et al. (Eds.), Computational intelligence—A dynamic systems perspective (pp. 83–97). New York: IEEE Press.
Opper, M. (1996). Online versus offline learning from random examples: General results. Phys. Rev. Lett., 77, 4671–4674.
Rao, C. R. (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37, 81–91.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Phys. Rev. E, 52, 4225–4243.
Sompolinsky, H., Barkai, N., & Seung, H. S. (1995). On-line learning of dichotomies: Algorithms and learning curves. In J.-H. Oh et al. (Eds.), Neural networks: The statistical mechanics perspective (pp. 105–130). Proceedings of the CTP-PBSRI Joint Workshop on Theoretical Physics. Singapore: World Scientific.
Tsypkin, Ya. Z. (1973). Foundation of the theory of learning systems. New York: Academic Press.
Van den Broeck, C., & Reimann, P. (1996). Unsupervised learning by examples: On-line versus off-line. Phys. Rev. Lett., 76, 2188–2191.
Widrow, B. (1963). A statistical theory of adaptation. Oxford: Pergamon Press.
Yang, H. H., & Amari, S. (1997). Application of natural gradient in training multilayer perceptrons. Unpublished manuscript.
Yang, H. H., & Amari, S. (In press). Adaptive on-line learning algorithms for blind separation—Maximum entropy and minimal mutual information. Neural Computation.
Received January 24, 1997; accepted May 20, 1997.
NOTES
Communicated by E. J. W. Boers
Adding Lateral Inhibition to a Simple Feedforward Network Enables It to Perform Exclusive-Or

Leslie S. Smith
Centre for Cognitive and Computational Neuroscience, Department of Computing Science, University of Stirling, Stirling FK9 4LA, Scotland
A simple laterally inhibited recurrent network that implements exclusive-or is demonstrated. The network consists of two mutually inhibitory units with logistic output function, each receiving one external input and each connected to a simple threshold output unit. The mutually inhibitory units settle into a point attractor. We investigate the range of steepness of the logistic and the range of inhibitory weights for which the network can perform exclusive-or.

1 Background

It is well known that exclusive-or cannot be solved by a single-layer feedforward network with units whose output function is monotonic (Minsky & Papert, 1969), but can be solved using hidden units (Rumelhart, Hinton, & Williams, 1986). We show that it can be solved using lateral inhibition to form a simple recurrent network.

2 Method

The network used is shown in Figure 1a. We write IL and IR for the left and right inputs, SL and SR for the stimulation of the hidden units, L and R for the outputs of the hidden units, SY for the stimulation of the output unit, and Y for its output. We use a simple threshold output unit with variable threshold θ. Since the problem we are trying to solve is symmetric, we make the network symmetric. Because we choose θ, we can set the hidden-to-output weights to 1. Writing F for the monotonically increasing output function of the hidden units,

SY = L + R = F(IL + wR) + F(IR + wL).   (2.1)
This is a recurrent equation, describing a recurrent network; such a network may or may not settle to a stable state.
[Figure 1 here. Panel (a): the network. Panel (b): log(|w|) plotted against the logistic gradient, with the region above the upper line marked “Unstable” and the region below the lower line marked “Exclusive-or not solved.” See the caption below.]
Figure 1: (a) The network used. All weights are 1, except for the lateral weight, w. (b) Results using a logistic output function at the hidden units. y-axis is log(|w|). The network solves the exclusive-or problem for values of w between the two lines. Actual values of w become very large for small β.
In order for the network to solve the exclusive-or problem, we need to be able to set θ so that

SY(0, 0), SY(1, 1) < θ ≤ SY(0, 1), SY(1, 0).   (2.2)
In addition, all the values in equation 2.2 must be stable. It is clear that equation 2.2 cannot hold for w = 0. An increase in IL will result in an increase in SL, leading (by monotonicity) to an increase (or no change) in L. For w > 0, this leads to an increase in SR, and hence an
increase (or no change) in R. Thus, changing IL from 0 to 1 cannot decrease either L or R, making equation 2.2 unsatisfiable. Equation 2.2 can be satisfied only if w < 0. If the hidden units are linear, the stable-state value of the stimulation of the output unit, SY = L + R, depends linearly on the inputs. Thus, satisfying equation 2.2 requires F to be nonlinear. The problems of satisfying equation 2.2 and maintaining stability arise primarily at IL = IR = 1. We need to choose F and make w sufficiently negative so that

L(0, 1) + R(0, 1) > L(1, 1) + R(1, 1),   (2.3)

while retaining stability. When IL = IR = 1, the fixed point is at

L = F(1 + wR) = F(1 + wF(1 + wL)).   (2.4)
We use lockstep parallel updating so that if the network settles, L = R by symmetry. We can find the fixed point by solving

L = F(1 + wL)   (2.5)

and check for stability by considering

L′(L) = F(1 + wF(1 + wL)).   (2.6)

The fixed point is stable if |dL′/dL| < 1 (Phillips & Taylor, 1973). We can proceed by choosing some F and then iteratively solving equation 2.5, varying w to find the point at which |dL′/dL| = 1, that is, the value of w at which the solution becomes unstable. We can then check equation 2.2 to see whether the solution is valid. If so, we can reduce |w| to find (by simulation) the least negative value of w for which the solution remains valid. If not, there are no symmetric solutions.

3 Results

We applied the technique just explained, using the logistic function

F(x) = 1 / (1 + exp(−βx)).
The results are shown in Figure 1b. For small β there is a range of values of w such that the lateral inhibition is sufficient to fulfill equation 2.3 before oscillation occurs. Above the upper line, the network is unstable. Below the lower line, the network does not solve the exclusive-or problem. As w moves toward the upper line from below, the number of iterations can be quite large. The top line (the boundary between solving exclusive-or and instability) was found analytically and checked by simulation. The bottom line (the boundary between not solving and solving exclusive-or) was found by simulation. For β > 1.153 there are no solutions.
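The procedure described above is easily reproduced numerically. The following sketch settles the two hidden units by lockstep iteration and checks condition 2.2 directly; the particular (β, w) values scanned are illustrative, and the solvable range depends on β as in Figure 1b.

import numpy as np

def settle(IL, IR, w, beta, iters=10_000, tol=1e-10):
    """Lockstep iteration of the two mutually inhibitory hidden units.
    Returns S_Y = L + R at the fixed point, or None if no settling."""
    F = lambda u: 1.0 / (1.0 + np.exp(-beta * u))
    L = R = 0.5
    for _ in range(iters):
        L_new, R_new = F(IL + w * R), F(IR + w * L)
        if abs(L_new - L) + abs(R_new - R) < tol:
            return L_new + R_new
        L, R = L_new, R_new
    return None                        # oscillation: network is unstable

def solves_xor(w, beta):
    """Condition 2.2: some threshold separates the four stable outputs."""
    s = [settle(a, b, w, beta) for (a, b) in [(0, 0), (0, 1), (1, 0), (1, 1)]]
    if any(v is None for v in s):
        return False
    s00, s01, s10, s11 = s
    return max(s00, s11) < min(s01, s10)

for w in (-2.0, -5.0, -10.0, -20.0):   # illustrative scan at beta = 1
    print(w, solves_xor(w, beta=1.0))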
4 Discussion

We have shown that exclusive-or can be solved using a small, laterally inhibited network. The implementability and stability of this solution have been shown to depend on the steepness of the transfer function and the magnitude of the inhibitory weight.

Lateral inhibition usually results in one unit's winning or in increasing contrast between output units (reviewed in Anderson, 1995). One can interpret its effect here as an indirect result of contrast enhancement. When IL and IR differ (say, IL = 1 and IR = 0), the lateral inhibition enhances the contrast by decreasing R more than L. When IL = IR = 1, both L and R are decreased by the same amount. Exclusive-or is solved when the overall result of these decreases is that equation 2.3 is fulfilled.

When the symmetric solution discussed here exists, it is the unique fixed point for IL = IR = 1 and is found whatever the initial values of L and R and whatever update scheme is used. As w is made more negative, so that |dL′/dL| > 1, this fixed point splits into two stable fixed points and one unstable fixed point. This results in asymmetric solutions in which, for IL = IR = 1, either L > R or R > L, and the behavior of the network depends on the initial values of L and R and on the update scheme.

References

Anderson, J. A. (1995). An introduction to neural networks. Cambridge, MA: MIT Press.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Phillips, G. M., & Taylor, P. J. (1973). Theory and applications of numerical analysis. New York: Academic Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.

Received November 18, 1996; accepted June 24, 1997.
Communicated by Steven Nowlan
Combined Learning and Use for a Mixture Model Equivalent to the RBF Classifier

David J. Miller
Hasan S. Uyar
Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802-2701, U.S.A.
We show that the decision function of a radial basis function (RBF) classifier is equivalent in form to the Bayes-optimal discriminant associated with a special kind of mixture-based statistical model. The relevant mixture model is a type of mixture-of-experts model for which class labels, like continuous-valued features, are assumed to have been generated randomly, conditional on the mixture component of origin. The new interpretation shows that RBF classifiers effectively assume a probability model, which, moreover, is easily determined given the designed RBF. This interpretation also suggests a statistical learning objective as an alternative to standard methods for designing the RBF-equivalent models. The statistical objective is especially useful for incorporating unlabeled data to enhance learning. Finally, it is observed that any new data to classify are simply additional unlabeled data. Thus, we suggest a combined learning and use paradigm, to be invoked whenever there are new data to classify.

1 Introduction

Radial basis function (RBF) neural networks (Powell, 1985; Broomhead & Lowe, 1988; Moody & Darken, 1989; Bishop, 1991) are simple, modular, nonlinear networks that have found widespread use in data modeling, classification, and control applications. These networks approximate a complicated nonlinear function by a linear combination of simpler functions, each of which has some localized region of principal support. The localized functions are often called basis functions or receptive fields. The RBF output ŷ ∈ R for some input vector x ∈ Rm is typically given in one of the following two forms:

ŷ = Σ_{l=1}^M λl f(x; θl).   (1.1)
ŷ = Σ_{l=1}^M λl [ f(x; θl) / Σ_{m=1}^M f(x; θm) ].   (1.2)
Equation 1.1 will be denoted the RBF structure, and equation 1.2 will be denoted the normalized RBF structure. In both equations, λl is the scalar weight for basis l, with the basis functions {f(x; θl), l = 1, . . . , M} sharing a common parametric form, each with a parameter set θl. Typically the function f(x; θl) can be written in the form φ(‖x − xl‖); that is, the localized response is “centered” at location x = xl and takes on its maximum value there. A common choice for f(·) is a density function or a function that differs from one by a constant factor. In particular, the most common RBFs use a gaussian basis, θl = {xl, Σl}, with xl the center and Σl the associated covariance matrix. Note that this choice for f(·) suggests the possibility of statistical interpretations for the RBF models. For example, the term in parentheses in equation 1.2 is identical in form to the conditional probability associated with a mixture model, one consisting of M equally likely components. Thus, for the input-output pair (x, y), the normalized RBF output ŷ can be viewed as the expected value of y given x, associated with the following statistical model for data generation:

1. Select a mixture component k randomly, according to a uniform probability mass function.

2. Given the selected component, produce the input x according to the density function f(x; θk) and the output, y, in a deterministic fashion, y = λk.

This conditional mean interpretation is relevant in particular when the RBF is used in a regression function approximation context. Note, though, that this interpretation does not hold for equation 1.1. While this conditional mean perspective on RBFs is well known, in this article we identify a new statistical interpretation that applies when the RBF structure is used as a statistical classifier. The new interpretation is valid for both equations 1.1 and 1.2. However, it will be inspired by considering the basic structure (see equation 1.1). Therefore, we focus on this structure in the sequel. The key observation that will lead to our result is that whereas equation 1.2 looks like the conditional mean estimator associated with a mixture model, equation 1.1 looks like the density function associated with a mixture. However, although similar observations have been made before (Lowe, 1991), they are generally not emphasized, primarily because the scalar weights {λk} can be positive or negative; hence, they do not fulfill the role of probability masses for the mixture components. Thus, generally, our previous statistical interpretation for equation 1.2 notwithstanding, RBFs are viewed as just a particular class of neural networks for
approximating a nonlinear mapping. This point of view is borne out by the standard supervised training approaches for designing the RBF structures in both equations. These methods (e.g., Moody & Darken, 1989), similar to standard supervised training methods for other models such as multilayer perceptrons, ultimately aim to minimize the sum of squared distances to target output values.¹ While this function approximation perspective on neural networks continues to pervade, there has been much recent interest in new classes of networks that are amenable to both a probabilistic description and statistical training methods. The mixture-of-experts (Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical-mixture-of-experts models (Jordan & Jacobs, 1994) involve nonlinear processing and distributed data representation like other network structures. However, unlike other networks, mixture-of-experts effectively hypothesizes a probability model for the generation of data. Moreover, the learning is based on maximum likelihood estimation (MLE). In this work, we show that for the problem of statistical classification, the RBF structures (see equations 1.1 and 1.2) are in fact equivalent to a model with a similar probabilistic description. In particular, the RBF classifier is equivalent in form to the Bayes classifier associated with a specific type of mixture-based model, one that is really a type of mixture-of-experts model. After demonstrating this equivalence, we then consider the advantages of this alternate description. The primary advantage will be seen to relate to RBF training. In particular, unlike standard RBFs, the RBF-equivalent probability model is amenable to likelihood-based training, which is particularly advantageous for incorporating unlabeled data, as well as labeled data, within the learning. While the resulting training approach will appear at first glance to be limited to tackling problems where a large unlabeled training set exists, the scope of application will then be extended by a suggested combined learning and use paradigm, which can be invoked whenever there are new data to classify. Several benchmark data sets will be used to demonstrate the combined learning and use approach.

¹ The method of Moody and Darken (1989) uses unsupervised clustering initially to learn the “centers” and the “width” parameters that specify Σk. However, the ultimate objective is regression to target outputs.

2 RBF Classifiers and Mixture Models

2.1 Formulation. In this section, we consider classifiers based on the RBF structure given in equation 1.1, with f(·) assumed to be a density function. For a J-class problem, there is one RBF output per class,

gj(x) = Σ_{k=1}^M f(x; θk) λkj,   j = 1, . . . , J.   (2.1)
The associated decision function, S : Rm → {1, 2, . . . , J}, is the winner-take-all rule:

S(x) = arg max_j gj(x).   (2.2)
We wish to show that this function is equivalent to the Bayes decision function associated with a particular mixture model. This will be accomplished by applying simple transformations to equation 2.2 that do not change the decision function. First, define λmin = min_{k,j} λkj. Then, subtracting λmin Σk f(x; θk) from each class output, we obtain the equivalent rule

S(x) = arg max_j Σk f(x; θk) λ̃kj,   (2.3)
where λ̃kj ≡ λkj − λmin. Note that λ̃kj ≥ 0. Next, we divide each output by the constant Σ_{m,n} λ̃mn to obtain

S(x) = arg max_j Σk f(x; θk) qkj,   (2.4)
where qkj ≡ λ̃kj / Σ_{m,n} λ̃mn. Now we have 0 ≤ qkj ≤ 1 and Σ_{k,j} qkj = 1. Finally, we can normalize each transformed class output g̃j(x) ≡ Σk f(x; θk) qkj by the sum Σ_{l=1}^J g̃l(x) to yield the rule

S(x) = arg max_j [ Σk f(x; θk) qkj / Σ_{m,n} f(x; θm) qmn ]
     = arg max_j [ Σk f(x; θk) (qkj / Σn qkn)(Σn qkn) / Σm f(x; θm)(Σn qmn) ].   (2.5)
The motivation behind this final equivalent form for the decision function is seen next. Consider the following mixture-based statistical model for generating feature/class-label data pairs (x, c), x ∈ Rm, c ∈ {1, 2, . . . , J} (Ghahramani & Jordan, 1994; Miller & Uyar, 1996); a sampling sketch is given after the two steps:

1. Randomly select one of M mixture components according to the probability mass function {αl, l = 1, . . . , M}.

2. Given the selected component, k, choose (1) a feature vector x according to the component density f(x; θk) and (2) a class label c according to the conditional probabilities {βj|k, j = 1, . . . , J}.
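A minimal sampling sketch of this two-step generative process follows; a gaussian basis is assumed, and all names are illustrative.

import numpy as np

def sample_pair(alpha, beta_jk, means, covs, rng):
    """Draw one (x, c) pair from the mixture-of-experts data model.

    alpha:    (M,) component probability mass function {alpha_l}.
    beta_jk:  (M, J) conditional class probabilities beta_{j|k}.
    means, covs: gaussian parameters theta_k (a gaussian basis assumed).
    """
    k = rng.choice(len(alpha), p=alpha)                    # step 1
    x = rng.multivariate_normal(means[k], covs[k])         # step 2, part 1
    c = rng.choice(beta_jk.shape[1], p=beta_jk[k])         # step 2, part 2
    return x, c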
This model is a generalization of a standard mixture model wherein, like the features {x}, the class labels are also generated randomly, conditional on the selected mixture component. Note that this is different from the usual assumption that mixture components are deterministically “owned” by classes. For this model, the a posteriori class probabilities take the form

P[c | x] = Σk βc|k [ αk f(x; θk) / Σl αl f(x; θl) ].   (2.6)
These probabilities have a mixture-of-experts structure, where the “gating units” are the conditional probabilities of mixture components given feature vectors (the bracketed term in equation 2.6), and the “expert” for component k is just the conditional probability {βc|k}. The associated Bayes decision rule is

Sbayes(x) = arg max_j [ Σk αk βj|k f(x; θk) / Σk αk f(x; θk) ].   (2.7)
Now, comparing equations 2.5 and 2.7, we see that they are equivalent. In particular, since

Σ_{j=1}^J (qkj / Σn qkn) = 1,

we may identify qkj / Σn qkn with βj|k. Similarly, since

Σ_{k=1}^M (Σ_{n=1}^J qkn) = 1,

we may identify Σn qkn with αk. Thus, the decision function of an RBF classifier is equivalent to that associated with a particular mixture-based statistical model. Note, too, that while we have shown this result for the RBF structure in equation 1.1, it also holds for equation 1.2, since normalization does not alter the decision rule.

2.2 Discussion. There are two general, practical implications of this result. First, while the derivation shows the equivalence between RBF and mixture-based classifiers in general, it also provides a simple procedure for obtaining the equivalent mixture model, given a designed RBF. This may
be helpful to the designer in interpreting RBF solutions since it identifies the modeling assumptions that are implicitly made. This equivalence also allows the designer to apply techniques and theory from statistical estimation in evaluating RBF solutions. Moreover, in some cases, one is interested in both hard classification decisions and a probabilistic assessment of class ownership. The RBF-equivalent model directly provides this information via the posterior probabilities {P[c | x]}. The second practical consequence overcomes some weaknesses with the first one and also provides the possibility for improved classifier performance. Note that since the RBF model is typically designed for a regression cost to target class values, the RBF-equivalent mixture parameters will not be (even locally) optimal in terms of statistical criteria such as data likelihood. Thus, it may not be appropriate to evaluate RBF solutions in terms of statistical criteria (e.g., bias and variance of the parameter estimates) or to rely on the accuracy of the posterior probabilities. However, the equivalent statistical model is naturally suited to MLE-based training. Thus, our result suggests the possibility of statistical training (and evaluation) for RBF-equivalent models, when such training is warranted. Regarding the choice of the “right” training criterion, we note that several different supervised learning approaches have been suggested for training RBF classifiers, including methods that minimize a regression objective to class targets (Moody & Darken, 1989; Tarassenko & Roberts, 1994), as well as methods that more directly aim to minimize an error count measure (Juang & Katagiri, 1992; Miller, Rao, Rose, & Gersho, 1996). It is not clear in general that statistical training (MLE) is more appropriate than these methods when the ultimate goal is classification. However, one situation where statistical training appears to be superior to standard supervised learning is the scenario where training data are deficient. In particular, for applications such as classification of medical or remotely sensed images, the amount of training data is in principle voluminous, yet the amount of labeled data (which must be extracted through an expensive, time-consuming process) may be inadequate. Standard supervised learning approaches have no direct way to utilize unlabeled data and are forced to discard them. For classifiers based on statistical models, on the other hand, unlabeled data may be used to improve parameter estimates, which in turn leads to more accurate a posteriori probabilities (Shashahani & Landgrebe, 1994; Miller & Uyar, 1996). In the remainder of this article, we focus on this label-deficient learning problem. In the next section, we summarize a method for incorporating unlabeled data within learning for the RBF-equivalent mixture model. When a significant amount of unlabeled data is available, this learning approach achieves substantial performance benefits over standard supervised methods for the conventional RBF structure, which are forced either to discard the unlabeled data or to make limited use of them (Miller & Uyar, 1996). We then show that this new learning approach in fact has more general applicability.
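The transformations of equations 2.3 through 2.5 amount to only a few lines of code. The sketch below recovers the equivalent mixture parameters from a designed RBF's weight matrix; it ignores the degenerate case in which all weights are equal, and the function name is illustrative.

import numpy as np

def rbf_to_mixture(lmbda):
    """Map RBF output weights to the equivalent mixture parameters.

    lmbda: (M, J) array of weights lambda_{kj} from equation 2.1.
    Returns alpha_k = sum_n q_{kn} and beta_{j|k} = q_{kj} / sum_n q_{kn},
    following the identifications made after equation 2.7.
    """
    q = lmbda - lmbda.min()        # lambda-tilde of eq. 2.3, nonnegative
    q = q / q.sum()                # q_{kj} of eq. 2.4, sums to one overall
    alpha = q.sum(axis=1)          # component priors alpha_k
    beta = q / alpha[:, None]      # class probabilities beta_{j|k}
    return alpha, beta

The basis-function parameters θk are carried over unchanged, so the posterior probabilities P[c | x] of equation 2.6 can then be read off directly.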
3 Learning for the RBF-Equivalent Model

3.1 A Mixed Training Approach. We consider the classification learning problem where the training set X is assumed to consist of two subsets, X = {Xl, Xu}, where Xl = {(x1, c1), (x2, c2), . . . , (xNl, cNl)} is the labeled subset and Xu = {xNl+1, . . . , xN} is the unlabeled subset.² Here, xi ∈ Rk is the feature vector and ci ∈ I is the class label from the label set I = {1, 2, . . . , J}.

² This is a type of missing data problem, wherein the missing items are class labels. As such, it is related to supervised learning involving missing or noisy feature components, addressed by Ghahramani and Jordan (1994) and Tresp, Neuneier, and Ahmad (1995).

Lippmann (1989) pointed out the practical significance of this learning problem, given the time and expertise required for extracting labeled training samples. Castelli and Cover (1995, 1996) have analyzed this problem for the case of an infinite amount of unlabeled data and a finite amount of labeled data. Their results are less than optimistic concerning the value of unlabeled data (relative to labeled data) for classification. However, Shashahani and Landgrebe (1994) found that unlabeled data could be used effectively to improve performance in label-deficient situations. They suggested a statistical model for the data naturally suited to a maximum likelihood learning scheme involving both the labeled and unlabeled data subsets. Alternative approaches for this mixed training problem were also suggested by Towell (1996) and de Sa (1994). In Miller and Uyar (1996), we built on the work of Shashahani and Landgrebe, suggesting improvements to both the classifier structure and the learning method. We also introduced the notion of combined learning and use for classification. First, while Shashahani and Landgrebe's model assumed that mixture components are exclusively “owned” by classes, we suggested a more general probability model with an associated classifier structure that in this article has been shown to be equivalent to the RBF classifier. Second, while a conditional data likelihood criterion was suggested in Shashahani and Landgrebe (1994), we proposed an alternative learning criterion based on the joint data likelihood, over both the labeled and unlabeled data subsets. In particular, we considered the criterion

log L = Σ_{xi∈Xu} log Σ_{l=1}^L αl f(xi; θl) + Σ_{xi∈Xl} log Σ_{l=1}^L αl βci|l f(xi; θl).   (3.1)
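Before turning to the specific algorithms, a minimal EM-style sketch for maximizing equation 3.1 may be helpful. It is one plausible variant (the article considers two distinct formalisms), and it assumes spherical gaussian components purely to keep the code short; all names are illustrative.

import numpy as np

def em_mixed(Xl, cl, Xu, M, J, n_iter=100, seed=0):
    """EM-style maximization of the joint likelihood, equation 3.1,
    with spherical gaussian components (an assumption of this sketch).

    Xl: (Nl, d) labeled features; cl: (Nl,) labels in {0, ..., J-1};
    Xu: (Nu, d) unlabeled features.
    """
    rng = np.random.default_rng(seed)
    cl = np.asarray(cl)
    X = np.vstack([Xl, Xu])
    N, d = X.shape
    Nl = len(Xl)
    mu = X[rng.choice(N, M, replace=False)]       # centers at data points
    var = np.full(M, X.var())
    alpha = np.full(M, 1.0 / M)
    beta = np.full((M, J), 1.0 / J)
    for _ in range(n_iter):
        # E-step: responsibilities of every component for every point;
        # labeled points additionally weight components by beta_{c_i|l}.
        logp = (-0.5 * ((X[:, None, :] - mu)**2).sum(-1) / var
                - 0.5 * d * np.log(2.0 * np.pi * var) + np.log(alpha))
        logp[:Nl] += np.log(beta[:, cl].T)
        g = np.exp(logp - logp.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)
        # M-step: reestimate all parameters from the responsibilities.
        nk = g.sum(axis=0)
        alpha = nk / N
        mu = (g.T @ X) / nk[:, None]
        var = np.array([(g[:, k] * ((X - mu[k])**2).sum(axis=1)).sum()
                        for k in range(M)]) / (d * nk)
        for j in range(J):
            beta[:, j] = g[:Nl][cl == j].sum(axis=0)
        beta /= beta.sum(axis=1, keepdims=True)
    return alpha, beta, mu, var

Classification of a point x then follows the Bayes rule of equation 2.7 with the estimated parameters.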
Two distinct expectation-maximization (EM) algorithms (Dempster, Laird, & Rubin, 1977) were suggested for maximizing log L. These approaches differ in the EM formalism applied to the unlabeled data (Miller & Uyar, 1996). We found that the suggested probability model and learning approach achieved significant performance gains over conventional, supervised learning approaches for designing the RBFs. Moreover, this approach was also seen to outperform the method of Shashahani and Landgrebe (1994), with
the gains most significant when the labeled subset is small. No significant performance differences were observed for the two EM approaches.

3.2 Combined Learning and Use. While the learning approach just summarized appears to be applicable only when the training data are label deficient, the potential range of application is substantially extended by the following observation: test data (with labels withheld), or for that matter any new batch of data to be classified, can be viewed as a new, unlabeled data set. Hence, these new data can be taken to be Xu and used for learning prior to classification. We thus suggest a combined learning-classification operation that can be applied whenever there are new data to classify. In the usual supervised learning setting, there is a clear division between the learning and classification (use) phases. In this setting, modification of the classifier for data to be classified is not typically done, since standard supervised methods cannot easily incorporate unlabeled data within the learning. However, the mixed training approach naturally incorporates unlabeled data. In the scheme we suggest, this learning for new data can be viewed simply as part of the classification operation. This is somewhat analogous to adaptive filtering, where new data are both filtered and used to adapt the filter. An even more apt analogy may be to image segmentation, where there is a common energy function minimized for each new image to be segmented. Each such minimization determines a model local to the image and a segmentation for the image. Our “segmentation” operation is just classification, with the log likelihood playing the role of the energy function. This objective function consists of one term that is always fixed (based on a given labeled training set) and one term that is modified based on each new batch of unlabeled data to classify (see equation 3.1). We can envision several distinct learning contexts where this scheme can be used, as well as different ways of realizing the combined learning-classification operation. One use is in classification of an image or speech archive, where each image or speaker segment is a separate data batch. Each batch to classify can be used as an unlabeled training set, either in concert with a representative labeled data set or to modify a design based on such a set. Effectively, this scheme adapts the classifier to each new data batch. A second application is supervised learning wherein the total amount of data is fixed. Here, we need to divide the data into training and test sets with the conflicting goals of achieving a good design and accurately measuring generalization performance. Combined learning and classification can be used here to mitigate the loss in performance associated with the choice of a large test set. More generally, our scheme can be used effectively in any setting where the new data to classify are either sizable or innovative relative to the existing training set. We next describe several experiments involving this suggested scheme, including one for the problem of vowel recognition given an archive of examples from different speakers. In this context, given a limited training set, a conventional classification approach
We next describe several experiments involving this suggested scheme, including one for the problem of vowel recognition given an archive of examples from different speakers. In this context, given a limited training set, a conventional classification approach is of necessity speaker-independent. By contrast, we show that combined learning and use provides a way to achieve speaker-dependent adaptation (and use) of the model, with resulting gains in classification performance.

4 Experimental Results

4.1 Basic Combined Learning and Classification. In a basic application of the paradigm, we simply use the new unlabeled features to augment an existing training set. After training on the combined data set using the scheme in Miller and Uyar (1996), the new data set is classified. As one example, we considered the 3-class, 21-dimensional, 5000-sample waveform.data set from the University of California at Irvine machine learning repository. The number of labeled training samples, Nl, was varied, with the test set (Xu) composed of the remaining 5000 − Nl samples. We assumed L = 15 components (basis functions) and compared three different schemes:

1. A deterministic annealing method for RBFs (DA-RBF) that seeks to minimize an error count measure directly while avoiding local optima of the cost. Although this scheme has been found to achieve very good designs elsewhere (Miller et al., 1996), it utilizes only Xl.

2. The method of Moody and Darken (1989) (MD-RBF). For this approach, we also utilized only Xl for training.

3. Combined learning and use (CL-RBF), that is, mixed training based on Xl and the unlabeled test set samples.3

As seen in Figure 1, CL-RBF provides a significant performance advantage over the other two methods.4 Note in particular that the test set performance for CL-RBF is nearly constant as the labeled subset size is varied, while DA-RBF and, in particular, MD-RBF suffer a large loss in performance as Nl is decreased.

3 For each choice of Nl, the various learning approaches produced five solutions based on random parameter initialization, for each of six different labeled subset realizations. The test set performance was then averaged over these 30 trials. The exception was the DA method, for which only a single initialization was performed.

4 The particular choices for the initialization and the data subset realization account for the nonmonotonicity of the performance curves.

4.2 Speaker-Dependent Vowel Recognition. Another application for our scheme is in the classification of a speech archive. The archive includes examples of vowels from different speakers, with the speaker identity known. While examples of the same vowel may share similar statistics across speakers, examples of different vowels from the same speaker may also share a common statistical character. This suggests that it may be sensible to separate the data into speaker-dependent batches.
Figure 1: Test set misclassification fraction versus labeled subset size for three RBF classifier learning approaches. The data set is the 21-dimensional waveform.dat set, from the UC Irvine machine learning repository. The combined learning and use approach (CL-RBF) outperforms the other two methods over the entire range of labeled subset sizes.
Each such batch to classify can be taken as an unlabeled set for combined learning and use, either in concert with a labeled training set (derived from either the entire archive or an independent data set) or to modify a previous design based on such a labeled training set. The success of this scheme rests on the potential for adapting the classifier to each data batch, based on the unlabeled batch features. We have tested this idea on Deterding's vowel.dat set, consisting of log area ratios, derived from linear predictive coding coefficients, representing 11 different vowels. In this set, there are six examples of each vowel from each of 15 different speakers (990 examples in all). We used two examples of each vowel from each speaker as the labeled training set (330 samples in all), with the remaining four examples from each speaker used as the test set. Note that the data set is too small to design separate classifiers for each speaker in a conventional way. We chose to compare two different combined learning approaches, along with the MD-RBF method. In the speaker-independent (SI) scheme, the test set of 660 samples was viewed collectively as Xu and used in concert with Xl for combined learning. We then classified Xu.
Table 1: Average Test Set Misclassification Fraction for Deterding's vowel.dat Set.

Model Size       Method    Train-err/Test-err
11 components    SD        0.58/0.58
                 SI        0.61/0.61
                 MD-RBF    0.63/0.64
22 components    SD        0.45/0.46
                 SI        0.49/0.50
                 MD-RBF    0.50/0.52
33 components    SD        0.37/0.38
                 SI        0.41/0.42
                 MD-RBF    0.44/0.44
In the speaker-dependent (SD) method, the 44 test set samples from each speaker were viewed as distinct data batches. We thus performed combined learning separately for each speaker, based on Xl and the speaker-specific test set. Each batch was then classified based on its speaker-specific model. Note the trade-off between the two designs: the speaker-specific scheme designs each of the 15 classifiers using only the 330 labeled samples and 44 speaker-specific unlabeled samples, while the speaker-independent scheme uses the entire data set for its single design. However, this latter classifier must perform well for all speakers rather than just a single one.

Results were obtained for models of size 11, 22, and 33 mixture components. The performance measure chosen was the average test set error fraction, computed based on different choices for the parameter initialization and for the data subset realization. We designed classifiers for all 15 possible realizations of the training and test sets. For each realization and each model size, 20 classifiers were designed based on random parameter initialization, with the test set performance averaged over the 300 solutions. As shown in Table 1, the speaker-specific scheme provides a consistent performance advantage over the speaker-independent one,5 with both methods outperforming MD-RBF.

5 The high error rates observed for this experiment are consistent with prior results reported for this data set.
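In terms of the hypothetical em_mixed and classify helpers sketched in section 3.2 above, the speaker-dependent scheme amounts to rerunning the mixed-data EM once per speaker batch:

# `batches` is assumed to map each speaker to that speaker's 44 unlabeled
# test vectors; Xl, yl are the 330 labeled training samples.
predictions = {}
for speaker, Xu_batch in batches.items():
    params = em_mixed(Xl, yl, Xu_batch, L=22, C=11)   # e.g., the 22-component model
    predictions[speaker] = classify(Xu_batch, *params)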
5 Conclusion

We have shown that the decision function of an RBF classifier is equivalent to that of the Bayes classifier for a certain mixture-based statistical model. One implication of this result is that RBF classifiers effectively assume a probability model for (feature, label) data pairs. Moreover, this model can be identified by applying simple transformations to the designed RBF. The most significant advantage of this perspective relates to the supervised learning problem. Unlike standard RBFs, the RBF-equivalent probability model is amenable to likelihood-based training, which is the key to assimilating mixed labeled and unlabeled training data within the learning. While this mixed training approach appeared at first glance to be limited to problems such as medical or remotely sensed image classification, where a large, label-deficient training set may exist, the range of applications was then extended by a suggested combined learning and use paradigm, which can be applied whenever there are new data to classify. Future work may investigate the extension of this approach to the problems of regression and time-series prediction.

Acknowledgments

DJM thanks Anand Rangarajan for a stimulating discussion. This work was supported in part by National Science Foundation Career Award IRI-9624870.

References

Bishop, C. (1991). Improving the generalization properties of radial basis function neural networks. Neural Computation, 3, 579–588.
Broomhead, D. S., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2(3), 321–355.
Castelli, V., & Cover, T. M. (1995). On the exponential value of labeled samples. Pattern Recognition Letters, 16, 105–111.
Castelli, V., & Cover, T. M. (1996). The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6), 2102–2117.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Neural information processing systems, 6 (pp. 120–127). Cambridge, MA: MIT Press.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1), 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Juang, B. H., & Katagiri, S. (1992). Discriminative learning for minimum error classification. IEEE Transactions on Signal Processing, 40, 3043–3054.
Lippmann, R. P. (1989). Pattern classification using neural networks. IEEE Communications Magazine, 27, 47–64.
Lowe, D. (1991). What have neural networks to offer statistical signal processing? In Proceedings of the SPIE Conference on Adaptive Signal Processing (pp. 460–471).
Miller, D. J., & Uyar, H. S. (1996). A mixture of experts classifier with learning based on both labelled and unlabelled data. In M. C. Mozer, M. I. Jordan, T. Petsche (Eds.), Neural information processing systems, 9. Cambridge, MA: MIT Press. Miller, D., Rao, A., Rose, K., & Gersho, A. (1996). A global optimization method for statistical classifier design. IEEE Transactions on Signal Processing, 44(12), 3108–3122. Moody, J., & Darken, C. J. (1989). Fast learning in locally-tuned processing units. Neural Computation, 1, 281–294. Powell, M. J. D. (1985). Radial basis functions for multivariable interpolation: A review. In Algorithms for approximation (pp. 143–167). New York: Oxford University Press. Shashahani, B., & Landgrebe, D. (1994). The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32, 1087–1095. de Sa, V. (1994). Learning classification with unlabeled data. Neural Information Processing Systems, 6, 112–119. Tarassenko, L., & Roberts, S. (1994). Supervised and unsupervised learning in radial basis function classifiers. IEEE Proc.-Vis. Image Sig. Proc., 141, 210–216. Towell, G. (1996). Using unlabeled data for supervised learning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Neural information processing systems, 8 (pp. 647–653). Cambridge, MA: MIT Press. Tresp, V., Neuneier, R., & Ahmad, S. (1995). Efficient methods for dealing with missing data in supervised learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Neural information processing systems, 7 (pp. 689–696). Cambridge, MA: MIT Press. Received January 2, 1997; accepted June 10, 1997.
LETTERS
Communicated by Steven Nowlan
Modeling the Surround of MT Cells and Their Selectivity for Surface Orientation in Depth Specified by Motion

Lin Liu
Marc M. Van Hulle
K.U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Leuven, Belgium
The projective transformation onto the retina loses the explicit 3D shape description of a moving object. Theoretical studies show that the reconstruction of 3D shape from 2D motion information (shape from motion, SFM) is feasible provided that the first- and second-order directional derivatives of the 2D velocity field are available. Experimental recordings have revealed that the receptive fields of the majority of the cells in macaque area middle temporal (MT) display an antagonistic (suppressive) surround and that a sizable portion of these surrounds are asymmetrical. This has led to the conjecture that these cells provide a local measure for the directional derivatives of the 2D velocity field. In this article, we adopt a nonparametric and biologically plausible approach to modeling the role played by the MT surrounds in the recovery of the orientation in depth (the slant and tilt) of a moving (translating) plane. A three-layered neural network is trained to represent the slant and tilt from the projected motion vectors. The hidden units of the network have speed-tuning characteristics and represent the MT model neurons with their surrounds. We conjecture that the MT surround results from lateral inhibitory connections with other MT cells and that populations of these cells, with different surround types, code linearly for slant and tilt of translating planes.

1 Introduction

There is a growing body of evidence that the middle temporal (MT, V5) cortical area, located in the posterior bank of the superior temporal sulcus, systematically represents the direction and velocity of visual motion (Allman & Kaas, 1971; Dubner & Zeki, 1971; Albright, 1984; Maunsell & Van Essen, 1983; Movshon, Adelson, Gizzi, & Newsome, 1985; Lagae, Raiguel, & Orban, 1993). More than 90% of the MT neurons have a surround that envelops the classical receptive field (CRF) (Allman, Miezin, & McGuinness, 1985), and it reveals its presence only by the effect on the CRF response. Since the MT surrounds are extensive in size, it is believed that MT neurons are able to integrate local stimulus conditions within a global context (local-to-global comparison) (Allman et al., 1985). Furthermore, the majority of

Neural Computation 10, 295–312 (1998)
© 1998 Massachusetts Institute of Technology
the MT surrounds in the macaque monkey are antagonistic (suppressive) (Raiguel, Van Hulle, Xiao, Marcar, & Orban, 1995; Tanaka et al., 1986). This has led to a series of speculations about the role of MT surrounds: figure-ground segregation, preattentive vision, depth perception, detection of motion discontinuities, and differentiation of object motion from egomotion (Albright, 1984, 1993; Allman et al., 1985; Tanaka et al., 1986).

Lesion studies performed by Andersen and Siegel (1986) have shown that MT plays a prominent role in the perception of 3D shape in moving stimuli (shape from motion, SFM). Furthermore, evidence indicating that human SFM perception involves surface interpolation (Hussain, Treue, & Andersen, 1989; Treue, Andersen, Ando, & Hildreth, 1995) has led to the hypothesis that the spatial structure of the receptive field (RF) is involved in this process. Hence, in this view, SFM is redefined as the problem of characterizing interpolating surfaces. Spurred by the progress in optic flow theory, physiological predictions were formulated about the RF profiles needed to allow for an SFM extraction (Droulez & Cornilleau-Pérès, 1990; Koenderink & van Doorn, 1992), but without making clear statements about whether these models applied to MT neurons. Buračas and Albright (1994) have modeled the velocity selectivity curves of MT neurons as sums and differences of two gaussian-distributed velocity filters: one for the center and another for the surround (a parametric model). However, they did not model the spatial layout of the MT RF itself and assumed it to be radially symmetrical, although they also qualified this assumption by predicting the existence of asymmetrical surrounds (but "possibly in areas FST or MST," p. 975).

Recently, it was found that the majority of the MT surrounds are not radially symmetrical (Xiao, Raiguel, Marcar, Koenderink, & Orban, 1995, 1997; Xiao, Marcar, Raiguel, & Orban, 1997), in contrast with what was previously assumed (Tanaka et al., 1986). The type of surrounds obtained led the experimentalists to conjecture that MT could locally compute the first- and second-order directional derivatives of the velocity field induced by the moving (translating) object (Xiao et al., 1995). Such a conjecture has important implications, since theoretical studies show that SFM is feasible when these directional derivatives are available (Droulez & Cornilleau-Pérès, 1990; Koenderink & van Doorn, 1992). Inspired by this new experimental evidence, Buračas and Albright (1996) introduced asymmetrical surrounds in their parametric model and discussed the implication for SFM extraction. Finally, we should add that to date, no published experimental study has demonstrated selectivity for surface curvature in single MT cells; existing studies have looked for orientation-in-depth selectivity (slant and tilt selectivity) of translating planes (Treue et al., 1995; Xiao et al., 1995, 1997). (In addition to curvature, slant and tilt are commonly used geometrical descriptors of surface shape.)

In order to investigate the putative contribution of MT cells to SFM, and in particular to the recovery of the slant and tilt of a translating plane, we develop in this article a biologically plausible model of MT RFs by using the
known RF properties for guidance. However, unlike Buračas and Albright (1996), we avoid adopting a parametric approach that considers the MT CRFs and their surrounds as differential motion filters for estimating space differentials from which local surface shape can be determined. Instead, we adopt a nonparametric approach and train a three-layered recurrent network wherein the MT model neurons comprise the hidden units and whose output units represent the desired slant and tilt of the translating plane. Furthermore, since speed tuning is one of the most important properties of MT cells (Lagae, Raiguel, & Orban, 1993; Maunsell & Van Essen, 1983), and also of their surrounds (Allman et al., 1985; Tanaka et al., 1986; Lagae, Gulyás, Raiguel, & Orban, 1989; Xiao et al., 1997), we assume that our hidden units are endowed with a speed-tuning characteristic by their input units.

The shapes and characteristics of the CRFs and surrounds developed are in accordance with those found in single-cell recordings primarily in four ways: (1) the surrounds of our MT neurons are predominantly antagonistic and reveal their presence only when the CRF is stimulated; (2) the surrounds developed are radially symmetric, bilaterally symmetric, or asymmetric; (3) the axis of the bilaterally symmetric surrounds is orthogonal to the long axis of the elliptically shaped CRF; and (4) the surround inhibition is speed tuned. We conjecture that the surround results from lateral inhibitory connections within MT and that populations of MT cells code linearly for slant and tilt of translating surfaces.

2 Velocity Gradient as a Cue for Slant and Tilt

Most computational approaches to SFM are rooted in the observation that the 3D shape of a solid, smooth object can be locally approximated by the first-order geometrical descriptors slant and tilt. We consider a translating plane, on which dots are densely distributed, and project the corresponding motion vectors onto the fronto-parallel plane (orthogonal to the line of sight). Hence, the slant and tilt of the translating surface are the sole variables in our SFM problem, up to a scaling factor, and they are to be determined from the gradient of the projected motion vectors (to a first approximation, the velocity field caused by translational motion of a smooth surface is proportional to the surface shape). As a result of the projective transformation, the observer will have the impression that dots located nearer to him or her are moving faster, while dots farther away are moving more slowly. (Because the surface undergoes only translational motion, we will not distinguish between observer ego motion and surface motion.)

The stimulus setup and coordinate system are shown in Figure 1. The origin (0, 0, 0) is set at the fixation point on the moving surface. The observer, located at O(0, 0, b), is at a distance b from the fixation point. The plane through R(0, 0, a), and orthogonal to the z-axis (line of sight), is taken as the projection plane. For a given position Q(x, y, z) on the moving surface, the line OQ intersects the projection plane in P(x0, y0, a). (The arbitrary choice of
Figure 1: The stimulus setup and coordinate system used. The observer is at point O, and the fixation point, on the moving plane, is at the origin of the coordinate system XYZ. The line of sight corresponds to the z-axis. The line OQ, with point Q on the moving plane, intersects at point P on the projection plane. The projection plane is orthogonal to the line of sight and fixed at a distance a from the fixation point. The slanted and tilted plane undergoes translational motion in the direction of the y-axis with velocity Vp. The slant and tilt, as well as the inward surface normal N⊥, are indicated. The axis RA (dashed line) is needed in the definition of slant and tilt.
the line of sight implies that all computations are supposed to be local.) The slant and tilt are defined as follows. Consider first the moving surface to be parallel to the projection plane, and define an axis RA, on the moving surface, that at first coincides with the x-axis. The slant θ is the angle defined by a rotation of the moving surface around RA. The tilt α is the angle defined by a rotation of RA itself around the z-axis. (Note that the slant also corresponds to the angle between the line of sight and the inward surface normal N⊥; the tilt also corresponds to the angle between the y-axis and the direction of decreasing distance.)
The parametric equation of the line OPQ is

$$x = x_0 t, \qquad y = y_0 t, \qquad z = b + (a - b)t, \tag{2.1}$$

where t = 1 for point P, and for point Q,

$$t = \frac{b\cos\theta}{x_0\sin\alpha\sin\theta + y_0\cos\alpha\sin\theta - (a - b)\cos\theta}. \tag{2.2}$$

Assuming that the plane moves with velocity $V_p$ in the direction of the y-axis, the velocity V at point $P(x_0, y_0)$, in the projection plane, is then

$$V(x_0, y_0) = V_p\,\frac{x_0\sin\alpha\sin\theta + y_0\cos\alpha\sin\theta - (a - b)\cos\theta}{b\cos\theta}. \tag{2.3}$$
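For concreteness, equation 2.3 is easy to evaluate over a discretized projection plane; the sketch below uses the 13 × 13 grid and the values a = 8, b = 30 adopted in section 3 (the mapping of pixels to degrees of visual angle is our assumption):

import numpy as np

def projected_velocity(x0, y0, slant, tilt, a=8.0, b=30.0, vp=1.0):
    """Velocity at projection-plane point (x0, y0) for a plane with the given
    slant and tilt (radians) translating along the y-axis (equation 2.3)."""
    num = (x0 * np.sin(tilt) * np.sin(slant)
           + y0 * np.cos(tilt) * np.sin(slant)
           - (a - b) * np.cos(slant))
    return vp * num / (b * np.cos(slant))

# Example: the 13 x 13 projected motion field for a plane at
# 60-degree slant and 45-degree tilt.
grid = np.linspace(-7.75, 7.75, 13)        # assumed pixel centers, in degrees
X0, Y0 = np.meshgrid(grid, grid)
V = projected_velocity(X0, Y0, np.radians(60.0), np.radians(45.0))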
3 Multilayered Recurrent Network Model

We consider a three-layered network (see Figure 2) that we will train with a supervised learning algorithm. The projected motion field (the "retina") is discretized into 13 × 13 pixels. We take a = 8 and b = 30, so that the input to the network spans a window of 15.5 × 15.5 degrees of visual angle. To each pixel corresponds an input unit with an (idealized) speed-tuning characteristic, as shown in Figure 3. The output units of the network code for the slant and tilt of the moving plane. The hidden units represent the MT neurons. They receive input from the input units and from other hidden units, and relay their output activations to the output units.

Since our MT neurons receive direct input from the speed-tuned input units, their CRFs are endowed with this type of speed selectivity. The bandwidth of our tuning curve corresponds to that of the tuned MT cells recorded experimentally (Lagae et al., 1993). In any case, the speed selectivity of 58 out of 147 MT cells reported in Lagae et al. (1993) belongs to this (prevailing) type. We also assume that the hidden units are direction selective, with their preferred direction along the y-axis (see Figure 1), so that we can confine ourselves to translational motion along the y-axis only.

The hidden units are arranged in a 7 × 7 lattice. The output functions of these units are sigmoids (including a bias term). The connection weights from the input layer to each hidden unit are initialized with a connection kernel, that is, a smooth, decreasing function of the distance in projection plane coordinates.
Figure 2: General outline of the three-layered network model. The gray-shaded ellipses represent the gaussian inter- and intralayer connection kernels of hidden unit j, prior to learning.
Figure 3: Idealized speed-tuning curve of an input unit. The response is plotted as a function of the velocity V present at the retinal coordinate (projected motion field coordinate) represented by the unit. The preferred (optimal) velocity Vopt = 0.75 and the lowest and highest velocities are 3/32 and 3, respectively (plotted on a log scale). Both the velocity and response scales are expressed in arbitrary units, without loss of generality.
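The article specifies this tuning curve only graphically, so any closed form is an assumption; a gaussian in log velocity, peaked at Vopt = 0.75, reproduces its qualitative shape:

import numpy as np

def input_unit_response(v, v_opt=0.75, sigma_oct=1.0):
    """Assumed log-domain gaussian speed tuning; sigma_oct (in octaves) is a
    guess chosen to roughly match the bandwidth drawn in Figure 3."""
    v = np.maximum(np.asarray(v, dtype=float), 1e-9)
    return np.exp(-0.5 * (np.log2(v / v_opt) / sigma_oct) ** 2)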
This is done in order to constrain the learning process so that it will preferentially develop localized CRFs with no motion gradient tuning of their own.1 For this localized connection kernel, we have used an elliptical gaussian, since the great majority of MT cells have elliptical CRFs (Raiguel et al., 1995). The weights connecting the input units to a hidden unit are initialized according to a rotated elliptical gaussian connection kernel,

$$W_{ij} = \exp\left(-\frac{x_r^2}{A^2} - \frac{y_r^2}{B^2}\right), \tag{3.1}$$

with $W_{ij}$ the weight from input unit i to hidden unit j, A and B the half-lengths of the long and short axes of the ellipse, and $x_r$ and $y_r$ the rotated projection plane coordinates, rotated over an angle ω in the projection plane:

$$\begin{bmatrix} x_r \\ y_r \end{bmatrix} = \begin{bmatrix} \cos\omega & \sin\omega \\ -\sin\omega & \cos\omega \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}, \tag{3.2}$$

with x and y the horizontal and vertical relative projection plane coordinates, relative with respect to the center of unit j's CRF. The long and short axes are chosen randomly within the intervals [3, 4.5) and [1.5, 3), respectively (expressed in input layer coordinates); the rotation angle ω is chosen randomly within [0, π).

1 Treue and Andersen (1993) reported that approximately one-third of the MT neurons display a weak motion gradient tuning, but the stimuli they used were confined to the CRF only. Recently, Orban and coworkers have conducted a detailed analysis in macaque area MT, which clearly shows that motion gradient tuning does not arise from the CRF but rather from the asymmetric surround (Xiao et al., 1997). Hence, in our model, we want to examine the role of the surround in motion gradient tuning, and not that of the CRF.

Each hidden unit also receives input from all other hidden units (see Figure 2). The intralayer connection weights are defined by a negative, circular-symmetrical gaussian kernel, modulated by the unit's activation,

$$W_{mj} = -\exp\left(-\frac{x^2}{\sigma^2}\right) \cdot U_j, \tag{3.3}$$

with $W_{mj}$ the connection weight hidden unit m makes with hidden unit j, x the Euclidean distance in hidden-layer coordinates, σ the radius of the gaussian, and $U_j$ the current activation of unit j. The last is required in order to ensure that the surround will reveal its presence only by the effect on the CRF response.2

2 By this requirement, the hidden units become highly nonlinear, and hence we cannot consider the newer (unsupervised) learning algorithms for developing localized RFs (Olshausen & Field, 1996, p. 609). These algorithms assume linear response properties and work well when modeling simple cells in the primary visual cortex; however, it remains an open question how these algorithms could provide an account of other, more complex and nonlinear response properties of neurons at later stages in the visual pathway. Hence, by the nonlinear center-surround interaction in the MT cells and by the restriction that the CRF itself may not display any motion gradient tuning (see footnote 1), we have opted for an initialization with localized connection kernels for the CRF and the surround separately.
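A minimal sketch of this initialization follows. Two details are our assumptions, since the article leaves them open: the mapping of the 7 × 7 hidden lattice onto CRF centers in the 13 × 13 input, and the scaling of the hidden-lattice coordinates entering equation 3.3 (here normalized to [0, 1) so that σ = 0.25 yields graded inhibition).

import numpy as np

def init_forward_weights(n_in=13, n_hid=7, rng=None):
    """Input-to-hidden weights as rotated elliptical gaussians
    (equations 3.1 and 3.2), one kernel per hidden unit."""
    rng = rng or np.random.default_rng(0)
    W = np.zeros((n_hid * n_hid, n_in * n_in))
    xs, ys = np.meshgrid(np.arange(n_in, dtype=float), np.arange(n_in, dtype=float))
    for j in range(n_hid * n_hid):
        cx = (j % n_hid + 0.5) * n_in / n_hid      # assumed CRF-center placement
        cy = (j // n_hid + 0.5) * n_in / n_hid
        A = rng.uniform(3.0, 4.5)                  # long half-axis in [3, 4.5)
        B = rng.uniform(1.5, 3.0)                  # short half-axis in [1.5, 3)
        w = rng.uniform(0.0, np.pi)                # rotation angle in [0, pi)
        x, y = xs - cx, ys - cy                    # coordinates relative to the center
        xr = x * np.cos(w) + y * np.sin(w)         # equation 3.2
        yr = -x * np.sin(w) + y * np.cos(w)
        W[j] = np.exp(-(xr / A) ** 2 - (yr / B) ** 2).ravel()   # equation 3.1
    return W

def intralayer_weights(U, n_hid=7, sigma=0.25):
    """Activation-modulated inhibitory kernel (equation 3.3), with toroidal
    distances on the hidden lattice, normalized to [0, 1) by assumption."""
    idx = np.arange(n_hid * n_hid)
    dx = np.abs(idx[:, None] % n_hid - idx[None, :] % n_hid)
    dy = np.abs(idx[:, None] // n_hid - idx[None, :] // n_hid)
    dx = np.minimum(dx, n_hid - dx) / n_hid        # toroidal wrap-around
    dy = np.minimum(dy, n_hid - dy) / n_hid
    K = -np.exp(-(dx ** 2 + dy ** 2) / sigma ** 2)
    np.fill_diagonal(K, 0.0)                       # no self-connection
    return K * U[None, :]                          # W[m, j] scales with U_j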
Note that by these intralayer connections, the hidden units form a recurrent network. In order to avoid boundary effects, we assume a toroidal intralayer connection pattern. Finally, each hidden unit projects to all output units. The 9 output units have linear output functions and code for nine slant-tilt combinations (in degrees): 60/0, 0/0, 60/180, 60/45, 60/90, 60/135, 60/315, 60/270, and 60/225, respectively.

The network is trained with a modified backpropagation algorithm operating in batch mode (see the Appendix); the radius σ in equation 3.3 is determined by cross-validation in steps of 0.05 (the best result is obtained for σ = 0.25). We consider two speeds at which the plane moves, Vp = 0.5 and 1.3 The training set is obtained by calculating, for the nine slant and tilt combinations, the corresponding projected motion fields using equation 2.3. From these motion fields, the output of each input unit is calculated according to its speed-tuning characteristic (see Figure 3). For a given slant-tilt combination, the desired output of the network is 1 for the corresponding output unit and 0 for all other units (1-out-of-N coding). One epoch consists of applying each training pattern 100 times, and learning continues until the mean squared training error is below 10⁻² or slowly decreasing. Typically, about 50,000 training epochs are needed.

3 We limit the speed range also since the scaling factor in our SFM problem depends on Vp.

4 Simulation Results

Here we focus on the response properties of the hidden units and map their RFs in a manner similar to single-cell recording. In order to map the CRF, we apply a localized velocity stimulus Vopt (one pixel in size), which optimally activates the corresponding input unit, and take the response of the hidden unit as the magnitude of its CRF at that location (in projection plane coordinates). We repeat this procedure for all input units in order to obtain the hidden unit's CRF. In order to map the surround, we first apply a localized velocity stimulus Vopt to the input unit corresponding to the center of the CRF and denote the hidden unit's response as RC. We then choose an additional (different) input unit, apply to it a localized velocity stimulus Vopt, and denote the response of the hidden unit to both stimuli as RC+S. This response results from the interaction between CRF and surround and is assigned to the position of the additional input unit (in projection plane coordinates). The neurophysiologists then consider the net contribution from the surround as the difference RC+S − RC (Raiguel et al., 1995). We repeat the previous procedure for all input units different from the one representing the CRF center.
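The mapping procedure just described can be written compactly. The sketch below reuses the hypothetical intralayer_weights helper from the earlier initialization sketch, and it assumes a single feedforward pass followed by one lateral pass, as described in section 3; the bias handling and number of relaxation steps are our simplifications.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_activations(inp, W_in, bias, sigma=0.25):
    """Feedforward activation of the hidden (MT) layer, then one pass of
    activation-modulated lateral inhibition (equation 3.3)."""
    U = sigmoid(W_in @ inp + bias)
    W_lat = intralayer_weights(U, sigma=sigma)     # from the earlier sketch
    return sigmoid(W_in @ inp + W_lat.T @ U + bias)

def map_crf_and_surround(unit, W_in, bias, n_in=13, v_opt_resp=1.0):
    """CRF map: one optimally driven pixel at a time. Surround map: CRF-center
    pixel plus one other pixel, reported as R_{C+S} - R_C."""
    n_px = n_in * n_in
    crf = np.empty(n_px)
    for i in range(n_px):
        stim = np.zeros(n_px); stim[i] = v_opt_resp
        crf[i] = hidden_activations(stim, W_in, bias)[unit]
    center = int(np.argmax(crf))                   # CRF center = peak response
    stim_c = np.zeros(n_px); stim_c[center] = v_opt_resp
    r_c = hidden_activations(stim_c, W_in, bias)[unit]
    surround = np.zeros(n_px)
    for i in range(n_px):
        if i == center:
            continue
        stim = stim_c.copy(); stim[i] = v_opt_resp
        surround[i] = hidden_activations(stim, W_in, bias)[unit] - r_c
    return crf.reshape(n_in, n_in), surround.reshape(n_in, n_in)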
Figure 4: Evolution of the CRF (top) and the surround (bottom) map of hidden unit 31 during training. From left to right: The CRFs and the surrounds at 0 (initialization), 300, 4500, 23,123, and 46,035 epochs (convergence). The CRF and surround maps are represented by gray levels. For the sake of representation, the CRF maps are normalized between 0 (black) and 1 (white); the surround maps are normalized between their minimal (negative) value (black) and maximal value (white). The response range represented by the initial surround map is at least 5 to 10 times smaller than that of the other surround maps. Note that, by definition, the position of the CRF center in each surround map always corresponds to a zero response value (cf. the isolated pixel in the initial surround map).
After the connection weights are initialized, the network is run until the average squared error between the activity of the output units, averaged over the nine slant-tilt combinations, and the desired activities for these units is lower than 10⁻². This is achieved after 46,035 training epochs. The output units are active only for a particular slant-tilt combination (1-out-of-N coding), but it is clear that these units could have been trained to represent any other type of coding or slant-tilt tuning curve.

Figure 4 depicts the evolution of the CRF and the surround of a hidden unit during training; the other hidden units undergo a similar evolution. This shows that although the CRFs and intralayer connection kernels are initialized with gaussians, the learning algorithm continues to change the connection weights until the desired slant-tilt selectivity is achieved at the output stage. From the CRFs and surrounds of all hidden units obtained at convergence (a selection of surrounds is shown in the first column of Figure 6), the following observations can be made:

1. The CRFs obtained are all excitatory and correspond almost completely to the spatial layout of the forward connections. Furthermore, the CRFs do not display an internal structure that could favor slant detection (see footnote 1). Their shapes range from radially symmetrical to strongly elliptical.
Figure 5: Motion field configuration used in the surround asymmetry tests (SAT1 and SAT2). Nine patches of 3 × 3 pixels are shown: one central patch (C) and eight satellite patches, of which two are labeled (S1, S5). The gray shading filling these patches indicates the uniform speed applied in the corresponding motion fields (lighter shadings correspond to higher speeds).
2. The surrounds are predominantly antagonistic and, for most hidden units, the surround inhibition is confined to restricted regions on opposite sides of the CRF (unit 43 in Figure 6) or on one side of the CRF (unit 46). These surround types are called bilaterally symmetric and asymmetric, respectively. Others display a more radially symmetric surround (unit 26) or a slightly more elongated one (unit 31). In addition, the axis connecting the inhibitory blobs of the bilaterally symmetric surround tends to be orthogonal to the long axis of the elliptically shaped CRF.

In order to classify the surrounds, we adopt the same procedure used for classifying experimentally recorded MT cells (Xiao et al., 1995). First, the surround asymmetry test (SAT) is applied to test eight satellite patches arranged symmetrically around the CRF center (see Figure 5). Patch C is centered on the CRF center and is stimulated uniformly at the optimal speed (Vopt in Figure 3). The response to patch C is recorded and used as a reference for determining the degree of inhibition obtained when stimulating both the central patch and one satellite patch (SAT1) or the central patch and two opposite satellite patches at the same time (e.g., S1 and S5) (SAT2). The degree of inhibition elicited by each stimulus is then plotted in polar coordinates (second column in Figure 6). Second, for each SAT1 and SAT2 plot, the ratio between the strongest and weakest degree of inhibition is determined.
Table 1: Proportion of Surround Types Found for the Model (49 Units) versus Area MT (86 Cells).

           RS     BS     A
Model      18%    29%    53%
Area MT    23%    32%    45%

Note: Radially symmetric (RS), bilaterally symmetric (BS), and asymmetric (A).
If this ratio is smaller than 2:1 for both the SAT1 and SAT2 plots, then the surround is classified as radially symmetric; otherwise, the surround is considered to be heterogeneous. In the latter case, if the largest ratio is obtained for the SAT1 plot, then the surround is classified as asymmetric; otherwise, it is classified as bilaterally symmetric. The result of this classification is listed in Table 1.
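This decision rule reduces to a few comparisons. A sketch, where the inhibition values are assumed to come from the SAT measurements described above:

def classify_surround(sat1, sat2, ratio_threshold=2.0):
    """sat1: eight inhibition values, one per satellite patch (SAT1);
    sat2: four values, one per opposite-patch pair (SAT2)."""
    eps = 1e-12
    r1 = max(sat1) / (min(sat1) + eps)
    r2 = max(sat2) / (min(sat2) + eps)
    if r1 < ratio_threshold and r2 < ratio_threshold:
        return "radially symmetric"
    # heterogeneous: SAT1 dominant -> asymmetric, else bilaterally symmetric
    return "asymmetric" if r1 >= r2 else "bilaterally symmetric"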
In the experimental results described in Xiao et al. (1995), the surrounds of 86 MT cells were classified as indicated in Table 1. Furthermore, the axis of the bilaterally symmetric surrounds of these cells tends to be orthogonal to the long axis of the elliptically shaped CRF. Hence, the surround types obtained with our model are in accordance with the experimental evidence.

In order to show that the surrounds are speed-tuned, we apply a test procedure similar to the one used by Tanaka and coworkers (1986) for monitoring the speed tuning of the surround inhibition in MT cells. A patch sized 3 × 3 pixels is centered on the CRF center and stimulated uniformly at the hidden unit's optimal speed (Vopt). The remaining part of the motion field (retina) is stimulated uniformly at lower or higher speeds. The response magnitude elicited by the combined CRF and surround stimulation is then divided by the response to the CRF stimulation alone. The resulting tuning curves for all hidden units were observed to belong to three classes (see the third column of Figure 6): the V(U)-shaped tuning curve (units 26 and 43), the wedge-shaped tuning curve (unit 46), and the monotonically decreasing tuning curve (unit 31). Except for the wedge-shaped curve, which to some extent resembles the monotonically decreasing tuning curve, the tuning curves obtained correspond to the ones described by Tanaka and coworkers.

Finally, we determine the tilt tuning curves of our hidden units and parameterize them with respect to the slant. We apply to the network the projected motion fields corresponding to a slanted and tilted surface moving at speed Vp = 1 and record the hidden unit's response magnitude (last column in Figure 6). The tilt tuning curves are largely in accordance with what is expected from the SATs: a larger degree of inhibition at a certain tilt angle usually corresponds to a lower response magnitude in the tilt tuning curve at the same angle (except for the tilt tuning curve for slant θ = 0 degrees, which is always circular, by definition). The correspondence need not be perfect because of the differences in the techniques used for measuring the SATs and the tilt tuning curves. In order to verify whether the tilt angle of the moving plane can be predicted from the SATs (Xiao et al., 1997), we have plotted, for the units with an asymmetric surround, the average tilt direction of the tilt tuning curves and the average direction of inhibition in the SATs (see the caption of Figure 6). In the ideal case, both vectors should point in opposite directions, which is the case for unit 31 but much less so for unit 46. This behavior is also observed with actual MT cells (Xiao et al., 1997).

5 Discussion

Since the majority of the MT cells display a surround that envelops the classical receptive field, the presence of this surround and its role has received
much attention in the neuroscience community. In this article, we have addressed the putative role of MT surrounds by developing a computational model with a clear purpose in mind. A three-layered neural network was trained to represent the slant and tilt of a translating flat surface from its projected motion vectors. The training procedure was used solely as a technique for developing the network and was not intended to be a model of learning in MT. The surrounds of the hidden units emerged through the training procedure as appropriate for extracting slant and tilt.

Our model therefore has a direct parallel with the neural network model of Lehky and Sejnowski (1988) for extracting shape from shading (SFS). Their network was trained to compute curvature of Lambertian surfaces using only shading information. The hidden units developed RFs similar to those of simple cells in the primary visual cortex.
Figure 6: Facing page. Surround maps and motion processing properties of hidden units 26, 31, 46, and 43 at convergence (row-wise, from top to bottom). First column: The surround maps of the four hidden units. Based on the SAT plots (second column), the surrounds are classified as radially symmetric (unit 26), asymmetric (units 31 and 46), and bilaterally symmetric (unit 43). Note that the surround map of unit 31 corresponds to the right-most one in Figure 4. The surround map conventions are as in Figure 4. Second column: The SAT1 plots (except for unit 43) and the SAT2 plot (unit 43). The SAT2 plot is shown since this test was decisive in classifying unit 43. The SAT plots are shown for three satellite patch speeds: V = 0.75 (thick full line), 1 (thin full line), and 1.125 (dashed line). The plots are normalized to unit length radius with respect to the maximal radius obtained for V = 0.75, the unit's optimal speed Vopt. The origin represents zero inhibition. Third column: The speed-tuning curves of the surround inhibition, plotted (on a log scale) for different ratios between the surround speed Vs and the CRF speed Vc, which is always Vopt. The curves are normalized with respect to the CRF response alone, without stimulating the surround. Fourth column: The tilt tuning curves. Each set of curves represents the unit's response plotted in polar coordinates in steps of 20-degree tilt angle. The curves shown are parameterized with respect to five slant angles θ = 0, 20, ..., 80 degrees and normalized with respect to the largest response vector in the set. The curve for θ = 0 degrees is a circle, by definition, and the other curves become progressively less circular and more restricted in size as θ increases. For the sake of exposition, the tilt angle is defined as the angle between the direction of increasing speed gradient and the horizontal axis. For units 31 and 46 only: The full-line arrow represents the average tilt direction, defined as the vector sum of all response vectors taken over all tilt and slant angles shown. Similarly, the dashed arrow represents the average direction of inhibition, defined as the vector sum of all degree-of-inhibition vectors shown in the corresponding SAT plots. The result of the vector summation is then normalized to unit length.
This differs from our model in that the output units of the SFS model had biological counterparts as well. As far as we know, the output units in our case do not have biological counterparts, but it is clear that they can be endowed with slant-tilt tuning characteristics other than those assumed (i.e., a 1-out-of-N coding).

Single-cell recordings show that 95% of the MT neurons exhibit a marked directional selectivity (Zeki, 1974; Maunsell & Van Essen, 1983; Albright, 1984). Furthermore, MT has been shown to contain a systematic columnar arrangement of directionally selective neurons (Albright, Desimone, & Gross, 1984). All model neurons we have developed are assumed to belong to the same direction column and to be stimulated along their preferred direction of motion. Hence, since we do not have to consider any direction of motion in the input units, the overall network structure is greatly simplified and the training time dramatically reduced.

In the presence of a rotational component in the motion of the slanted plane, for example, around the line of sight, the projected motion vectors are a function of both the slant and the rotational velocity of the plane. Since MT cells respond to rotation-induced motion vectors, but without being rotationally selective (Lagae et al., 1994), the individual MT cell's response confounds slant- and rotation-induced motion stimuli. In our model, the MT neurons are also prone to this confusion; in addition, since the purpose of the model was to develop (i.e., to train) slant- and tilt-selective units from the gradient of the projected motion vectors, we had to confine ourselves to translational motion. This is a limitation of the training procedure, which was, as we said, not intended to be a model of learning in MT.

Raiguel et al. (1995) studied the spatial organization of RFs in area MT of anesthetized and paralyzed macaque monkeys. The size and shape of the RF was found by fitting with a generalized gaussian. Most of these RFs are elongated, and the axis of elongation tends to be orthogonal to the preferred direction of motion. The extent of the surround at half-height equaled, on average, three to four times that of the CRF. Raiguel et al. then suggested that antagonistic surrounds are constructed in MT, most likely through horizontal connections. The model developed in this article also contains horizontal connections for the MT units, and the antagonistic surrounds obtained are in accordance with those found experimentally. Hence, our model supports the hypothesis about the putative origin of the antagonistic surrounds.

Finally, although the hidden units in our network model are broadly tuned, only one additional (output) layer, with linear units and linear connections, suffices to recover the slant and tilt of a translating plane. Hence, we conjecture that populations of MT cells, with different surround types, can code linearly for slant and tilt of translating planes.

Appendix: Learning Algorithm

/* Let {W_ij}, {W_mj} be the sets of hidden unit j's forward and
   feedback (intralayer) connections;
   Let {W_jk} be the set of output unit k's forward connections;
   Let M be the number of training (input) patterns;
   Let V^μ be training pattern μ, determined using equation 2.3 and Figure 3;
   Let U_j be hidden unit j's output activation;
   Let E be the training error */

/* Initialization */
Initialize {W_ij} using equation 3.1, and {W_mj} using equation 3.3, given σ
Initialize {W_jk} randomly within the interval [0, 1)
Set learning rate η ← 0.3
Set training error value E_past to a large initial value

/* Backpropagation-through-time algorithm (Rumelhart, Hinton, & Williams,
   1986), operating in batch mode, and modified: (1) to encompass weighted
   and modulated intralayer connections W_mj(U_j) (equation 3.3);
   (2) to speed up training by modifying η and by adding noise to it */
Do until training error E < 10⁻² or slowly decreasing {
    Set {{ΔW_ij ← 0, ∀i ∈ input layer}, ∀j ∈ hidden layer};
        {{ΔW_jk ← 0, ∀j ∈ hidden layer}, ∀k ∈ output layer}
    Do 100 times /* the size of one epoch */ {
        Do μ = 1 to M /* ∀ training patterns */ {
            Take input pattern V^μ
            Propagate activity from input to hidden layer
            Calculate initial activations of hidden neurons: U_j, ∀j ∈ hidden layer
            Propagate U_j, ∀j, back to hidden units, through W_mj(U_j) (equation 3.3)
            Calculate new activations of hidden neurons: U_j, ∀j
            Propagate activity from hidden to output layer
            /* Backprop step */
            Determine output and hidden unit training errors and:
            Do ∀j ∈ hidden layer {
                ΔW_ij ← ΔW_ij − η ∂E/∂W_ij, ∀i ∈ input layer
                ΔW_jk ← ΔW_jk − η ∂E/∂W_jk, ∀k ∈ output layer
            }
        }
    }
    /* Update all connection weights */
    Do ∀j ∈ hidden layer {
        W_ij ← W_ij + ΔW_ij, ∀i ∈ input layer
        W_jk ← W_jk + ΔW_jk, ∀k ∈ output layer
    }
    E_present ← E /* Store training error */
    /* Modify learning rate η */
    If (E_present < E_past) {
        Lower η to the nearest discrete value on a log2 scale
        (ranging from η = 0.01 to 0.5)
        Store present weight configuration and training error E_present
    } else {
        Restart with the best weight configuration obtained thus far,
        including its E_present value
    }
    /* To help escape from small local minima (or plateaus): */
    Add noise to η proportional to its magnitude
    E_past ← E_present
}
Acknowledgments

We express our gratitude to D.-K. Xiao, Laboratorium voor Neuro- en Psychofysiologie, K. U. Leuven, whose input was indispensable in shaping the model. We also thank G. A. Orban and S. Raiguel, of the same laboratory, for helpful discussions, and H.-J. Sun, Institute of Biophysics, Academia Sinica, Beijing, for his help in developing the learning algorithm. The first author is supported by a postdoctoral grant received from the Research Fund of the K. U. Leuven (F/95/138). The second author is a research associate of the Fund for Scientific Research—Flanders (Belgium) and is supported by research grants received from the Fund for Scientific Research (G.0185.96) and the European Commission (ECVnet EP8212).

References

Albright, T. D. (1984). Direction and orientation selectivity of neurons in visual area MT of the macaque. J. Neurophysiol., 52, 1106–1130.
Albright, T. D. (1993). Cortical processing of visual motion and its role in the stabilization of gaze. In F. A. Miles & J. Wallman (Eds.), Visual motion and its role in the stabilization of gaze (pp. 177–201). Amsterdam: Elsevier.
Albright, T. D., Desimone, R., & Gross, C. G. (1984). Columnar organization of directionally selective cells in visual area MT of the macaque. J. Neurophysiol., 51, 16–31.
Allman, J. M., & Kaas, J. H. (1971). A representation of the visual field in the caudal third of the middle temporal gyrus of the owl monkey (Aotus trivirgatus). Brain Res., 31, 85–105.
Allman, J., Miezin, F., & McGuinness, E. (1985). Direction- and velocity-specific responses from beyond the classical receptive field in the middle temporal area (MT). Perception, 14, 105–126.
Andersen, R. A., & Siegel, R. M. (1986). Two- and three-dimensional structure from motion sensitivity in monkeys and humans. Soc. Neurosci. Abstr., 12, 1183.
Buračas, G. T., & Albright, T. D. (1994). The role of MT neuron receptive field surrounds in computing object shape from velocity fields. Advances in Neural Information Processing Systems, 6, 969–976.
Buračas, G. T., & Albright, T. D. (1996). Contribution of area MT to perception of three-dimensional shape: A computational study. Vis. Res., 36, 869–888.
Droulez, J., & Cornilleau-Pérès (1990). Visual perception of surface curvature: The spin variation and its physiological implementation. Biol. Cybern., 62, 211–224.
Dubner, R., & Zeki, S. M. (1971). Response properties and receptive fields of cells in an anatomically defined region of the superior temporal sulcus in the monkey. Brain Res., 35, 528–532.
Hussain, M., Treue, S., & Andersen, R. A. (1989). Surface interpolation in three-dimensional structure-from-motion perception. Neural Computation, 1, 324–333.
Koenderink, J. J., & van Doorn, A. J. (1992). Second order optic flow. J. Opt. Soc. Am. A, 9, 530–538.
Lagae, L., Gulyás, B., Raiguel, S., & Orban, G. A. (1989). Laminar analysis of motion information processing in macaque V5. Brain Res., 496, 361–367.
Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. J. Neurophysiol., 71, 1597–1626.
Lagae, L., Raiguel, S., & Orban, G. A. (1993). Speed and direction selectivity of macaque middle temporal neurons. J. Neurophysiol., 69, 19–39.
Lehky, S. R., & Sejnowski, T. J. (1988). Network model of shape-from-shading: Neural function arises from both receptive and projective fields. Nature, 333, 452–454.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction. J. Neurophysiol., 49, 1127–1147.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagass, R. Gattas, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). Heidelberg: Springer-Verlag.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Raiguel, S., Van Hulle, M. M., Xiao, D.-K., Marcar, V. L., & Orban, G. A. (1995). Shape and spatial distribution of receptive fields and antagonistic motion surrounds in the middle temporal area (V5) of the macaque. Eur. J. Neurosci., 7, 2064–2082.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Tanaka, K., Hikosaka, H., Saito, H., Yukie, Y., Fukada, Y., & Iwai, E. (1986). Analysis of local and wide-field movements in the superior temporal visual area of the macaque monkey. J. Neurosci., 6, 134–144.
Treue, S., & Andersen, R. A. (1993). Tuning of MT cells to velocity gradients. Invest. Ophthalmol. Vis. Sci., 34, 813.
Treue, S., Andersen, R. A., Ando, H., & Hildreth, E. C. (1995). Structure from motion: Perceptual evidence for surface interpolation. Vis. Res., 35, 139–148.
Xiao, D.-K., Marcar, V. L., Raiguel, S. E., & Orban, G. A. (1997). Selectivity of macaque MT/V5 neurones for surface orientation in depth specified by motion. Eur. J. Neurosci., 9, 956–964.
Xiao, D.-K., Raiguel, S., Marcar, V., Koenderink, J., & Orban, G. A. (1995). Spatial heterogeneity of inhibitory surrounds in the middle temporal visual area. Proc. Natl. Acad. Sci. USA, 92, 11303–11306.
Zeki, S. M. (1974). Functional organization of a visual area in the posterior bank of the superior temporal sulcus of the rhesus monkey. J. Physiol., 236, 549–573.

Received August 28, 1996; accepted May 6, 1997.
Communicated by Steven Nowlan
A Self-Organizing Neural Network Architecture for Navigation Using Optic Flow

Seth Cameron
Stephen Grossberg
Frank H. Guenther
Department of Cognitive and Neural Systems and Center for Adaptive Systems, Boston University, Boston, MA 02215, U.S.A.
This article describes a self-organizing neural network architecture that transforms optic flow and eye position information into representations of heading, scene depth, and moving object locations. These representations are used to navigate reactively in simulations involving obstacle avoidance and pursuit of a moving target. The network's weights are trained during an action-perception cycle in which self-generated eye and body movements produce optic flow information, thus allowing the network to tune itself without requiring explicit knowledge of sensor geometry. The confounding effect of eye movement during translation is suppressed by learning the relationship between eye movement outflow commands and the optic flow signals that they induce. The remaining optic flow field is due to only observer translation and independent motion of objects in the scene. A self-organizing feature map categorizes normalized translational flow patterns, thereby creating a map of cells that code heading directions. Heading information is then recombined with translational flow patterns in two different ways to form maps of scene depth and moving object locations. Most of the learning processes take place concurrently and evolve through unsupervised learning. Mapping the learned heading representations onto heading labels or motor commands requires additional structure. Simulations of the network verify its performance using both noise-free and noisy optic flow information.

1 Introduction: Optic Flow, Heading, and Visual Navigation

As we move through the world, we experience flowing patterns of light on our retinas. Scientists have studied the ability of humans to use this optic flow for a variety of tasks, including determination of heading (e.g., Crowell & Banks, 1993; Van den Berg, 1992, 1993; Warren & Hannon, 1988, 1990; Warren & Kurtz, 1992), observer velocity (e.g., Monen & Brenner, 1994), three-dimensional structure (e.g., Cornilleau-Peres & Droulez, 1993; Hildreth, Grzywacz, Adelson, & Inada, 1990; Treue, Andersen, Ando, & Hildreth, 1995), the locations and speeds of self-moving objects (e.g., Brenner,
c 1998 Massachusetts Institute of Technology °
314
Seth Cameron, Stephen Grossberg, and Frank H. Guenther
1991), and distances to objects (e.g., Prazdny, 1980; Simpson, 1988, 1993). With the exception of determining observer velocity (Monen & Brenner, 1994), humans have proved to be very proficient in using optic flow to perform these tasks. This article describes a neural network that uses optic flow information and eye position to determine the heading of a moving observer. This heading network is embedded in a larger architecture that includes neural networks for determining distances to visible objects and the locations of moving objects. The resulting representations of heading, scene depth, and moving object locations are then used to navigate reactively around visible obstacles and to pursue moving targets. The aim of this project is to use insights from biology to design a robust system that can be used for navigation in autonomous mobile vehicles. To this end, the system is designed as an interconnected collection of self-organizing neural networks. The primary advantage of this approach is that it removes the need to hand-calibrate visual navigation systems for mobile robots. The network will automatically adjust to a camera system’s focal length, pixel layout, sensor size, and tracking speed without writing new software or performing carefully controlled calibrations. This competence also renders the system tolerant to error in sensor responses and changes in sensor and motor properties due to wear. A central problem addressed by the current architecture is the determination of the heading direction of a moving observer from the optic flow field. Gibson (1950) remarked that the optical flow pattern experienced by an observer moving along a straight line through a static environment contains a singularity he termed the focus of expansion (FOE). Gibson further hypothesized that the visual system might use the global pattern of radial outflow originating from the singularity to determine the translational heading of the observer. However, psychophysical experiments rejected this hypothesis because the flow pattern on the retina is radically altered by eye movements (Regan & Beverly, 1982). If the observer’s eyes rotate during translational movement, the resulting flow pattern is a superposition of two vector fields such that the FOE no longer necessarily coincides with heading direction. The current architecture learns to cancel the effects of eye rotations so that a moving robot can maintain an accurate sense of heading while visually searching a scene or visually tracking objects. The issue of whether humans use extraretinal signals from eye rotations to derotate the flow field has been hotly debated in the experimental literature, since it raises difficult methodological questions concerning the nature of the motion, its range of speeds, and the types of environmental cues that are available (Banks, Ehrlich, Backus, & Crowell, 1996; Regan & Beverly, 1982; Rieger & Toet, 1985; Royden, 1994; Royden, Banks, & Crowell, 1992; Royden, Crowell, & Banks, 1994; van den Berg, 1993; Warren, 1995; Warren & Hannon, 1988, 1990; Warren, Li, Ehrlich, Crowell, & Banks, 1996). Warren (1996) has summarized much of the relevant literature by noting that extraretinal information is especially important in determining heading from
impoverished visual environments, but less so in naturalistic scenes with a rich, discontinuous three-dimensional structure. Our model indicates the types of self-tuning mechanisms that are able to make effective use of extraretinal information. Since Gibson's work, vision researchers have proposed a wide variety of methods to extract egomotion from optical flow information. The differential invariants method (Koenderink & van Doorn, 1975; Waxman & Ullman, 1985) solves for the motion of planar surfaces. A rigidly moving planar surface is a special case that gives rise to flow fields in which the flow vectors vary as a quadratic function of image position. Waxman and Ullman (1985), for example, first subdivided the flow field into patches that are approximately planar. The three-dimensional structure was then recovered in closed form from the parameters of the second-order flow field. An advantage of this technique is that the motion in each patch is computed independently, so scenes with multiple moving objects can be better interpreted. One drawback is that it requires dense optic flow fields to compute derivatives. Systems that use differential invariants are also vulnerable to error in the flow field (Hatsopoulos & Warren, 1991). Optic flow error is essentially unavoidable due to the aperture problem (e.g., Perrone, 1992). According to the local differential motion method (Cutting, 1986; Hildreth, 1992; Longuet-Higgins & Prazdny, 1981; Rieger & Lawton, 1985), if two points have the same image location but are at different depths, then the vector difference between the two flow vectors is oriented toward the FOE. This technique relies on locating adjacent image features separated in depth so that their flow vector difference is oriented approximately toward the FOE. It fails to operate when approaching a wall with no depth variation, and as with the differential invariant techniques, it does not work well on sparse or noisy flow fields. Heeger and Jepson (1990) proposed a solution to the heading calculation problem that uses least-squares minimization. Their technique, termed the subspace algorithm, is robust and comparatively insensitive to noise. It has the advantage of not requiring proprioceptive information from eye muscles to cancel the effects of eye rotations, but it does not learn its operating parameters. Several researchers have recently posited neural network models for determining heading from optic flow. Because of the parallel processing capabilities embodied in these networks, neural network solutions tend to be more tolerant of noise in the flow field. Hatsopoulos and Warren (1991) describe a neural network motivated by neurophysiological results from the middle temporal (MT) and medial superior temporal (MST) areas in primate visual cortex. Although the model is tolerant to both speed and directional noise in the flow field, the authors point out that it has two major shortcomings: (1) unlike humans, the model’s accuracy degrades drastically in the presence of eye rotations, and (2) the model relies on supervised learning (i.e., it requires an external teaching signal that would not be available to a developing animal). Perrone (1992) incorporated both translational and rotational detector cells in a model that offers an approximate solution to heading detection during eye rotations. Lappe and Rauschecker (1993) describe a neural network based on the algorithm of Heeger and Jepson (1990), but this network requires tracking fixation eye movements to determine heading and operates only within a hemisphere of heading directions. The Heeger and Jepson algorithm is forced to operate within a hemisphere of headings because it cannot distinguish between motions separated by 180 degrees. It relies on locating a minimal inner product between sets of heading weights and incoming flow vectors. The minimum occurs for weights that are orthogonal to the flow vectors. However, if a heading is reversed, the flow vectors will reverse, yet remain orthogonal to the weight vector. The Heeger and Jepson algorithm cannot distinguish opposite motions, making it difficult to use on robots with full translational capabilities. The current system can operate over the full sphere of heading directions. In addition to determining heading, the current architecture represents the locations of visible obstacles for the purpose of navigating around them. The machine vision community has published hundreds of articles about recovering three-dimensional structure from image motion. In general, the focus of this effort has been to build scene representations that reflect ground truth as accurately as possible. Over the years, a multiplicity of algorithms has emerged that attempt to deal with the inherent ambiguity in the task (e.g., sparse data and the aperture problem). Since the goal has been to reproduce scene geometry as accurately as possible, algorithmic simplicity and self-tuning have not been emphasized. Rather, most techniques assume that all relevant camera parameters are available. In fact, camera calibration is itself an active research topic. Most camera calibration techniques rely on imaging features at known locations (Holt & Netravali, 1991). Other techniques have been designed to use arbitrary scenes but require known movements (Dron, 1993), and more recently, a few techniques have attempted calibration without specific objects or movements (Luong & Faugeras, 1992). Despite the impressive amount of depth-from-motion research, it is still unclear whether identically reproducing the three-dimensional structure of the world is either feasible or necessary for autonomous robot navigation. The complexity, fragility, and computational load of most of these algorithms have not produced the desired results in mobile robotics. Only recently have researchers begun to concern themselves with how to solve visual navigation problems using simple, biologically plausible calculations embedded in self-tuning systems. Although some initial work has been done on very simple problems (Heikkonnen & Oja, 1993), the current system stands apart by using simple self-tuning calculations to learn representations of heading, scene structure, and moving object locations in the presence of eye rotations. The architecture uses parallelism to capitalize on the redundancy in the flow field in order to achieve noise tolerance. No claim is made that this system will provide complete and accurate results
Figure 1: Viewer-centered coordinate frame and perspective projection.
in all situations. In fact, no system that relies solely on optic flow can produce a veridical representation of the world that it can use for completely error-free navigation. However, the current system’s performance on simulations with noisy flow information suggests that its parallel, self-tuning, and computationally efficient calculations are sufficient to provide useful information for an autonomous robot.

2 The Optic Flow Field

Optic flow results from the projection of object movement in the three-dimensional world onto a two-dimensional image plane. The motion of an observer in three-dimensional space has six degrees of freedom: translational velocity T = (Tx, Ty, Tz)^t and rotation Ω = (Ωx, Ωy, Ωz)^t. If R = (X, Y, Z)^t is the position of a point relative to the observer’s eye, then its motion may be described by Vobserver = T + (Ω × R). If the observer is moving through a static environment, a viewer-centered coordinate frame may be established in which the observer appears to be standing still and each point in the environment is moving with the opposite motion, V = −(T + Ω × R). Figure 1 shows how a point, P, in three-dimensional space is projected onto a point r = (x, y)^t = f(X/Z, Y/Z)^t in the image plane, where f denotes the focal length of the lens (f = 1 was used in the simulations). Image velocity is defined as the derivative, with respect to time, of the x and y components of scene point projections and can be written as follows (Heeger & Jepson, 1990):

\[
\theta(x, y) = (v_x, v_y) = p(x, y)A(x, y)T + B(x, y)\Omega, \tag{2.1}
\]
where p(x, y) = 1/Z is the inverse depth of points in the image plane, and A(x, y) and B(x, y) are matrices that depend only on image position and focal length, not on any unknowns:

\[
A(x, y) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix}, \qquad
B(x, y) = \begin{pmatrix} \dfrac{xy}{f} & -\left(f + \dfrac{x^2}{f}\right) & y \\[6pt] f + \dfrac{y^2}{f} & -\dfrac{xy}{f} & -x \end{pmatrix}. \tag{2.2}
\]

The various processing stages of the current system can be thought of as decomposing the right side of equation 2.1 until we are left with a map representing the direction of the vector T, which is the observer’s heading direction. In the following description, we will be interested in the speed and direction of the flow components at each retinal location since these components form the input to the system. The speed at a retinal position is

\[
v = \|\theta\| = \sqrt{v_x^2 + v_y^2}, \tag{2.3}
\]

and the direction is

\[
\phi = \arctan\left(\frac{v_y}{v_x}\right). \tag{2.4}
\]
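To make the geometry concrete, the following minimal NumPy sketch evaluates equations 2.1 through 2.4 for a single image point. The function name and the sample motion values are our own choices for illustration, and np.arctan2 is used in place of atan so that the flow direction lands in the correct quadrant.

```python
import numpy as np

def optic_flow(x, y, Z, T, Omega, f=1.0):
    """Image velocity (vx, vy) at image point (x, y) for a scene point at
    depth Z, given observer translation T and rotation Omega (equation 2.1)."""
    p = 1.0 / Z                                        # inverse depth p(x, y)
    A = np.array([[-f, 0.0, x],
                  [0.0, -f, y]])                       # equation 2.2
    B = np.array([[x * y / f, -(f + x**2 / f), y],
                  [f + y**2 / f, -x * y / f, -x]])     # equation 2.2
    return p * (A @ T) + B @ Omega

theta = optic_flow(0.2, -0.1, Z=10.0,
                   T=np.array([0.0, 0.0, 1.0]),        # forward translation
                   Omega=np.array([0.0, 0.02, 0.0]))   # slow yaw rotation
v = np.linalg.norm(theta)                              # speed, equation 2.3
phi = np.arctan2(theta[1], theta[0])                   # direction, equation 2.4
```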
3 System Description

Figure 2 shows an overview of the neural network architecture. The following paragraphs describe the major components of this system.

3.1 Motion Field. The input layer is a population of cells sensitive to image flow on the retina. Several researchers have proposed physiologically plausible models that yield population encoding of optic flow vectors (e.g., Wang, Mathur, & Koch, 1989). A typical approach uses a separate floret of cells for each sampling point on the retina. Each cell of a floret is tuned to a preferred direction, ek = [cos(2πk/n), sin(2πk/n)], where k = 1, . . . , n. Florets of this form constitute the first stage of the system, called the motion detector field. Motion at the ith retinal location with speed v and direction φ generates a response Ski according to the following equation:

\[
S_{ki} = v \cos\left(\phi - \frac{2\pi k}{n}\right), \qquad 1 \le k \le n, \quad 1 \le i \le m. \tag{3.1}
\]

For robotic applications, it suffices to represent the flow field with its x and y components:

\[
S_{xi} = v\cos(\phi), \qquad S_{yi} = v\sin(\phi). \tag{3.2}
\]
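As a sketch of how these responses might be computed, the following NumPy functions render equations 3.1 and 3.2 directly; the names are ours, not the authors'.

```python
import numpy as np

def floret_responses(v, phi, n=12):
    """Responses S_k of the n directionally tuned cells of one floret to
    motion with speed v and direction phi (equation 3.1)."""
    k = np.arange(1, n + 1)
    return v * np.cos(phi - 2.0 * np.pi * k / n)

def xy_responses(v, phi):
    """Two-cell (x and y component) simplification used in the
    simulations (equation 3.2)."""
    return v * np.cos(phi), v * np.sin(phi)
```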
Figure 2: Overview of the system. Cell activities are represented by uppercase letters, and the modifiable synaptic weights projecting to these cells are represented by the same letters in lowercase. Each floret of velocity-tuned cells corresponds to a single retinal location. Simulations reported in this article utilized simplified versions of these florets that contained cells for only the x and y components of the flow fields. See text for details.
The simulations in this report used a motion detector field that consisted of these two directionally tuned cells at each of 49 (7 × 7) retinal locations unless noted otherwise. Although the primary goal of the present model is not to achieve detailed biological fidelity, it is of some interest that responses of many directionally tuned cells in the cortical motion processing stream—for example, cells in the MST area of monkeys (Orban, Lagae, Raiguel, Xiao, & Maes, 1995; Tanaka & Saito, 1989)—increase with speed. 3.2 Eye Velocity Vector. The eye velocity vector consists of two pairs of opponent cells that represent pitch velocity (O1 , O2 ) and yaw velocity (O3 , O4 ) of the eye when it is rotating in its orbit. For positive rotations about the x-axis, O1 is linearly related to the rotational speed and O2 is zero. For negative rotations, O1 is zero and O2 is linearly related to rotational speed. O3 and O4 code rotations about the y-axis in a similar manner. The eye velocity vector activities project through adaptive inhibitory weights to the translation field as described below. This inhibitory input learns to cancel the portion of the flow field corresponding to eye rotations. It is expected that the system could easily be expanded to cancel rotational flow due to other sources, such as head or wheel rotations, as long as the appropriate velocity vectors are available. Only eye rotations are included in the current implementation for the sake of simplicity. The eye velocity vector in the current system is most naturally interpreted as either an efference copy of eye velocity commands or a proprioceptive representation of eye velocity. In a mobile robot with an active vision system, this vector would correspond to motor velocity commands or velocity feedback from eye position sensors or both. In biological models of eye and arm movement control, processing stages exist that can generate an efference copy of commanded movement speed and direction (Bullock & Grossberg, 1988; Grossberg & Kuperstein, 1989). In most situations, people fixate on a point as they move. This behavior generates a mixed optic flow signal resulting from the simultaneous translation of the body and rotation of the eyes. Unless the point of fixation coincides with the direction of heading, the FOE that results when fixating during egomotion does not correspond to the direction of body translation. The current system uses an estimate of eye rotational velocity to negate internally the rotation part of the signal from the flow field. There are at least two other approaches to solving the heading problem during eye rotations. A less accurate approach is to sample a large part of the flow field and subtract off any component that is common at all depths. This, however, has the advantage of not requiring explicit knowledge of eye rotations (Perrone, 1992). Another approach, exemplified by the Heeger and Jepson (1990) algorithm, uses more sophisticated heading calculations that work in the presence of rotational components, again without requiring explicit knowledge of eye rotations. These more complex algorithms could be used
in place of the heading calculations performed by the current model while maintaining the current depth and independent motion algorithms.

3.3 Translation Field. As shown in Figure 2, each cell Oj in the eye velocity vector has a set of inhibitory weights wjki that project to each directionally tuned cell k in the floret at every position i in the translation field. Each translation field cell also receives excitatory input from the corresponding cell in the motion detector field. After learning (see section 4), the net result of these inputs is a field of cells that represents only the component of the flow field that is due to translational motion. That is, the translation field activity contains only the p(x, y)A(x, y)T component of equation 2.1. A variant of the vector associative map (VAM) neural architecture (Gaudiano & Grossberg, 1991) is used to adjust the weights projecting from the eye velocity vector in such a way as to cancel out the portion of flow corresponding to voluntary rotations of the eyes as registered at the eye velocity vector. The VAM allows both error-based learning and performance to take place simultaneously. This property has enabled VAMs to control self-adapting mobile robots autonomously (Gaudiano, Zalama, & Lopez-Coronado, 1996; Zalama, Gaudiano, & Lopez-Coronado, 1995). Using such a VAM, the activity of a cell in the translation field is formed by subtracting the inhibitory input from the eye velocity vector cells (corresponding to the rotational component of the flow field) from the excitatory input projecting from the motion detector field,

\[
W_{ki} = S_{ki} - \sum_j O_j w_{jki}, \tag{3.3}
\]

and the learning equation for the weights projecting from the eye velocity vector is:

\[
\frac{dw_{jki}}{dt} = \alpha_1 W_{ki} O_j, \tag{3.4}
\]
where α1 is a learning-rate parameter ranging from 0.9 for noise-free conditions to 0.01 for noisy conditions. Before learning, the weights are set equal to zero. The eye rotation cancellation map learns best when the incoming flow is purely rotational. If desired, this can be guaranteed by gating learning off if translational motion commands (e.g., limb movement or wheel rotation) are active. During purely rotational movements of the eye, the learning law of equation 3.4 adjusts the weights so that the inhibitory input from the eye velocity vector exactly cancels the excitatory input from the motion detector field. When the two inputs are equal, Wki in equation 3.4 goes to zero, and learning stops. At this point the system is tuned such that the rotational component of the flow field is effectively “filtered out” by eye velocity vector projections at the translation field.
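A compact sketch of this stage, assuming a simple Euler step of unit size for equation 3.4 and array shapes of our own choosing (n floret cells, m retinal locations, four eye velocity cells), might read:

```python
import numpy as np

def translation_field(S, O, w):
    """Equation 3.3: W_ki = S_ki - sum_j O_j w_jki.
    S: (n, m) motion detector activities; O: (4,) eye velocity vector;
    w: (4, n, m) inhibitory weights."""
    return S - np.tensordot(O, w, axes=1)

def vam_step(S, O, w, alpha1=0.9):
    """One unit-time Euler step of the learning law in equation 3.4:
    dw_jki/dt = alpha1 * W_ki * O_j."""
    W = translation_field(S, O, w)
    w += alpha1 * np.einsum('j,ki->jki', O, W)
    return W, w
```

Calling vam_step repeatedly during purely rotational eye movements drives the residual activity W toward zero, at which point the weights stop changing, as described above.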
During movements with both a translational and rotational component, only the rotational component is reliably correlated with the presynaptic activity of the eye velocity vector cells. The translational flow component will vary depending on heading direction and will not be strongly correlated with eye velocity vector activity. This translational component thus amounts to a form of noise in the learning process. A slower learning rate can be used to “average out” the noisy translational component, as well as other forms of noise, such as noisy sensor information. The ability to learn correct parameters in the face of sensor noise or translational motion is demonstrated in the simulations of section 4.

3.4 Normalization Field. The next stage of cells in the system performs a normalization operation on the outputs of the translation field. This is done because the magnitude of the flow vectors at the translation field contains information pertinent to observer speed and object distances only, not observer heading direction. In other words, we are not interested in the magnitude of the p(x, y)A(x, y)T term from equation 2.1. Only the direction of vector T is needed. Normalization removes the unpredictable changes in flow vector magnitudes due to varying object distances and translation speeds. The normalization calculation is:

\[
N_{ki} = \frac{[W_{ki}]^+}{\sqrt{\sum_m \left([W_{mi}]^+\right)^2}}, \tag{3.5}
\]

where [ ]+ denotes a rectification function such that [x]+ = x for x > 0, [x]+ = 0 otherwise. All Nki are set to zero for retinal locations where the total flow signal is zero. Although not implemented in the simulations, it may be useful to have a threshold for eliminating small but nonzero activities in the normalization map because optic flow detectors operating on a real image will be noisy. These locations typically include points with no visible objects, very distant objects, or points near the FOE for a given heading. If we assume that the detectors are more accurate at detecting the direction of large-flow signals than small-flow signals, then setting the threshold slightly higher would allow the network to base its heading decision on larger, more reliable signals.

3.5 Heading Map. The next stage of cells is a self-organizing feature map (SOFM) that encodes heading. Cells in the heading map receive weighted excitatory input projections from the normalized flow field. Heading can be determined by classifying the pattern across these inputs. Kohonen (1982) described a learning algorithm, based on earlier work of von der Malsburg (1973) and Grossberg (1976), that can be used to self-organize a topographical map whose cells classify input patterns. In our application, neighboring cells in the map code similar heading directions. During learning, neighborhood
interactions cause adjacent cells in the heading map to code similar headings. The interactions also ensure that the map uses all of its cells to encode headings. Otherwise, a single cell might learn large weights early on and continue to monopolize learning even though its match with the input pattern is poor. In addition, the neighborhood interactions cause heading map cells to distribute themselves automatically according to the frequency statistics of sampled headings. Greater discrimination among the more common headings is possible because these headings attract a higher density of map cells. The heading map consists of a group of cells with activities Hl that perform a maximum inner product classification (Grossberg, 1976; Kohonen, 1982) using weights hkil projecting from cells with activities Nki in the normalized flow field. For each presentation of a normalized flow pattern, one cell in the heading map will receive the largest total input. Early in development, this maximally excited cell and its N nearest neighbors are set to activity level 1/(N + 1), and all other map cell activities are set to zero. Weights projecting to each of the active cells are adjusted toward the flow pattern. Initially, large neighborhoods help the map cells, which begin with small, random weights chosen from a uniform distribution between 0 and 0.1, to “unfold” properly and thereby cover the entire range of experienced headings. Over time, the neighborhood size N shrinks to zero so that flow patterns are categorized more precisely. The following equations were simulated:

\[
H_l^* = \sum_{ki} N_{ki} h_{kil}, \qquad
H_{\max} = \text{the cell with maximum activity } H_l^*, \qquad
H_l = \begin{cases} \dfrac{1}{N+1} & \text{for } H_{\max} \text{ and its } N \text{ neighbors} \\[4pt] 0 & \text{otherwise,} \end{cases} \tag{3.6}
\]

\[
\frac{dh_{kil}}{dt} = \alpha_2 (N_{ki} - h_{kil}) H_l, \tag{3.7}
\]
where Hl∗ is the input to the lth heading map cell, and α2 is a learning rate parameter that starts at 0.1 and shrinks with a linear decay rate to 0.001 over 2000 learning trials. The neighborhood N starts as a 15 × 15 square centered at the maximally active cell. The width of this square shrinks by one after every 100 heading samples until the neighborhood consists of only the maximally active map cell. After training, a cell in the heading map will respond maximally to flow generated by a particular heading. However, most headings will fall between those encoded by cells in the heading map. During performance, winner-take-all competition among the heading cells is relaxed so that several candidate heading cells survive the competition based on the size of
their total input. Allowing distributed activation across heading cells creates an interpolated output. For example, if an input heading falls halfway between headings encoded by two cells, then each of those cells will probably survive the competition and contribute approximately half of their information to the rest of the network. Activity in the surviving heading cells is also normalized so that the heading representation is a weighted average of active heading cells. Heading cell activity is thus calculated during performance as follows:

\[
H_l^* = \sum_{ki} N_{ki} h_{kil}, \qquad
H_l^{**} = \begin{cases} 0 & \text{if } H_l^* < \max_m [H_m^*] - \Gamma \\ H_l^* & \text{otherwise,} \end{cases} \qquad
H_l = \frac{H_l^{**}}{\sum_m H_m^{**}}, \tag{3.8}
\]
where Γ is a threshold parameter determining how many cells survive the competition. For the simulations, Γ was set to maxm[Hm*]/15. In words, the heading map cell activities Hl are a contrast-enhanced version of their total inputs Hl*. Other methods, such as the softmax of Bridle (1989), could be used to provide the same functionality as this thresholding process. Note that the sizes of the total inputs to the map cells determine the “neighborhood” during performance, whereas the active neighborhood during training is determined based on proximity to the maximally active cell. This was done because it provided the best map unfolding during training and noise tolerance during performance. Since the trained map is topographically organized, however, the maximally active cells during performance will still typically be neighboring cells in the map. As described thus far, the heading map can only discriminate between headings, not identify heading in terms of absolute azimuth and elevation angles. This is sufficient for the proper formation of the depth and independent motion maps in the system. However, testing the network’s ability to determine heading accurately and using the heading estimate to perform navigation require that retinotopic labels be attached to each cell in the heading map. A simple technique for assigning a retinotopic label to a heading cell with activity Hl is to find the index i of the smallest weight dli projecting from that cell to the retinotopically organized depth map. The smallest weight will be located near the FOE on the retina, which is equivalent to the retinotopic heading. Alternatively, the following technique could be used to assign heading labels in a mobile robot:

1. Move toward a visual cue, such as a point of light, at eye level and adjust the wheels until the image of the light is stationary on the optic
sensor. When the image is stationary, the navigator is heading directly toward the point of light in body-centered coordinates.

2. As the robot approaches the light, rotate the eye so that the light image falls on many different regions of the optic sensor. Since the navigator is heading toward the light in body-centered coordinates, the projected point of light should still be motionless even though it is projected to a different retinal location. However, the retinotopic heading will have changed. The floor and other objects in the scene will create a flow pattern that excites a different cell in the heading map.

3. For each eye position, map the maximally excited heading cell to the current retinotopic location of the imaged point of light.

In the heading map simulations, heading labels were supplied by systematically sampling heading directions and labeling each cell with the heading that maximally excited it. The heading estimate is calculated using the following equation:

\[
H_{MAP} = \sum_l H_l \Phi_l, \tag{3.9}
\]
where Φl is the preferred heading of the lth heading map cell, and Hl is determined using equation 3.8. This labeling procedure was done only so the accuracy of the self-organizing map could be compared against other reported models. In the navigation simulations, all navigation took place in a retinotopic coordinate system so the heading labels were applied using the smallest weight technique. Because the system bases its heading estimate on a flow field where only the rotational component due to self-generated eye movements has been removed, it has an advantage over many other heading estimation algorithms during curvilinear motion. This can be seen in the following example. Imagine that the navigator is following the circumference of a circle. Algorithms designed to ignore rotational components of the flow field without using extraretinal information cannot distinguish whether the rotational component of the flow field is due to eye movement or the curvilinear body movement. Therefore, during curvilinear motion, these algorithms will ignore the rotational component of the motion and always report that the navigator is heading straight forward. In this case, however, “straight forward” is defined in a coordinate system that is rotating as the navigator proceeds around the circle. In other words, the heading estimation algorithm will be unable to distinguish whether it is indeed heading straight or in a circle. In contrast, the current architecture will recognize that the rotational part of the optic flow due to the curvilinear motion is not associated with an internally generated eye motion. This unremoved rotational flow component will bias the heading estimate in the direction of the curve, so that the navigator will be aware that it is constantly drifting away from “true forward motion.”
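Pulling the pieces together, the normalization stage, the training-time SOFM update, and the performance-time readout (equations 3.5 through 3.9) can be sketched as follows. This is a schematic NumPy rendering under assumed shapes and our own function names; the square training neighborhood and the Γ = max/15 rule follow the text.

```python
import numpy as np

def normalize(W):
    """Equation 3.5: rectified, per-floret normalized translational flow.
    W: (n, m) translation field activities."""
    Wp = np.maximum(W, 0.0)
    norm = np.sqrt((Wp ** 2).sum(axis=0, keepdims=True))
    return Wp / np.where(norm > 0.0, norm, 1.0)   # zero flow stays zero

def sofm_train_step(N, h, coords, half_width, alpha2):
    """Equations 3.6-3.7. N: flattened normalized flow pattern;
    h: (L, len(N)) weights; coords: (L, 2) map-grid position of each of
    the L heading cells; half_width sets the square neighborhood."""
    H_star = h @ N
    winner = np.argmax(H_star)
    inside = np.all(np.abs(coords - coords[winner]) <= half_width, axis=1)
    H = inside / inside.sum()           # active cells share level 1/(N+1)
    h += alpha2 * H[:, None] * (N[None, :] - h)   # equation 3.7
    return h

def heading_estimate(N, h, Phi):
    """Equations 3.8-3.9: thresholded, normalized heading cell activities
    and the interpolated readout. Phi: (L, 2) heading labels."""
    H_star = h @ N
    Gamma = H_star.max() / 15.0
    H2 = np.where(H_star >= H_star.max() - Gamma, H_star, 0.0)
    H = H2 / H2.sum()
    return H @ Phi                      # equation 3.9
```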
3.6 Depth Map. In addition to providing heading information, optic flow can be used to determine the distance to the object at each retinal location (if one is present) in a scene. For visual navigation, it suffices to form a distance measure that is scaled by the speed of the observer, for example, an inverse time-to-collision measure p‖T‖ for each retinal location, where p is the inverse depth, or proximity, of an object, as in equation 2.1, and ‖T‖ is the translation speed of the observer. The general problem of determining scene depth from a sequence of images is a difficult one, largely due to the unknown rotational component of the flow field and the unknown heading direction. Solutions have been proposed (e.g., Koenderink & van Doorn, 1987; Longuet-Higgins, 1981; Prazdny, 1980; Waxman & Ullman, 1985; Waxman & Wohn, 1988; see Simpson, 1993, for a review), but these solutions typically require rather complex calculations. In contrast, once the rotational component of the flow field is removed and heading direction is known, as in the current network, calculating the time to collision at each retinal position is relatively straightforward (see also Perrone, 1992). By removing the rotational flow component from equation 2.1, the optic flow generated by translational motion alone reduces to:

\[
v_{Tx} = p\|T\|(-f T_x' + x T_z'), \qquad v_{Ty} = p\|T\|(-f T_y' + y T_z'), \tag{3.10}
\]
where vTx and vTy are the x and y components of the translational flow field, f is the focal length of the imaging system, (Tx′, Ty′, Tz′) are the components of a unit vector that defines the direction of translation, and (x, y) specifies retinal location. From equation 3.10, one can derive the relationship between the magnitude of the flow vector at a retinal location, vT, and time to collision:

\[
v_T = p\|T\|\sqrt{(-f T_x' + x T_z')^2 + (-f T_y' + y T_z')^2}. \tag{3.11}
\]

Note that for a specific heading (Tx′, Ty′, Tz′) and motion sampling position (x, y) on the retina, the square root term in equation 3.11 is a constant, which we can rename kTxy. In the current simulations, which involve only two directional cells at each retinal location of the translation field, the function g that determines the magnitude of the translational flow velocity at the ith retinal location is:

\[
g(W_i) = \sqrt{W_{1i}^2 + W_{2i}^2} = v_T. \tag{3.12}
\]

In order to produce a depth measure that is invariant across all headings and retinal locations, kTxy must be removed from the flow speed measure vT. This is accomplished in the system by logarithmically compressing the flow speed represented at the translation field before passing it to the cell representing the corresponding retinal location in the depth map, then subtracting off an average value of this compressed flow speed. Specifically,
the retinotopically organized depth map cell activities are governed by the following equation:

\[
D_i = \log[g(W_i)] - \sum_l H_l d_{li}, \tag{3.13}
\]
where dli is the weight projecting from the lth heading map cell to the ith depth map cell. Tal and Schwartz (1994) have demonstrated that a logarithmic relationship between neuron firing rate and input activity is a property of integrate-and-fire neurons. This suggests that the logarithmic processing in equation 3.13 could also be implemented in a biological system. Using a VAM learning mechanism, the weight dli learns to represent the average of the compressed flow velocity signal at the ith retinal location when heading in the direction coded by the lth heading map cell. The learning equation governing the weights dli is:

\[
\frac{dd_{li}}{dt} = \alpha_3 D_i H_l, \tag{3.14}
\]
where α3 is a learning-rate parameter that was set equal to α2 for the simulations. Training the network according to equations 3.13 and 3.14 during random movements leads to depth map cell activities, each of which codes the following depth measure for objects at the corresponding retinal location:

\[
D_i = \log v_T - \overline{\log v_T}
    = \log(p\|T\| k_{Txy}) - \overline{\log(p\|T\| k_{Txy})}
    = (\log p\|T\| + \log k_{Txy}) - (\overline{\log p\|T\|} + \log k_{Txy})
    = \log p\|T\| - \overline{\log p\|T\|}, \tag{3.15}
\]
where the term \(\overline{\log p\|T\|}\) is a constant determined by the environmental experience of the network during learning. For correct operation, the system should be trained in an environment that consists of objects in the same depth range as the environment it will encounter during performance so that an appropriate value of this constant is learned. A large, positive cell activity in the depth map corresponds to a short time to collision; these large cell activities identify the retinal locations of nearby obstacles for use by the navigation module. This depth measure has several advantages. First, it is easily learned and calculated in the network using the VAM mechanism already described. Second, it is invariant with respect to retinal position. In particular, the same value of the depth measure corresponds to the same time to collision regardless of the retinal location, even though objects at the same depth generate different velocity signals at different places on
Figure 3: Speed measure produced by equation 3.16 as a function of translational flow angle and number of cells in the floret. The speed measure is scaled so that the maximum output for each collection of cells is 1.0. More cells reduce the angular dependence of this flow speed measure.
the retina. Third, logarithmic processing leads to a depth measure that is compressed such that nearby objects garner a larger percentage of the cell’s dynamic range. This allows the depth map to represent proximity of nearby objects more accurately than distant objects, which is a useful property for avoiding collisions. Finally, because the depth map encodes an inverse time-to-collision measure p‖T‖, a large signal means a short time to collision. Therefore, the most salient signals in the depth map are those that present the most danger. A larger number of directional cells at each retinal location (as would be expected in motion processing pathways in vivo) enables a more biologically plausible function g than the one defined by equation 3.12 to determine the magnitude of translational flow velocity:

\[
g(W_i) = \sum_k [W_{ki}]^+. \tag{3.16}
\]
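The behavior of this speed measure is easy to probe numerically. The short sketch below (our construction) applies equation 3.16 to the floret responses of equation 3.1 and prints the min/max ratio of g over all flow directions; the ratio approaches 1 as the floret gains cells, in line with Figure 3 and the paragraph that follows.

```python
import numpy as np

def floret_speed(v, phi, n):
    """Equation 3.16: speed measure g as the sum of rectified responses
    of an n-cell floret (equation 3.1) to flow (v, phi)."""
    k = np.arange(1, n + 1)
    return np.maximum(v * np.cos(phi - 2.0 * np.pi * k / n), 0.0).sum()

angles = np.linspace(0.0, 2.0 * np.pi, 360, endpoint=False)
for n in (4, 8, 16, 32):
    g = np.array([floret_speed(1.0, a, n) for a in angles])
    print(n, round(g.min() / g.max(), 3))   # ratio approaches 1 as n grows
```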
When there is a small number of directionally tuned cells at each retinal location, this measure depends on the direction of the flow vector for a given translational flow speed. Larger numbers of directionally tuned cells reduce this dependence (see Figure 3).

3.7 Independent Motion Map. When one is sitting still, it is a common experience to have one’s attention drawn to a moving object. This is not
surprising since the isolated motion signal uniquely identifies the location of the object. During locomotion, however, large optic flow signals fill the entire retina, yet independently moving objects may still retain their perceptual saliency. For example, the optic flow velocities in peripheral vision can be quite large when one is driving down a street, yet the location of a moving vehicle is still salient because the optic flow signal it creates differs from the pattern of flow generated by surrounding stationary objects. The independent motion map in Figure 2 is formed by a VAM mechanism, similar to the one used for estimating depth, that detects independently moving objects. Whereas the depth VAM compares the magnitude of a flow signal to a learned average, the motion VAM compares the direction of flow with an expected pattern of flow. The net input to a cell in the independent motion map is:

\[
I_{ki} = N_{ki} - \sum_l H_l m_{lki}, \tag{3.17}
\]
where Nki is the excitation from the corresponding cell in the normalization field and the mlki are adaptive weights tuned to represent the expected flow field when heading in the direction represented by heading map cell output Hl. The cell outputs Mki in the independent motion map are calculated from this input as follows:

\[
M_{ki} = \begin{cases} I_{ki} & \text{for } \sum_k I_{ki} > \Lambda \\ 0 & \text{for } \sum_k I_{ki} \le \Lambda, \end{cases} \tag{3.18}
\]

where

\[
\Lambda = \max_i \left( \sum_k I_{ki} \right) - 0.1. \tag{3.19}
\]

This is a dynamic threshold based on the maximum disturbance in the independent motion map that has the effect of maintaining only the activities of pixel locations with approximately the same level of activity as the maximally active location. Activity in the independent motion map is the vector difference between the current normalized flow vector and the expected normalized flow vector. The length of this vector is reflected in the total activity of the florets at a given pixel location of the independent motion map. A flow vector that differs by 180 degrees will have maximal activity, while a flow vector that differs only slightly will have very little activation. If there is any noise in the system, it will unavoidably appear as activity in the map. The threshold, Λ, is introduced as a simple way to “clean up” the independent motion
330
Seth Cameron, Stephen Grossberg, and Frank H. Guenther
map. Its chosen value can be flexible depending on the needs of the navigator. We expect that, as designed, the independent motion map would be most useful with a fairly high threshold. In this way, the robot’s attention would be drawn only to pixel locations with significant deviation from the expected flow direction. Of course, this means that the robot might miss slowly moving objects or objects that match its heading. However, without some threshold there will be continuous low-level activity across the entire map that might be distracting. The weights mlki are adjusted as follows:

\[
\frac{dm_{lki}}{dt} = \alpha_4 I_{ki} H_l, \tag{3.20}
\]
where α4 is a learning-rate parameter that was set equal to α2 for the simulations. This learning law leads to weights mlki that represent the average directional flow pattern seen at retinal location i when moving in the heading direction coded by Hl. After training, if the pattern of flow received from the normalization field corresponds to the expected pattern for heading Hl, then weighted inhibition from Hl will cancel the input from the normalization field, and all activity in the independent motion map will be quenched. However, independently moving objects will likely generate flow directions that differ from the directions expected for the perceived heading and retinal locations. A nonzero output Mki indicates that an independently moving object is located at the retinal location indexed by i since the flow in direction k at this retinal location is significantly larger than the expected flow for the current heading as encoded by the weights mlki. The larger the value of Λ, the larger the angular difference between the normalized flow signal and the expected flow signal must be before it produces a positive activation in the independent motion map. It is possible for an object to avoid detection by the independent motion map if that object is moving toward the observer such that the flow caused by its motion is exactly aligned with the translational flow field. Such an object would still be avoided by the current system since the faster flow would be registered as a shorter time to collision in the depth map, allowing the robot to navigate around the object before it gets too close. Also, noise in the flow field can lead to false positives at the independent motion map. However, these false positives will typically last for only a few time steps and can be largely eliminated by averaging over time. Although learning in the weights projecting to the heading, depth, and independent motion maps occurs concurrently, correct values of the independent motion map and depth map weights depend on the heading directions coded by cells in the heading map. These weights thus do not stabilize until shortly after the weights projecting to the heading map have stabilized.
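Both VAM stages admit a compact sketch. The code below is our schematic rendering of equations 3.12 through 3.14 (depth) and 3.17 through 3.20 (independent motion), again using unit-time Euler steps; the small constant inside the logarithm, which guards against zero flow at the FOE, is our addition and not part of the model.

```python
import numpy as np

def depth_step(W, H, d, alpha3=0.01):
    """Depth map activity and weight update (equations 3.12-3.14).
    W: (2, m) translation field; H: (L,) heading activities; d: (L, m)."""
    g = np.sqrt((W ** 2).sum(axis=0))       # flow speed, equation 3.12
    D = np.log(g + 1e-12) - H @ d           # equation 3.13 (epsilon added
                                            # to avoid log(0) at the FOE)
    d += alpha3 * np.outer(H, D)            # equation 3.14
    return D, d

def motion_step(N, H, mw, alpha4=0.01):
    """Independent motion map with its dynamic threshold (equations
    3.17-3.19) and weight update (equation 3.20). N: (n, m) normalized
    flow; mw: (L, n, m) expected-flow weights."""
    I = N - np.tensordot(H, mw, axes=1)     # equation 3.17
    total = I.sum(axis=0)                   # summed disturbance per location
    Lam = total.max() - 0.1                 # equation 3.19
    M = np.where(total > Lam, I, 0.0)       # equation 3.18
    mw += alpha4 * np.einsum('l,ki->lki', H, I)   # equation 3.20
    return M, mw
```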
3.8 Navigation Module. Together, the heading, depth, and independent motion maps provide a representation of the visual world that can be used to navigate around obstacles toward stationary or moving targets. Goal-based navigation in the presence of obstacles is carried out in the simulations reported here using a very simple approach-avoidance algorithm. We do not pose this algorithm as a new or creative solution for navigation; rather, we use it simply to show that the maps built up by the system are robust enough to allow successful navigation in the presence of large amounts of sensor noise even with a relatively “stupid” navigation algorithm. For simplicity, the navigator was limited to fixed-speed motion on the ground plane (‖T‖ = 1), thus requiring the generation of only an azimuthal translation command, Tx. Navigation toward a target involves keeping the output of the heading map HMAP, defined in equation 3.9, equal to the direction of the goal G in retinal coordinates. If the target and heading do not match, then a nonzero difference vector generates motor commands that adjust the navigator’s translation Tx toward the target,

\[
T_x(t + 1) = T_x(t) + \varepsilon (G - H_{MAP}), \tag{3.21}
\]
where ε is a gain factor set to 0.02. Translational commands were limited to −45◦ < Tx < 45◦. Obstacles are ignored until one becomes a threat by registering a short time to collision, as indicated by cell activity in the depth map, defined by equation 3.13, greater than a threshold value β. A value of β = 3.5 was used in the simulations reported here. The center of mass Cx of these suprathreshold cells in the depth map is calculated, and the heading is altered to move away from this center according to the following algorithm:

\[
\text{if } H_{MAP} > C_x \text{ then } T_x(t + 1) = T_x(t) + \delta; \qquad
\text{if } H_{MAP} < C_x \text{ then } T_x(t + 1) = T_x(t) - \delta, \tag{3.22}
\]

where δ is a small, positive constant set to 0.2 in these simulations. The navigator veers away from the looming obstacle until it is no longer a threat. Once clear, the approach signal regains dominance, and the navigator resumes progress toward the goal.
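A minimal rendering of this approach-avoidance rule is given below. The parameter values follow the text; the activity-weighted center of mass is our reading of “center of mass,” since the text does not spell out the weighting.

```python
import numpy as np

def steer(Tx, H_map, G, depth, azimuths, beta=3.5, eps=0.02, delta=0.2):
    """One steering update following equations 3.21-3.22. H_map and G are
    the heading estimate and goal direction in retinal coordinates; depth
    holds depth map activities at the retinal azimuths in `azimuths`."""
    threat = depth > beta
    if threat.any():
        a = depth[threat]
        Cx = (azimuths[threat] * a).sum() / a.sum()    # center of mass
        Tx = Tx + delta if H_map > Cx else Tx - delta  # equation 3.22
    else:
        Tx = Tx + eps * (G - H_map)                    # equation 3.21
    return float(np.clip(Tx, -45.0, 45.0))             # command limits
```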
4 Simulations

4.1 Training the System. An important advantage of this system over previous ones for heading perception in mobile robots is its ability to self-organize through an action-perception cycle rather than requiring teaching signals and supervised learning. The system is trained by randomly generating rotational and/or translational movements, then using the combination of eye velocity information and the flow pattern resulting from the movements to tune the parameters in the translation field and the heading
map. These learning processes are detailed in the following paragraphs. Throughout training and testing, optic flow information was corrupted by varying amounts of three types of noise:

1. Directional noise. Each flow vector is perturbed by a uniform randomly distributed angle between plus or minus the amount of directional noise.

2. Speed noise. Each flow vector is multiplied by a uniform randomly distributed number between 0 and 2.

3. Aperture noise. Each flow vector is perturbed by a uniform randomly distributed angle between plus or minus the amount of aperture noise; then the magnitude of the flow signal is reduced by the cosine of the angular difference between the original and perturbed vectors. Aperture noise attempts to model the uncertainty of using local flow detectors to measure the motion of a luminance edge.

Our use of noise that is uncorrelated across pixel locations is only a rough approximation to the aperture noise that would arise in a real vision system, which could lead to higher error levels than those seen in the simplified simulations reported here. During training and performance, simulated objects in the field of view were placed randomly at distances between 1 and 200 units from the navigator. The units are relative to a navigator focal length of 1 and a navigator speed of 1.

4.2 Translation Field. The weights projecting to the translation field from the eye velocity cells learn to cancel the rotational optic flow generated by eye movements. The easiest way for this to happen is to generate random eye motions without any translational movement (imagine an infant scanning around a room before it can crawl). Therefore, learning in the translation field was carried out before learning began in the higher stages of the model. Under these circumstances, a fast learning rate may be used, and the system is completely trained after only 20 to 30 random eye movements. Figure 4A shows the weights projecting from each eye velocity vector cell to the floret of cells at one retinal location after training with purely rotational movements of the eyes. To illustrate more clearly the pattern of weights projecting from each eye velocity vector cell, 12 directional cells were used in each floret of the translation field in this simulation. If necessary, the weights may also be trained in the presence of translational movements. Random translations have an effect similar to noise on the desired training signal. In an actual implementation, noise may also result from limitations of sensor arrays such as those due to the aperture problem (Wallach, 1976; Perrone, 1992), although this problem may be reduced by motion preprocessing stages that convert aperture ambiguities
Figure 4: Learning at the translation field. (A) Inhibitory weights wjki in equation 3.4 projecting from eye velocity signals after training with purely rotational eye movements. Each of the four curves on the plot indicates the weight values projecting from one of the four eye velocity vector cells to all of the floret cells at one retinal location. An eye rotation corresponding to the activation of a single eye velocity vector cell results in a flow pattern that takes a cosine shape across each floret of cells (see equation 3.1). The cosine shapes of these curves thus indicate that the weights have successfully learned to cancel the flow field components due to eye rotations as reflected at the eye velocity vector. (B) Noise tolerance while learning eye velocity parameters without noise (left) and with ±45 degrees of random directional noise (uniformly distributed) added to the motion detector field input. Percentage error is measured as the amount of residual activity at the translation field during a rotational movement divided by the amount of activity that would occur without rotational nulling.
into coherent representations of object speed and direction (Chey, Grossberg, & Mingolla, 1996). A slower learning rate is required for stable learning with noise. Still, the system can learn to cancel the effects of eye rotations on the flow field in the presence of noise with a relatively small number of movements, requiring on the order of 500 randomly generated eye movements to tune the parameters. Such a simulation is summarized in Figure 4B, which shows how the error decreases as training proceeds in the noiseless case and with ±45 degrees of random (uniformly distributed) directional noise added to each cell’s input at the motion detector field.

4.3 Heading Map. Like learning at the translation field, learning at the heading map is carried out during an action-perception cycle. Here, however, the goal is to train the system to use the translational component of the flow field to determine heading direction. This is done by randomly generating translational movements of the eye (as if it were mounted on a moving body), then using the resulting translational flow field to self-organize a map representation of heading direction. As noted in section 3, the heading map is a variant of a SOFM. During learning, the cells in the heading map spread out to code different heading directions. The angular separation of neighboring heading cells will depend on the number of cells in the map and the statistical distribution of heading directions sampled during learning (Kohonen, 1982). This provides the map with the desirable feature of efficiently allocating its resources toward more commonly experienced input patterns. For example, when the system is trained on a body that spends most of its time moving forward (as would be expected for most animals), it develops an accuracy bias as illustrated in Figure 5A. The heading map on the left is the result of training on a set of 2000 movements with headings distributed uniformly between ±45 degrees in both azimuth and elevation. The map on the right was trained on a set of headings biased toward small deviations from straight ahead. The tightly grouped heading cells in the center of the map (corresponding to straight-ahead movement) allow the map to code these directions more accurately than more peripheral directions. In this regard, Crowell and Banks (1993) noted that people are more accurate at judging headings with small eccentricities (forward) than those with large eccentricities (sideways), and similar degradation of heading detection as the FOE moves away from the fovea was noted by Warren and Kurtz (1992). When trained with a distribution of headings as shown in the right side of Figure 5A, the system develops a similar accuracy bias. As noted for learning in the translation field, it is important for the heading map to be tolerant of noise in the direction and speed of local optic flow signals. Figure 5B shows that the heading map is still able to organize topographically when trained with noisy optic flow information. In these “aperture noise” simulations, the effects of the aperture problem were
Figure 5: Unfolding of the heading map. (A) Example heading maps with uniform heading sampling (left) and sampling biased toward straight ahead (right). When the distribution of training samples is biased toward straight ahead, the distribution of map cells is more concentrated for movements near straight ahead, resulting in more accurate heading estimates for these movement directions. (B) Heading map after 2000 training steps for three levels of simulated aperture error in the motion detection field.
simulated by randomly perturbing each perceived flow vector by an angle uniformly distributed between ±0, 40, and 90 degrees and then reducing the magnitude of each flow vector by a factor equal to the cosine of the difference between the perturbed and actual angles. Two thousand randomly generated movements with headings between ±45 degrees for azimuth and elevation were used to train the network. In all noise simulations, the same level of noise was used for both training and testing of the network. Figure 6A shows heading estimate accuracy under various kinds and amounts of noise in the optic flow signal. To allow comparisons with the
Figure 6: Heading estimation under various noise conditions. (A) Error in heading judgment plotted as a function of the amount of directional noise, speed noise, and aperture noise in the flow field for a network trained with headings between ±25 degrees. (B) Example of heading error for 0 degree noise and 40 degree aperture noise for a network trained with headings between ±45 degrees (top). The circles mark the actual heading direction, and the plus signs mark the estimated heading direction.
simulations of Hatsopoulos and Warren (1991), training was carried out on headings between ±25 degrees in azimuth and elevation, and error was averaged over test headings between ±20 degrees. The directional noise simulations indicate that the network performs with about 0.75 degree average error with no directional noise and 3.9 degrees average error with ±90 degrees directional noise. This is comparable to both human performance and
the performance reported in Hatsopoulos and Warren (1991), which showed average errors of approximately 0.8 degree for no noise and 3.4 degrees for ±90 degrees noise. The network’s performance under aperture noise (see bottom of Figure 6A) is similar to the directional noise results. Finally, a simulation using “speed noise” (an increase or decrease of the magnitude of the flow vector), reported in Figure 6A, shows that speed noise has little effect on the network’s performance, again in keeping with psychophysical results showing that speed noise has little effect on the ability of humans to determine heading. Some comments should be made regarding these comparisons. First, the current network achieves comparable results despite using a learning scheme that trains from randomly generated headings, as compared to the learning scheme of Hatsopoulos and Warren, which requires training the network using optic flows that correspond to known headings. Second, the current network works in the presence of eye rotations due to the learned removal of rotational flow components at the translation field; the Hatsopoulos and Warren model does not work in the presence of eye rotations. Although the results reported here use a slightly larger retina (7 × 7 versus 5 × 5) and more cells for heading detection (49 versus 25), the number of input patterns required to train the network adequately is less in the current system than in the Hatsopoulos and Warren (1991) network (2000 versus 4000). Figure 6B shows a graphical representation of heading estimation performance under two different amounts of aperture noise, this time for a network trained with headings between ±45 degrees. The open circles denote an arbitrarily selected grid of sample headings. The plus marks plot the responses of the heading map. A possible criticism of these simulations is that the aperture problem is not simply uniform random noise but instead is systematic in nature for objects that span many retinal locations and thus may cause errors that do not simply average out as they appear to in the previous simulations. The results of a simulation designed to test further the system’s ability to deal with the aperture problem are illustrated in Figure 7. Several changes distinguish this simulation from those presented in Figure 6. First, in previous simulations, the scenes consisted of point objects placed at random depths for each pixel location. A flow signal was calculated for each pixel location, and then various types of random noise were added to the flow field. Here, larger rectangular objects replace the point objects. A more realistic aperture effect is simulated by assuming that the flow detectors can detect activity only at luminance boundaries and, due to the aperture effect, can detect only the component of flow that is perpendicular to the luminance edge. At corners, the ambiguity of the aperture effect is removed because both the horizontal and vertical flow detectors can respond. Second, because most of the field now contains no information, the network was modified so that no learning takes place at pixels without activity in the translation field. Third, the retina was increased to 20 × 20 pixels so that objects with long,
338
Seth Cameron, Stephen Grossberg, and Frank H. Guenther
Figure 7: Simulation providing a more realistic treatment of the aperture problem. (Top) Example scene where aperture noise is correlated to the viewed objects. The objects are squares of various dimensions placed at random depths averaging 100 units away. It is assumed that the optic flow detectors can detect only the component of the flow field that is perpendicular to the luminance edge. The “x” marks the actual heading and the open circle marks the heading predicted by the heading map. (Center) Scene with correlated aperture noise plus ±90 degrees directional noise. The “x” marks the actual heading direction and the open circle marks the heading estimated by the heading map. (Bottom) Plot of average heading map error versus amount of directional noise added in addition to the systematic aperture noise error induced by shapes in the scene.
straight edges could be observed. Fourth, the learning rate and neighborhood shrinking rate were reduced by one-half. Finally, the number of trials was increased from 2000 to 8000. This last change reflects the fact that for any given trial, three-fourths of the pixels are not receiving information and therefore cannot learn. With these new scenes, it simply takes longer for each pixel to experience enough training data. The top panel of Figure 7 is an example of the more realistic scenes and aperture effect used to train and test the network. The center panel shows a typical training scene with the aperture effect and an additional ±90 degree directional noise. The bottom panel shows a plot of the average error in heading estimation as an increasing amount of directional noise is added in addition to the systematic aperture effect error induced by shapes in the scene. For zero added directional noise, the average heading error is about ±1.8 degrees, indicating that the network deals very well with the more realistic aperture effect. The average heading estimation error with an additional ±90 degrees of aperture noise increases to only ±2.7 degrees. 4.4 Depth Map. Figure 8 shows the effect of directional flow noise on relative depth estimation. The optic flow field used in these simulations had 49 cells arranged in a 7 × 7 grid. A random depth was selected for objects at each of the 49 retinal locations. Then movements were made toward those objects along 36 randomly chosen headings between ±37 degrees in both azimuth and elevation. Because the weights learned for the depth map depend on the nature of the flow field experienced during training, it is important that the flow field used to train the map is similar to the flow fields that will be encountered after learning ceases. Specifically, the training flow field should contain flow information corresponding to the typical range of object depths that will be encountered during performance. The first three panels of Figure 8 (the top two and the bottom left) compare the depth measure calculated by the system to the actual depth of the object at each retinal location. (For purposes of comparison, actual depth was processed according to equation 3.15 and scaled to compensate for translational speed.) For each of these simulations, noise was present during both training and performance. The effects of noise during training are quite small relative to the effects of noise during performance; that is, most of the error in the plots is attributable to noise during performance rather than incorrect values of system parameters learned during training. The final plot in Figure 8 (bottom right) shows error versus the amount of directional noise. The system was trained in the absence of noise for this plot. Error was calculated as the difference between the actual object depth and the depth predicted by the network divided by the actual object depth. This was multiplied by 100 to obtain the percentage error, and the result was averaged over all retinal locations and 36 different headings. As indicated in Figure 8, the depth map is unusually robust to noise because the optic flow signals were perturbed in direction but not magnitude. Recall that for
Figure 8: Effect of different levels of directional noise on relative depth estimates. Each of the 49 cells on the 7 × 7 retina estimates the relative depth of the object in its receptive field. The first three plots (the top two and the lower left) show network depth estimates (solid lines) compared to actual relative depths (dotted lines) for three different levels of directional noise presented during both training and performance. The final plot shows the average percentage error of depth estimates as a function of directional noise for a network trained in the absence of noise.
a given heading, object depth depends only on the magnitude of the optic flow signal. Therefore, as long as the heading estimate is reasonably correct (see Figures 6 and 7), the depth estimates will also be correct regardless of directional noise.

If any speed noise is present in the flow field, depth estimation will unavoidably degrade. This is evident from Figure 9, which shows the effect of aperture noise on relative depth estimation. Since aperture noise contains a speed component in addition to the directional component, depth map accuracy degrades significantly more than with directional noise. Nonetheless, the depth map still performs well with 40 degree aperture noise, and later simulations will show that the 100% average error in depth estimates under 90 degree aperture noise is still adequate for navigation because multiple
Figure 9: Effect of different levels of aperture noise on relative depth estimates. Each of the 49 cells on the 7 × 7 retina estimates the relative depth of the object in its receptive field. The first three plots (the top two and the lower left) show network depth estimates (solid lines) compared to actual relative depths (dotted lines) for three different levels of aperture noise presented during both training and performance. The final plot shows the average percentage error of depth estimates as a function of aperture noise for a network trained in the absence of noise.
snapshots of the scene, taken as the navigator moves, tend to average out the unwanted effects of noise over time. Performance could be improved further if predictions about the continuity of surfaces in the field of view were used to perform neighborhood smoothing operations on the output of the depth map, but this topic is beyond the scope of this article. It should also be noted that the worst depth estimation errors occur for objects near the FOE, where optic flow signals tend to be very small. This is another inherent problem with using optic flow for depth estimation. One technique that can help alleviate this problem for real-world scenes is to ignore retinal locations with very small depth weights and fill in depth estimates by averaging estimates from neighboring cells.
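The fill-in heuristic just described can be sketched directly; this is our illustration, with an assumed weight threshold and array names:

```python
import numpy as np

def fill_small_weight_depths(depth, weights, thresh=1e-3):
    """Replace unreliable depth estimates (very small depth weights,
    e.g., near the focus of expansion) with the average of their valid
    neighbors on the 7 x 7 retinal grid.

    depth, weights : (7, 7) arrays; `thresh` is an assumed cutoff.
    """
    filled = depth.copy()
    rows, cols = depth.shape
    for i in range(rows):
        for j in range(cols):
            if weights[i, j] < thresh:
                # Average over the valid 8-neighborhood.
                neighbors = [depth[a, b]
                             for a in range(max(i - 1, 0), min(i + 2, rows))
                             for b in range(max(j - 1, 0), min(j + 2, cols))
                             if (a, b) != (i, j) and weights[a, b] >= thresh]
                if neighbors:
                    filled[i, j] = np.mean(neighbors)
    return filled
```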
4.5 Independent Motion Map. Figures 10 and 11 illustrate the performance of the independent motion map. The scene consists of a tiled ground plane, a distant wall, and a box floating above the ground plane between the observer and the wall. Each panel shows the visual field as seen by the network’s retinal array. Objects in the visual field are indicated by dotted lines. Arrows in the left panels indicate flow components in the translation field, and arrows in the right panels indicate flow components at the independent motion map after thresholding according to equations 3.18 and 3.19.

Figure 10 indicates performance at the independent motion map in the absence of noise. In Figure 10A, the navigator moves forward, while the box remains stationary. Since optic flow in the scene is commensurate with the flow expected by the activated heading map cells, all activity in the independent motion map is quenched. In Figure 10B, the navigator again moves forward, but this time the box is moving independently to the left. The flow generated by the box does not fit the flow pattern established by the rest of the scene, so it is singled out in the independent motion map. Figure 10C shows an example of detecting the same box motion while the navigator moves at 37 degrees to the right instead of straight ahead.

One might note that the direction coded by the active cells in the independent motion map corresponds only roughly to the actual motion direction of the object with respect to the navigator. This highlights the fact that this map is primarily useful for identifying the retinal locations of moving objects rather than their direction of movement relative to the navigator. The directional error arises because the motion network detects mismatches between the expected direction and the incoming flow direction. The difference of these two directions will point approximately in the actual direction of the moving object. However, the exact calculation of direction requires knowledge of both the depth of the moving object and the navigator’s speed so that the component of optic flow due to object motion may be separated from the component due to self-motion.

Figure 11 illustrates independent motion map performance under various noise conditions. Figures 11A and 11B illustrate the performance of the independent motion detector in the presence of ±40 degree and ±90 degree aperture noise, respectively. Noise was present during both training and performance. The network successfully detects the moving box in both conditions, but it erroneously detects motion at several retinal locations in the ±90 degree condition. The independent motion detector is the part of the network most sensitive to directional noise, since it relies on the accuracy of optic flow directions at each retinal location and cannot average over the entire retina. Directional noise can be countered to some degree by choosing a higher value of the threshold in equation 3.18, but raising the threshold also increases the chances of missing a moving object. Figure 11C shows the performance of the independent motion map in the presence of 100% speed noise. Because both the heading and independent
Figure 10: Independent motion detection with no noise. Each panel shows the visual field as seen by the network’s retinal array. Arrows in the left panels indicate flow components in the translation field, and arrows in the right panels indicate flow components at the independent motion map after thresholding. (A) The navigator is moving straight ahead over a tiled ground plane. The suspended box is stationary with respect to the ground plane, so no activity remains in the independent motion map. (B) The navigator is moving straight ahead over a tiled ground plane. The suspended box is moving to the left. The flow field at the retinal location that corresponds to the box does not match the expected field, so activity at that location pops out in the independent motion field. (C) Here the navigator is moving at 37 degrees to the right while the box still moves to the left, indicating that independent motion may be detected for a wide range of navigator headings.
Figure 11: Independent motion detection under various noise conditions. (A) Example of independent motion map performance when the incoming flow field is randomly perturbed by ±40 degree aperture noise. (B) At ±90 degree aperture noise, some errant vectors survive the threshold along with the moving box. (C) Performance of the independent motion map in the presence of 100% speed noise. Independent motion detection is essentially unaffected by perturbations in the magnitude of the flow signals.
motion maps are essentially unaffected by perturbations in the magnitude of optic flow signals, the moving box is easily detected in the speed noise condition. Faster-moving objects perturb the optic flow signals more than slower-moving objects and are therefore easier to detect.
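As an illustration of the detection step, the following sketch flags retinal locations whose flow direction disagrees with the flow expected from the heading estimate. It is a stand-in for the thresholding of equations 3.18 and 3.19, which are defined earlier in the article and not reproduced here, so the threshold value and all names are our assumptions:

```python
import numpy as np

def independent_motion_map(observed, expected, angle_thresh=0.5):
    """Flag locations whose optic flow direction disagrees with the flow
    expected from the current heading estimate.

    observed, expected : (N, 2) arrays of flow vectors.
    angle_thresh       : directional mismatch threshold in radians
                         (assumed value).
    Returns a boolean mask of suspected moving objects and the residual
    (mismatch) vectors at those locations.
    """
    def angles(v):
        return np.arctan2(v[:, 1], v[:, 0])

    mismatch = angles(observed) - angles(expected)
    mismatch = (mismatch + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    moving = np.abs(mismatch) > angle_thresh
    residual = np.where(moving[:, None], observed - expected, 0.0)
    return moving, residual
```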
Figure 12: (A) Two example paths taken by the approach-avoidance algorithm, viewed from above. The circles represent times when the approach signal was dominant. The “x” symbols represent times when the avoidance signal was dominant. The navigator starts at the bottom and attempts to reach the plus sign while avoiding the blocks. (B) Two example paths taken by the approach-avoidance algorithm in the presence of 90 degree aperture noise. (C) Two example paths taken by the approach-avoidance algorithm in the presence of 100% speed noise.
4.6 Navigation. The simulations shown in Figure 12 demonstrate the utility of the self-organizing heading and depth maps for visual navigation. Figure 12A shows an overhead view of a field of rectangular obstacles and the path that the navigator takes to reach the goal indicated by the plus
Figure 13: Pursuit behavior viewed from above. The first navigator, designated by the open circles, is pursuing the navigator designated by the plus symbols. The “plus” navigator is slightly slower but gets a head start in its attempt to reach the goal. The other navigator is faster and attempts to capture the plus navigator before it reaches its goal. The first navigator (open circles) detects the location of the other navigator using its independent motion map. In both simulations, the open circle navigator is able to overtake the slower plus navigator before it reaches its goal.
sign. The task for the navigator is to start at the bottom center and reach the plus sign at the top of the scene without hitting any obstacles. No high-level path planning was used, only the simple approach-avoidance algorithm described in section 3. The open circles mark places along the path where the navigator was in approach mode, and the “x” symbols mark the places where the navigator was in avoidance mode.

Because the components of the network are robust to noisy optic flow fields, the approach-avoidance calculation based on the network’s output is also robust to noise. Figure 12B shows two example paths taken by the navigator in the presence of 90 degree aperture noise. The main difference between performance in 0 degree noise and 90 degree noise is that at higher noise levels, the navigator occasionally misjudges steps and clips the corners of obstacles. Also, random depth map errors occasionally cause the navigator to veer unnecessarily. Figure 12C shows two example paths taken by the navigator in the presence of 100% speed noise. This level of speed noise necessarily degrades the depth estimates by a comparable amount, and one can see by the x’s that the navigator is sometimes avoiding ghosts. However, on average the ghosts tend to cancel each other, and the network extracts useful information about the true location of the obstacles.

Finally, Figure 13 shows examples of motion pursuit simulations that require all elements of the visual navigation network. Aperture noise of ±30 degrees was used in these simulations. One navigator, designated by
the plus symbol, attempts to reach a goal indicated by a plus sign at the top of the picture. A second navigator, designated by the open circle symbol, is slightly faster and uses the first navigator’s changing location as its goal. The second navigator uses its independent motion map to determine the location of the first navigator. As shown in the two simulations of Figure 13, the second navigator is typically successful in its attempt to overtake the slower navigator.
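The behavior in Figures 12 and 13 can be caricatured as a switch between goal attraction and obstacle repulsion. The sketch below is only a guess at the spirit of the approach-avoidance algorithm of section 3, which is not reproduced here; the nearness threshold and all names are assumptions:

```python
import numpy as np

def steer(goal_dir, nearness, directions, danger=0.2):
    """One step of a simple approach-avoidance rule.

    goal_dir   : unit 2-vector toward the goal (or pursued target).
    nearness   : (N,) depth-map outputs per direction (larger = closer).
    directions : (N, 2) unit vectors for the corresponding directions.
    Returns the new heading and the active mode.
    """
    closest = int(np.argmax(nearness))
    if nearness[closest] > danger:
        heading = -directions[closest]   # avoidance: steer away
        mode = "avoid"
    else:
        heading = goal_dir               # approach: steer toward goal
        mode = "approach"
    return heading / np.linalg.norm(heading), mode
```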
5 Concluding Remarks

The network described in this article was developed primarily as a module for autonomous robot navigation. Its features include self-organization, fast learning, noise tolerance, operation in the presence of eye movements, and reasonable memory and computational demands. This system represents one of the first self-tuning systems to attempt seriously to solve the problem of navigation using optic flow, although its structure and learning principles were inspired by several previous computational and neural models (Hatsopoulos & Warren, 1991; Lappe & Rauschecker, 1993, 1995; Perrone, 1992). Based on the good performance seen in the simplified simulations, it is expected that the system will also perform adequately on real images. Work is in progress to transfer these algorithms to mobile robots. Efforts spent developing a self-tuning system reduce the burden of software rewrites that would otherwise be required to accommodate robot hardware changes.

The current system could be expanded by introducing feedback between the modules. For example, independently moving objects could be detected more accurately by creating a feedback loop that removes pixels containing suspected independently moving objects from the normalization field. Each iteration of the heading estimate would then be less corrupted than the previous one, which would improve the performance of both the depth and motion maps.

Although this was not the primary goal of the current model, many of its cell types also show similarities to MT and MST cell properties. Cells in MT are sensitive to the orientation and velocity of visual stimuli (Rodman & Albright, 1987), as are cells in the motion detection, translation, and normalization fields in the model. Many cells in MST respond maximally to radially expanding patterns and patterns that include expansion and full-field translation, suggesting a role in heading detection and/or depth estimation (Lagae, Maes, Raiguel, Xiao, & Orban, 1994); cells in the heading map are similarly tuned. MST receives strong fiber projections from MT (e.g., Maunsell & Van Essen, 1983; Ungerleider & Desimone, 1986), just as heading map cells in the model receive projections from the normalization field. Cells in MT and MST have complex properties, however, that are by no means completely explained by the current model or other models of heading direction (Lagae et al., 1994; Graziano, Andersen, & Snowden, 1994).
To the extent that the present model does capture brain heuristics, the use of eye velocity information to nullify the rotational component of the flow field suggests that MST cells may change their flow field sensitivities in the presence of eye movements. Bradley, Maxwell, Andersen, Banks, and Shenoy (1996) have presented evidence consistent with this hypothesis by showing that during an active pursuit eye movement, expansion cells shift their preferred focus of expansion in a direction that would compensate for the eye rotation. This shift does not occur during a simulated rotation, suggesting that it is due to extraretinal information.

Finally, it must be noted that optic flow alone can be only an imperfect indication of object motions and distances. For example, no optic flow is available for objects centered at the FOE, and thus the distance of these objects cannot be determined using optic flow until they become close enough to subtend a significant portion of the retina away from the FOE. An independently moving object that is heading directly down the line of sight through the FOE will generate flow vectors that are directionally consistent with the flow vectors caused by the translational motion of the navigator, and such an object will thus not be identified as moving by the independent motion map. However, such an object would correctly register a shorter time to collision in the depth map, thus allowing the navigator to move around it even though it was not identified as independently moving. An independently moving object that covers a large portion of the retina will generate false heading estimates. This is evidenced in humans by the mistaken perception that their stationary car is rolling because the truck next to them is moving slowly in the opposite direction.

Despite these shortcomings, optic flow is a rich and usually reliable source of information that should not be dismissed simply because it is difficult to measure and provides misleading or no information in certain circumstances. The purpose of our architecture is to extract as much information from optic flow as possible using a self-tuning network with relatively uncomplicated computational elements. Our approach distinguishes itself from previous work by providing a unified treatment of heading, depth, and independent object motion within a single architecture. In addition, our approach deals with the effects of self-generated eye motions thoroughly and effectively and has the unique advantage of self-organization. We have shown, using simplified simulations, that our network is able to provide sufficiently robust information to approach targets while avoiding obstacles, even if those targets are moving. This does not mean that our system, or any other system that attempts to process optic flow alone for that matter, is sufficient to allow error-free navigation in all real-world situations. Robust navigation requires the integration of many information sources, and our network has been designed to offer one piece of the puzzle by contributing useful heading, range, and independent motion estimates that a robot may use to aid in tasks such as moving object pursuit and obstacle avoidance.
Acknowledgments

S. C. was supported in part by the Defense Advanced Research Projects Agency (ONR N00014-92-J-4015) and the Office of Naval Research (ONR N00014-91-J-4100, ONR N00014-92-1309, and ONR N00014-95-1-0657). S. G. was supported in part by the Office of Naval Research (ONR N00014-95-1-0409 and ONR N00014-95-1-0657). F. H. G. was supported in part by the Alfred P. Sloan Foundation and the Office of Naval Research (ONR N00014-95-1-0409). The authors thank Diana Myers for her assistance in the preparation of the manuscript.
References

Banks, M. S., Ehrlich, S. M., Backus, B. T., & Crowell, J. A. (1996). Estimating heading during real and simulated eye movements. Vision Research, 36, 431–443.
Bradley, D. C., Maxwell, M., Andersen, R. A., Banks, M. S., & Shenoy, K. V. (1996). Mechanisms of heading perception in primate visual cortex. Science, 273, 1544–1547.
Brenner, E. (1991). Judging object motion during smooth pursuit eye movements: The role of optic flow. Vision Research, 31, 1893–1902.
Bridle, J. S. (1989). Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman-Soulié & J. Hérault (Eds.), Neuro-computing: Algorithms, architectures, and applications. Berlin: Springer-Verlag.
Bullock, D., & Grossberg, S. (1988). Neural dynamics of planned arm movements: Emergent invariants and speed-accuracy properties during trajectory formation. Psychological Review, 95, 49–90.
Chey, J., Grossberg, S., & Mingolla, E. (1996). Neural dynamics of motion grouping: From aperture ambiguity to object speed and direction. Journal of the Optical Society of America A, 14, 2570–2594.
Cornilleau-Peres, V., & Droulez, J. (1993). Stereo-motion cooperation and the use of motion disparity in the visual perception of 3-D structure. Perception and Psychophysics, 54, 223–239.
Crowell, J., & Banks, M. (1993). Perceiving heading with different retinal regions and types of optic flow. Perception and Psychophysics, 53(3), 325–337.
Cutting, J. E. (1986). Perception with an eye towards motion. Cambridge, MA: MIT Press.
Dron, L. (1993). Dynamic camera self-calibration from controlled motion sequences. In Proceedings of the 1993 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 501–506). Los Alamitos, CA: IEEE Computer Society Press.
Gaudiano, P., & Grossberg, S. (1991). Vector associative maps: Unsupervised real-time error-based learning and control of movement trajectories. Neural Networks, 4, 147–183.
Gaudiano, P., Zalama, E., & Lopez-Coronado, J. (1996). An unsupervised neural network for low-level control of a wheeled mobile robot. IEEE Transactions on Systems, Man, and Cybernetics, 26, 485–496.
Gibson, J. J. (1950). The perception of the visual world. Boston: Houghton Mifflin.
Graziano, M. S. A., Andersen, R. A., & Snowden, R. (1994). Tuning of MST neurons in macaque MST. Society for Neuroscience Abstracts, 16, 7.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding. I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S., & Kuperstein, M. (1989). Neural dynamics of adaptive sensory-motor control (Exp. ed.). Elmsford, NY: Pergamon Press.
Hatsopoulos, N. G., & Warren, W. H., Jr. (1991). Visual navigation with a neural network. Neural Networks, 4, 303–317.
Heeger, D. J., & Jepson, A. (1990). Visual perception of three-dimensional motion. Neural Computation, 2, 129–137.
Heikkonen, J., & Oja, E. (1993). Self-organizing maps for visually guided collision-free navigation. In Proceedings of 1993 International Joint Conference on Neural Networks (Vol. 1, pp. 669–672). New York: IEEE.
Hildreth, E. C. (1992). Recovering heading for visually-guided navigation. Vision Research, 32, 1177–1192.
Hildreth, E. C., Grzywacz, N. M., Adelson, E. H., & Inada, V. K. (1990). The perceptual buildup of three-dimensional structure from motion. Perception and Psychophysics, 48, 19–36.
Holt, R. J., & Netravali, A. N. (1991). Camera calibration problem: Some new results. CVGIP: Image Understanding, 54, 368–383.
Koenderink, J. J., & van Doorn, A. J. (1975). Invariant properties of the motion parallax field due to the movement of rigid bodies relative to an observer. Optica Acta, 22, 773–791.
Koenderink, J. J., & van Doorn, A. J. (1987). Facts on optic flow. Biological Cybernetics, 56, 247–254.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. Journal of Neurophysiology, 71, 1597–1626.
Lappe, M., & Rauschecker, J. P. (1993). A neural network for the processing of optic flow from ego-motion in man and higher mammals. Neural Computation, 5, 374–391.
Lappe, M., & Rauschecker, J. P. (1995). Motion anisotropies and heading detection. Biological Cybernetics, 72, 261–277.
Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293, 133–135.
Longuet-Higgins, H. C., & Prazdny, K. (1981). The interpretation of moving retinal images. Proceedings of the Royal Society of London B, 208, 385–397.
Luong, Q.-T., & Faugeras, O. D. (1992). Self-calibration of a camera using multiple images. In Proceedings of the 11th IAPR International Conference on Pattern Recognition (Vol. 1, pp. 9–12). Los Alamitos, CA: IEEE Computer Society Press.
Maunsell, J. H. R., & Van Essen, D. C. (1983). The connections of the middle temporal visual area (MT) and their relationship to a cortical hierarchy in the macaque monkey. Journal of Neuroscience, 3, 2563–2586.
Monen, J., & Brenner, E. (1994). Detecting changes in one’s own velocity from the optic flow. Perception, 23, 681–690.
Orban, G. A., Lagae, L., Raiguel, S., Xiao, D.-K., & Maes, H. (1995). The speed-tuning of medial superior temporal (MST) cell responses to optic-flow components. Perception, 24, 269–285.
Perrone, J. A. (1992). Model for the computation of self-motion in biological systems. Journal of the Optical Society of America A, 9, 1–18.
Prazdny, K. (1980). Egomotion and relative depth map from optical flow. Biological Cybernetics, 36, 87–102.
Regan, D., & Beverly, K. I. (1982). How do we avoid confounding the direction we are looking and the direction we are moving? Science, 215, 194–196.
Rieger, J. H., & Lawton, D. T. (1985). Processing differential image motion. Journal of the Optical Society of America A, 2, 354–359.
Rieger, J. H., & Toet, L. (1985). Human visual navigation in the presence of 3D rotations. Biological Cybernetics, 52, 377–381.
Rodman, H. R., & Albright, T. D. (1987). Coding of visual stimulus velocity in area MT of the macaque. Vision Research, 27, 2035–2048.
Royden, C. S. (1994). Analysis of misperceived observer motion during simulated eye rotations. Vision Research, 34, 3215–3222.
Royden, C. S., Banks, M. S., & Crowell, J. A. (1992). The perception of heading during eye movements. Nature, 360, 583–585.
Royden, C. S., Crowell, J. A., & Banks, M. S. (1994). Estimating heading during eye movements. Vision Research, 34, 3197–3214.
Simpson, W. A. (1988). Depth discrimination from optic flow. Perception, 17, 497–512.
Simpson, W. A. (1993). Optic flow and depth perception. Spatial Vision, 7, 35–75.
Tal, D., & Schwartz, E. (1994). Weber-Fechner transduction: A logarithmic compressive nonlinearity is a generic property of integrate and fire neurons. In Proceedings of the World Congress on Neural Networks (Vol. 4, pp. 360–367). Hillsdale, NJ: Erlbaum.
Tanaka, K., & Saito, H. (1989). Analysis of motion of the visual field by direction, expansion/contraction, and rotation cells clustered in the dorsal part of the medial superior temporal area of the macaque monkey. Journal of Neurophysiology, 62, 626–641.
Treue, S., Andersen, R. A., Ando, H., & Hildreth, E. C. (1995). Structure-from-motion: Perceptual evidence for surface interpolation. Vision Research, 35, 139–148.
Ungerleider, L. G., & Desimone, R. (1986). Cortical connections of visual area MT in the macaque. Journal of Comparative Neurology, 248, 190–222.
van den Berg, A. V. (1992). Robustness of perception of heading from optic flow. Vision Research, 32, 1285–1296.
van den Berg, A. V. (1993). Perception of heading. Nature, 365, 497–498.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Wallach, H. (1976). On perception. New York: Quadrangle.
Wang, H. T., Mathur, B. P., & Koch, C. (1989). Computing optical flow in the primate visual system. Neural Computation, 1, 92–103.
Warren, W. H., Jr. (1995). Self-motion: Visual perception and visual control. In W. Epstein & S. Rogers (Eds.), Handbook of perception and cognition: Perception of space and motion (pp. 263–325). San Diego: Academic Press.
Warren, W. H., Jr. (1996). The state of flow. In T. Watanabe (Ed.), High level motion processing. Cambridge, MA: MIT Press.
Warren, W. H., Jr., & Hannon, D. J. (1988). Direction of self-motion is perceived from optical flow. Nature (London), 336, 162–163.
Warren, W. H., Jr., & Hannon, D. J. (1990). Eye movements and optical flow. Journal of the Optical Society of America A, 7, 160–169.
Warren, W. H., Jr., & Kurtz, K. J. (1992). The role of central and peripheral vision in perceiving the direction of self-motion. Perception and Psychophysics, 51, 443–454.
Warren, W. H., Jr., Li, L. Y., Ehrlich, S. M., Crowell, J. A., & Banks, M. S. (1996). Perception of heading during eye movements uses both optic flow and eye position information. Investigative Ophthalmology and Vision Science, 37.
Waxman, A. M., & Ullman, S. (1985). Surface structure and three-dimensional motion from image flow kinematics. International Journal of Robotics Research, 4, 72–94.
Waxman, A. M., & Wohn, K. (1988). Image flow theory: A framework for 3-D inference from time-varying imagery. In C. Brown (Ed.), Advances in computer vision (Vol. 1, pp. 165–224). Hillsdale, NJ: Erlbaum.
Zalama, E., Gaudiano, P., & Lopez-Coronado, J. (1995). A real-time, unsupervised neural network for the low-level control of a mobile robot in a nonstationary environment. Neural Networks, 8, 103–123.

Received January 8, 1996; accepted July 15, 1997.
Communicated by Ning Qian
Analysis of Direction Selectivity Arising from Recurrent Cortical Interactions
Paul Mineiro
David Zipser
Cognitive Science Department, University of California at San Diego, La Jolla, CA 92093, U.S.A.
Neural Computation 10, 353–371 (1998)
© 1998 Massachusetts Institute of Technology

The relative contributions of feedforward and recurrent connectivity to the direction-selective responses of cells in layer IVB of primary visual cortex are currently the subject of debate in the neuroscience community. Recently, biophysically detailed simulations have shown that realistic direction-selective responses can be achieved via recurrent cortical interactions between cells with nondirection-selective feedforward input (Suarez et al., 1995; Maex & Orban, 1996). Unfortunately these models, while desirable for detailed comparison with biology, are complex and thus difficult to analyze mathematically. In this article, a relatively simple cortical dynamical model is used to analyze the emergence of direction-selective responses via recurrent interactions. A comparison between a model based on our analysis and physiological data is presented. The approach also allows analysis of the recurrently propagated signal, revealing the predictive nature of the implementation.

1 Introduction

Layer IVB of primate visual cortex has cells whose time-averaged response is maximum for stimuli moving in a particular direction (Kandel, 1991). This is the first point in the primate visual processing pathway where direction selectivity is found. In their seminal paper, Adelson and Bergen (1985) demonstrated that motion detection could in principle be performed by a nonlinear system in which direction selectivity was critically dependent on linear spatiotemporal filtering using oriented spatiotemporal filters. In addition to their mathematical analysis, they proposed a particular implementation in which the oriented spatiotemporal filters necessary for direction selectivity were constructed from nondirection-selective filters in a feedforward manner similar to other contemporary models of direction selectivity (Adelson & Bergen, 1985; Watson & Ahumada, 1985; van Santen & Sperling, 1985).

More recently, biophysically detailed simulations have demonstrated that physiologically realistic direction selectivity could arise from recurrent interactions between cortical cells with nondirection-selective feedforward input (Suarez, Koch, & Douglas, 1995; Maex & Orban, 1996). These models have focused on achieving biological plausibility at the expense of model complexity and consequently are difficult to analyze mathematically. In this article, a relatively simple cortical dynamical model, based on a neural network model optimized to detect motion (Haynes, 1996), is used to investigate the emergence of direction-selective responses via recurrent interactions. Our analysis shows that recurrent interactions can give rise to direction-sensitive responses that form the basis of a direction-selective system. This connects the findings obtained by simulation of recurrent models to the original analytical work of Adelson and Bergen (1985) and, furthermore, provides insight into how the spatial structure of recurrent connections influences direction selectivity. In addition, we show that the simple dynamical model, when appropriately parameterized, is capable of generating realistic direction selectivity. Underlying these direction-selective responses is a predictive recurrently propagating signal, whose implications we discuss further.

Before continuing, the difference between direction-sensitive and direction-selective responses should be clarified. Consider a stimulus of fixed spatial profile that undergoes motion in different directions. A direction-sensitive response has a temporal response profile that depends on the direction of stimulus movement but has the same temporally averaged response to all directions of stimulus movement. In other words, the time-averaged (over the presentation interval) response of a directionally sensitive unit does not depend on direction. The time-averaged response of a direction-selective unit is dependent on the direction of stimulus movement. In their seminal article on motion processing in the fly visual system, Poggio and Reichardt (1976) proved that no linear system can be direction selective, although linear systems can exhibit directionally sensitive responses.
2 Cortical Dynamical Model

Here we introduce a relatively simple cortical dynamical model that will allow an analytical exploration of the emergence of direction selectivity via cortical interactions. The cortex is idealized as a one-dimensional array of cells. Because the model is continuous in space, we are concerned with activity as a function of location, and not with individual neurons. However, this activity is computed in the same way as in simple neuron models. The internal state of a cell of type i at position x in the cortex at time t is represented by a single value a_i(x, t). The cortical activity observed at x and t is given by g(a_i(x, t)), where g is the activation function (e.g., a logistic function). Model cells are leaky integrators of activation. Their time constant τ is not meant to reflect cell parameters such as membrane capacitance but, rather, the effective integration time of the neuron. This effective time constant can be adjusted to an arbitrary value using local excitatory connectivity
(Maex & Orban, 1992). In the model, these local excitatory connections are implicit in the choice of τ.

Model cells receive two independent sources of stimulation: a feedforward component ultimately caused by retinal stimulation and a recurrent component from other cortical cells. The feedforward influence on a cell of type i is given by a deterministic functional f_i, which maps retinal activation patterns r(x, t) to cortical afferent inputs b_i(x, t). For analytical purposes, the exact functional form of f_i is unimportant, but two assumptions are made. First, it is assumed that f_i is not directionally selective. This implies that any direction selectivity of the system will be due to cortical interactions. Second, it is assumed that a moving stimulus will generate cortical input that does not change shape as the stimulus moves, a valid assumption over a small patch of the retina, where f_i can be considered independent of absolute retinal position. This assumption allows us to ignore the details of the mapping from the retina to the cortex and instead concentrate on the response of the cortex to moving afferent input patterns with stable spatial profiles.

In this model, we assume recurrent interactions are so fast as to be effectively instantaneous and that the strength of the interactions between cells depends only on the cell types and their relative positions. Given these assumptions, the lateral interactions are fully described by the recurrent connectivity function W_ij(x), which specifies how the activation of a cell of type i is influenced by the output of a cell of type j at relative position x. These considerations lead to the following spatiotemporal dynamical equations,

\tau \frac{\partial a_i(x,t)}{\partial t} = -a_i(x,t) + b_i(x,t) + \sum_j \int_{-\infty}^{\infty} dx' \, W_{ij}(x - x') \, g(a_j(x', t)), \qquad (2.1)

where b_i(x, t) is the feedforward input to cortical cells of type i, related to the retinal activation by the functional f_i. These equations relate cortical activation, as a function of space and time, to the spatiotemporal afferent input, and thus implicitly to spatiotemporal retinal activation. They are similar to equations used to model ocular dominance formation and cortical instabilities (Murray, 1989). It is sometimes convenient to work with these equations in the frequency domain. Applying spatial Fourier and temporal Laplace operators yields¹

(\tau s + 1)\,\hat{a}_i(k, s) = \hat{b}_i(k, s) + \sum_j \hat{W}_{ij}(k)\,\hat{o}_j(k, s), \qquad (2.2)

¹ The Laplace transform is used in the temporal domain to preserve causality.
where ô_j(k, s) is the Fourier and Laplace transform of g(a_j(x, t)), which when g is nonlinear is nontrivially related to the transform of a_j(x, t). In this article, k denotes spatial frequency, and s denotes (complex-valued) temporal frequency.

In the following section, we analyze the linear version of equation 2.2 for the case of a single cell type. This linear system can exhibit directional sensitivity but not direction selectivity. However, real cortical cells exhibit nonlinearities; for example, they cannot achieve negative firing rates. The introduction of more plausible rectified linear activation functions leads to direction-selective responses.

2.1 Linear Analysis. In this section we examine equation 2.2 for the case of a single cell type that has a simple linear output function g(a) = a. For this case, analytical results can be obtained. In the frequency domain the solution is

\hat{a}(k, s) = \hat{H}(k, s)\,\hat{b}(k, s), \qquad (2.3)

\hat{H}(k, s) = \frac{1}{1 + \tau s - \hat{W}(k)}, \qquad (2.4)
where cell type subscripts have been dropped for clarity. Ĥ is the frequency-space characterization of the cortical kernel, which describes how the cortex responds to afferent input pattern b. The spatiotemporal form of the solution is a convolution of the afferent input pattern b with the cortical kernel H, the inverse transform of Ĥ. From this expression, it follows that this architecture is capable of direction-sensitive responses. Consider the cortical response to a moving afferent unit-amplitude sinusoidal input pattern of spatial frequency k moving at speed v. Because the system is linear, the cortical response is also a sinusoid of spatial frequency k moving at speed v whose amplitude is given by |Ĥ(k, −2πikv)|,

|\hat{H}(k, -2\pi i k v)|^2 = \frac{1}{(1 - \hat{W}_r(k))^2 + (2\pi k v \tau + \hat{W}_i(k))^2}, \qquad (2.5)

where Ŵ_r(k) and Ŵ_i(k) are the real and imaginary components of Ŵ(k), respectively. The amplitude of the response, as a function of v (for fixed k), is largest when 2πkvτ = −Ŵ_i(k), and therefore depends on the sign of v if Ŵ_i(k) ≠ 0. Since the imaginary component of Ŵ(k) arises from the odd component² of W(x), it follows that spatial asymmetries in the lateral connectivity can be used to create cortical direction sensitivity.

² A function A is even if A(x) = A(−x), and odd if A(x) = −A(−x). In general, any function can be decomposed into the sum of an even function and an odd function.
Incidentally, the real part of Ŵ(k), corresponding to the even component of the recurrent connectivity, can be used to adjust the spatial frequency tuning (orientation tuning in two dimensions) to stationary inputs (v = 0). This can be done independently of the direction tuning, which is set by the odd component of the recurrent connectivity.

Because oriented spatiotemporal receptive fields feature prominently in motion energy systems (Adelson & Bergen, 1985), the following question naturally arises: What form of W(x) gives rise to a cortical kernel of the form H(x − v₀t)? From equation 2.4 it follows that the requisite form is W(x) = δ(x) − v₀τδ′(x), which by inspection of equation 2.1 would imply that the asymmetric component of the recurrent connections is computing the spatial derivative of cortical activation. While this is a physically unrealizable recurrent connectivity pattern, it can be well approximated by a derivative-of-gaussian profile.

The linear system described cannot exhibit direction-selective responses and is therefore inadequate as a model of direction selectivity; however, slightly modifying the model by introducing half-wave rectified linear activation functions solves this problem without destroying the insight gained from the linear analysis. The motivation for this is given in the next section. A more serious concern is the direction sensitivity generated by the model as a function of speed. In particular, it is possible to get small direction sensitivity that persists over a wide range of speeds, or large direction sensitivity for only a narrow range of speeds, but not both (see the appendix). This sensitivity will form the basis of direction selectivity in the nonlinear model, and it is not consistent with observed physiological responses, which exhibit strong direction selectivity over a wide range of speeds (Orban, Kennedy, & Bullier, 1986). We have found that by combining nonlinear cell responses with competition between detectors selective for different directions, we can reproduce the observed direction selectivity profiles.

2.2 Rectified Linear Activations. Half-wave rectification means treating negative values as zero:

g(a) = \lfloor a \rfloor = \begin{cases} a & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.6)
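For concreteness, equation 2.1 with the rectified output of equation 2.6 can be integrated numerically. The following sketch is our illustration, not code from the article; the grid, time step, and kernel scale are assumed, and the spatial integration step is absorbed into the kernel amplitude (the kernel is assumed shorter than the grid):

```python
import numpy as np

def simulate_cortex(b, W, tau=1.0, dt=0.05, rectify=True):
    """Euler integration of equation 2.1 for a single cell type, with
    the half-wave rectified output of equation 2.6.

    b : (T, N) afferent input b(x, t) on N cortical positions.
    W : (K,) recurrent connectivity kernel W(x), K < N.
    """
    T, N = b.shape
    a = np.zeros(N)
    out = np.zeros((T, N))
    for t in range(T):
        g = np.maximum(a, 0.0) if rectify else a      # g(a), eq. 2.6
        rec = np.convolve(g, W, mode="same")          # recurrent term
        a += (dt / tau) * (-a + b[t] + rec)           # eq. 2.1
        out[t] = np.maximum(a, 0.0) if rectify else a
    return out

# Derivative-of-gaussian kernel (odd component), which the analysis
# predicts should yield direction sensitivity, driven by a drifting
# sinusoidal input.
x = np.linspace(-3, 3, 61)
W = x * np.exp(-np.pi * x**2)
xs = np.arange(200)
b = np.array([np.sin(2 * np.pi * (xs - 0.5 * t) / 40.0) for t in range(400)])
resp = simulate_cortex(b, W)
```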
Since the activation of a unit is interpreted as a firing rate, half-wave rectification merely represents the impossibility of negative firing rates. Half-wave rectified models of neural firing are used extensively in neural modeling (e.g., Carandini & Ringach, 1997; Heeger, Simoncelli, & Movshon, 1996; Mclean, Robb, & Palmer, 1994; Maex & Orban, 1991; Ferster, 1989).

Insight into the effect of half-wave rectification can be gained by considering the following. Recall that for a drifting sinusoidal cortical input pattern of spatial frequency k drifting with speed v, the amplitude of the cortical
response for the linear system is given by

|\hat{H}(k, -2\pi i k v)|^2 = \frac{1}{(1 - \hat{W}_r(k))^2 + (2\pi k v \tau + \hat{W}_i(k))^2}. \qquad (2.7)
This response is also a sinusoid of spatial frequency k drifting at speed v, so its time average is zero, and the system is not directionally selective. If we were to half-wave rectify this cortical response, we would generate a new signal whose average value would be given by |Ĥ(k, −2πikv)|. In other words, rectification followed by temporal averaging is sensitive to the response amplitude, and in this way directionally selective responses can be constructed from directionally sensitive responses. Of course, the actual effect of introducing half-wave rectified activation functions is more complicated, both because the input contains multiple frequency components and because the rectified response of the linear cortex is not the same as the response of the rectified linear cortex (due to recurrent interactions).

3 A Parameterization That Produces Realistic Direction Selectivity

In this section, reasonable choices for the feedforward input, recurrent connectivity, and neural output function are made, and the response of the model to a standard physiological stimulus is presented. The purpose of this exercise is to establish that the dynamics of equations 2.2, when appropriately parameterized, can account for some observed direction-selectivity properties. All the values in this parameterization are listed in the appendix. Note that the number of cell types is two, and the activation function for cells is half-wave rectification, g(a) = ⌊a⌋.

3.1 Model Parameters.

3.1.1 Retino-cortical transformation. The retino-cortical transformation, which until now has been ignored in order to isolate effects due solely to cortical dynamics, is now modeled explicitly as linear spatiotemporal filtering, with receptive field F_i(x, t):

b_i(x, t) = \int_0^{\infty} dt' \int_{-\infty}^{\infty} dx' \, F_i(x - x', t - t')\, r(x', t'). \qquad (3.1)
Since the main goal is to understand the emergence of direction-selective responses from cortical interactions, only separable feedforward receptive fields, which are not direction selective, will be considered. This implies F_i(x, t) = X_i(x)T(t) (no difference in temporal profiles between cell types is assumed).
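Under this separability assumption, equation 3.1 reduces to a spatial convolution followed by a causal temporal convolution. A minimal discrete sketch (ours; step sizes and names are assumed):

```python
import numpy as np

def afferent_input(r, X, T, dx=1.0, dt=1.0):
    """Separable version of equation 3.1: filter the retinal activation
    r(x, t) with F(x, t) = X(x) T(t).

    r : (nt, nx) retinal activation.
    X : (kx,) spatial envelope; T : (kt,) causal temporal envelope.
    """
    nt, nx = r.shape
    # Spatial convolution at each time step.
    b = np.array([np.convolve(r[t], X, mode="same") * dx for t in range(nt)])
    # Causal temporal filtering at each position (T[0] is t' = 0).
    for x in range(nx):
        b[:, x] = np.convolve(b[:, x], T, mode="full")[:nt] * dt
    return b
```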
3.1.2 Feedforward spatial envelope. Cortical cells, in addition to being direction selective, are typically also spatial frequency (orientation) selective. Spatial frequency tuning can be generated by suitable choice of the even component of the recurrent connectivity; however, since the main focus of this article is direction selectivity, we will generate spatial frequency tuning in the model exclusively by the feedforward input. The spatial envelope X(x) is therefore chosen to be a Gabor sinusoid, given by

X_1(x) = X_2(-x) = \frac{C}{\sigma_x}\, e^{-\pi (x/\sigma_x)^2} \sin(2\pi f x + \phi), \qquad (3.2)
where C is dimensionless. This choice, in addition to providing spatial frequency tuning, is motivated by the observation that Gabor sinusoids provide a good approximation to the instantaneous 1D spatial profile of simple cell receptive fields along the preferred direction (de Valois & de Valois, 1990). Figure 1 shows the actual spatial envelope used for the leftward-selective cell type; the rightward-selective cell type has the mirror-image spatial envelope.

3.1.3 Feedforward temporal envelope. Speed-varying experiments have revealed several categories of velocity-sensitive cells in layer IVb, based on how a cell’s response varies as a function of the speed of a stimulus moving in the preferred direction. Most cells observed in central V1 are either velocity low pass (roughly 70%) or velocity tuned (roughly 10%), although other types become more prevalent in the periphery (Orban et al., 1986). Velocity low-pass cells respond well to low speeds and have responses that fall off as speed increases, while velocity-tuned cells show a strong response at a particular speed (see Figure 3). Following Maex and Orban (1992), the two different cell types will be accounted for by changing the temporal profile of the feedforward input, either a sustained or transient profile. The feedforward stimulation from a bright bar held in the receptive field of a cell with a sustained profile is constant.³ A cell with a transient profile will receive feedforward stimulation when the bar is first presented, but the stimulation will adapt (decay) with a time constant τ_r. Physiologically, the sustained and transient profiles can be identified with the parvocellular and magnocellular information streams, respectively. Mathematically, the sustained and transient profiles are given by
T_s(t) = \delta(t) \qquad (3.3)

T_{tr}(t) = \delta(t) - \frac{1}{\tau_r}\, e^{-t/\tau_r}. \qquad (3.4)

³ Equivalently, it adapts at a time scale much longer than the stimulus duration.
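Discrete versions of the envelopes in equations 3.2 through 3.4 might look as follows; the parameter values are placeholders, since the article's actual values appear in its appendix:

```python
import numpy as np

def gabor_envelope(x, C=1.0, sigma_x=1.0, f=1.0, phi=0.0):
    """Gabor sinusoid of equation 3.2 (leftward cell type; the
    rightward type is its mirror image)."""
    return (C / sigma_x) * np.exp(-np.pi * (x / sigma_x) ** 2) \
        * np.sin(2 * np.pi * f * x + phi)

def temporal_envelope(t, dt, sustained=True, tau_r=1.0):
    """Sustained (eq. 3.3) or transient (eq. 3.4) temporal profile on a
    discrete grid; the delta function becomes a unit-area impulse."""
    T = np.zeros_like(t, dtype=float)
    T[0] = 1.0 / dt                               # delta(t)
    if not sustained:
        T -= (1.0 / tau_r) * np.exp(-t / tau_r)   # adaptation term
    return T
```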
Figure 1: (a) Gabor sinusoidal feedforward spatial profile. (b) Gaussian derivative recurrent connectivity pattern. Cortical units receive excitatory connections from units on their right and inhibitory connections from units on their left. Intuitively, this recurrent connectivity should lead to leftward direction selectivity.
A network with sustained input will be sensitive to low velocities, whereas a network with transient input will not respond to slowly moving stimuli, corresponding to the velocity low-pass and velocity-tuned cells, respectively.

3.1.4 Cortical connectivity. Recall that the odd component of the recurrent connectivity is responsible for the emergence of direction selectivity and that a perfectly oriented cortical spatiotemporal kernel arises when the physically unrealizable δ′(x) recurrent connection pattern is used. A convenient approximation to the δ′(x) function used for the recurrent connectivity
in the model is the derivative of gaussian (see Figure 1),

W_{11}(x) = W_{22}(-x) = \frac{2\pi v_0 \tau\, x}{\sigma_w^3}\, e^{-\pi (x/\sigma_w)^2}, \qquad (3.5)

where τ is the effective time constant of the network and v₀ has dimensions of velocity. The connectivity pattern between the two direction-selective components is chosen to be symmetric, here an inhibitory gaussian of width σ_w:

W_{12}(x) = W_{21}(x) = -\frac{B}{\sigma_w}\, e^{-\pi (x/\sigma_w)^2}. \qquad (3.6)
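Both kernels can be tabulated directly from equations 3.5 and 3.6; a sketch with placeholder parameter values (the article's values are in its appendix), assuming a grid symmetric about zero:

```python
import numpy as np

def recurrent_kernels(x, v0=1.0, tau=1.0, sigma_w=1.0, B=1.0):
    """Recurrent connectivity of equations 3.5 and 3.6: a derivative of
    gaussian within each direction-tuned population and a gaussian
    inhibition between the two populations.

    x : sample points, assumed symmetric about 0 so that reversing
        the array implements W(-x).
    """
    gauss = np.exp(-np.pi * (x / sigma_w) ** 2)
    W11 = (2 * np.pi * v0 * tau * x / sigma_w ** 3) * gauss  # eq. 3.5
    W22 = W11[::-1]                                          # mirror image
    W12 = W21 = -(B / sigma_w) * gauss                       # eq. 3.6
    return W11, W22, W12, W21
```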
The gaussian form of the cross-connectivity in equation 3.6 was motivated by the anticipation of winner-take-all dynamics between the cell types. Localized inhibitory interactions between differently tuned direction-selective cells have been previously motivated by both physiological and psychophysical experiments (Qian, Anderson, & Adelson, 1994; Qian & Anderson, 1994).

3.2 Response Properties.

3.2.1 Receptive field. The spatiotemporal receptive field of a neuron represents how a quickly flashed dot at a particular spatial location affects the neuron’s future activity; it is equivalent to the impulse response of the neuron. Using reverse correlation techniques, researchers have mapped the spatiotemporal receptive fields of directionally selective simple cells in area 17 of the cat (Mclean et al., 1994). In the model, the spatiotemporal receptive fields of the units can be analytically obtained if the units are linear or, equivalently, if the dynamics with rectified units are linearized around a positive cortical response profile. The response of cell type 1 to a retinal activation pattern r(x, t) in the frequency domain is given by

\hat{o}_1(k, s) = \hat{H}_1(k, s)\,\hat{r}(k, s), \qquad (3.7)

\hat{H}_1(k, s) = \hat{T}(s)\, \frac{\hat{X}_1(k)\left(1 + s - \hat{W}_{22}(k)\right) + \hat{W}_{12}(k)\,\hat{X}_2(k)}{\hat{W}_{12}(k)\,\hat{W}_{21}(k) - \left(1 + s - \hat{W}_{11}(k)\right)\left(1 + s - \hat{W}_{22}(k)\right)}, \qquad (3.8)
and therefore the spatiotemporal receptive field of a cell of type 1 is given by the inverse transform of Ĥ₁(k, s). This is shown in Figure 2 (for this parameterization, a velocity low-pass leftward-selective model cell) along with a spatiotemporal receptive field from a direction-selective cell in area 17. Note that the model cell’s receptive field is similar to the later portion of the real cell’s receptive field—the part that can be plausibly attributed to
Figure 2: (a) Spatiotemporal receptive field for a direction-selective simple cell from cat area 17. (b) Spatiotemporal receptive field for a model cell with tonic input, given in terms of dimensionless space (x/σ_w) and time (t/τ). Dotted contours enclose inhibitory regions, and solid contours enclose excitatory regions. The model cell’s receptive field closely matches the later part of the physiological data, the part presumably created by recurrent interactions. Data from Mclean et al. (1994).
recurrent interactions. This is the part of the receptive field that is oriented and therefore responsible for direction selectivity. The initial part of the real cell’s receptive field is not reproduced by the model; this part is presumably primarily due to the feedforward input, the detailed modeling of which is not the goal of this article.⁴

3.2.2 Response to moving pulse. Figure 3 compares the response properties of cells in central macaque V1 to half-wave rectified model cell responses.⁵ The stimulus presented to the network is a moving pulse, which is the one-dimensional analog of a moving bar:

r(x, t) = \begin{cases} 1 & 0 < x - vt < l \\ 0 & \text{otherwise} \end{cases} \qquad (3.9)

The parameter l sets the width of the pulse. The model accurately reconstructs the observed velocity tuning profiles for both velocity low-pass and velocity-tuned neurons. Furthermore, the model reproduces the observed direction-selectivity properties of cortical cells. Physiologists have defined the direction-selectivity index (DI) as

\mathrm{DI} = 1 - \frac{\text{null direction response}}{\text{preferred direction response}}, \qquad (3.10)
and it is typically expressed as a percentage. DI is a function of speed, and for most cells it is low at low speeds, rises to a larger value and is approximately independent of speed for roughly two orders of magnitude, and decreases at high speeds (see Figure 3). Model cells display this characteristic DI curve, in contrast to the linear model. The difference is due to the nonlinear cell responses and inhibitory interaction between cell types, which creates winner-take-all dynamics between the two cell types. This results in suppression of the null direction response without significant change in the preferred direction response.

The direction-selective responses obtained in the model can be heuristically understood by examining the spatiotemporal pattern of firing across the cortex in response to a moving pulse (see Figure 4). Motion in both directions evokes a response when the stimulus is inside a cell’s feedforward receptive field. In addition, after transient activity⁶ has decayed, motion in

⁴ A more complex neural network model, employing essentially the same solution but with recurrent coupling chosen by optimization, produces extremely realistic spatiotemporal receptive fields (Haynes, 1996).
⁵ The agreement in Figure 3 might seem surprising given that the model successfully explains only the later portion of the receptive field. However, the curves in Figure 3 are concerned with the peak responses of the cells.
⁶ Transient uncertainty about the direction of motion is unavoidable in any realizable motion detection scheme, as it takes finite time to estimate the velocity of a stimulus.
Figure 3: A comparison of the response properties of cells in central macaque V1 (experiment) and half-wave rectified cells in a network with directionally opponent inhibition (model). (a) Peak response versus speed for a velocity low-pass cell in response to a slit moving in the preferred direction. (b) Peak output versus dimensionless speed (vσ_w/τ) in response to a pulse moving in the preferred direction, sustained input case. (c) Direction-selectivity index (DI) versus speed averaged over an ensemble of cells in a central V1 sample. (d) DI versus dimensionless speed for a model cell, sustained input case. (e) Peak response versus speed for a velocity-tuned cell in response to a slit moving in the preferred direction. (f) Peak output versus dimensionless speed in response to a pulse moving in the preferred direction, transient input case. Experimental traces are from Orban et al. (1986).
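The DI curves in panels c and d follow directly from equation 3.10; as a one-line helper (our illustration):

```python
def direction_index(preferred, null):
    """Direction-selectivity index of equation 3.10, expressed as a
    percentage of the preferred-direction response."""
    return 100.0 * (1.0 - null / preferred)
```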
Figure 4: Spatiotemporal pattern of firing across the cortex in response to a rightward-moving pulse, for the two component network of Figure 3. Shown are (a) leftward-tuned cells and (b) rightward-tuned cells. Both show the most vigorous response when the pulse is inside their feedforward receptive field. However, for rightward-tuned cells, the stimulus is preceded by a wave of excitation; this recurrently propagated excitation is absent in the leftward-tuned cells.
the preferred direction generates a wave of recurrently propagated excitation, which precedes the stimulus. Motion in the null direction does not generate this recurrent stimulation, and in fact recurrent inhibition acts to inhibit cellular response. Note that the spatial extent of the recurrently propagated excitation decreases as stimulus speed increases.

The predictive nature of the recurrently propagated signal is especially intriguing. First, it suggests the recurrent connections could be developed by a correlation-based learning rule such as Hebbian learning. Furthermore, the recurrent signal is reminiscent of recent proposals of predictive perception, where neural delays are compensated for by extrapolation from the current sensory data (Nijhawan, 1997). In addition, this recurrent signal is potentially useful for overcoming dynamic and figural noise. Figure 5 shows the network response to a moving pulse that is briefly occluded. The recurrent signal is maintained during the occlusion, allowing for rapid restoration of the response.

4 Discussion

The work presented here provides insight into how cortical interactions could give rise to directionally selective responses. The critical component is the odd (spatially asymmetric) component of the recurrent connectivity, which gives rise to directionally sensitive responses that can be understood using linear analysis. Delay in the propagation of the recurrent signal is unnecessary (i.e., the recurrent connections are not delay lines). Rather, the integration time constant of the local microcircuit and the spatial extent and strength of the recurrent connectivity establish a preferred speed. This distinguishes the implementation discussed here from delay-based Reichardt detector mechanisms (Reichardt, 1961).

The model demonstrates that physiologically plausible direction selectivity can be achieved via inhibitory interactions between oppositely tuned cells with rectified linear responses, an observation previously made by Maex and Orban (1991) using a simpler two-neuron model. Recurrent inhibitory interactions between differently tuned cells are a common feature of computational models of visual cortex, often invoked to explain nonlinear nonspecific suppression effects found in both physiological and psychophysical experiments (Qian et al., 1994; Qian & Anderson, 1994; Heeger et al., 1996). Here the subtractive inhibitory interaction between the two directions serves to dampen the response in the nonpreferred direction without affecting the preferred direction response, increasing direction selectivity and causing the DI to be approximately constant over a wide range of velocities. Interestingly, experiments in cat visual cortex have revealed that linear predictors robustly predict preferred direction response magnitude but overestimate the nonpreferred response (Reid, Soodak, & Shapley, 1991).

The analysis and insights gained by studying the model extend to the two (spatial)-dimensional case. The one-dimensional model analyzed here can
Figure 5: Spatiotemporal pattern of firing in response to a rightward-moving pulse that is briefly occluded, for the two-component network of Figure 3. Only the responses of rightward-tuned units are shown. Note that the recurrent signal maintains its integrity during the occlusion, allowing rapid restoration of the response.
be interpreted as representing the cortical dynamics and connectivity along the axis defined by the preferred direction of the cell. With this interpretation, the model makes the following predictions about the cortical connectivity subserving direction selectivity. First, connections between cells with similar direction preference should be concentrated along the axis defined by the preferred direction of the cell, and relatively more excitatory in the preferred direction than in the antipreferred direction. Second, connections between cells with differing direction preference should be inhibitory and spatially isotropic. This is consistent with what is known about lateral connectivity in V1: localized connections form a dense inhibitory “halo,” while longer-range connectivity is patchy and tends to be distributed in an oval pattern (Amir, Harel, & Malach, 1993). In the parameterization of the model presented, spatial frequency (orientation) tuning arose due to the feedforward input, while direction selectivity
was generated by the feedback connections. This choice was motivated by the desire to isolate the effects of recurrent connections on direction selectivity. Like direction selectivity, the extent to which orientation tuning is generated by feedforward and feedback mechanisms is currently the subject of debate in the neuroscientific community (Ferster, Chung, & Wheat, 1996; Somers, Nelson, & Sur, 1995; Nelson, Toth, Sheth, & Sur, 1994). Thus, the model presented here represents one extreme in the space of relative contributions of feedforward and feedback dynamics to orientation tuning and direction selectivity. There are, of course, other possibilities; for example, it has been suggested that recurrent circuitry in the cortex acts to amplify a weak orientation bias in the cortical input (Somers et al., 1995). Similarly, the directionally selective cortical response analyzed here could be operating in conjunction with feedforward mechanisms. An intriguing result of interest to recurrent models of orientation selectivity is that the spatial frequency tuning to static images can be manipulated by the even component of the recurrent connectivity, similar to how direction selectivity can be affected by the odd component. In fact, a recent model of recurrent orientation selectivity has been independently developed using dynamics extremely similar to those employed here, although it considered connections between different types of orientation selective cells at a single spatial location (Carandini & Ringach, 1997). Appendix A.1 Direction Selectivity for the Linear Single Species Model. Physiologists classify cells according to their direction selectivity index (DI), defined as 1 minus the ratio of the peak response in the null and preferred directions (typically expressed as a percentage). In the linear models, the DI in response to drifting sinusoids can be found from H:
DI(k, v) = 1 − |Ĥ(k, 2πikv)| / |Ĥ(k, −2πikv)|.  (A.1)

Here only the single dimension along the preferred direction is considered. For the single-species interaction case, this is of the form

DI(k, v) = 1 − [ (1 + (a(k)v − b(k))²) / (1 + (a(k)v + b(k))²) ]^{1/2}  (A.2)

a(k) = 2πkτ / (1 − Ŵr(k))  (A.3)

b(k) = Ŵi(k) / (1 − Ŵr(k)).  (A.4)
The peak value of DI and the second derivative of the DI at the peak are

vmax(k) = √(1 + b(k)²) / a(k)  (A.5)

DI(k, vmax(k)) = 1 + b(k) − √(1 + b(k)²)  (A.6)

∂²DI/∂v² (k, v)|_{v=vmax(k)} = − a²(k) b(k) / ( √(1 + b(k)²) (b(k) + √(1 + b(k)²)) ).  (A.7)
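As a quick numerical illustration of equations A.2 through A.6, the sketch below evaluates the DI as a function of speed and checks the peak location and height against the closed forms. The values chosen for a(k) and b(k) are arbitrary illustrative numbers, not parameters from this article.

```python
import numpy as np

def di(v, a, b):
    """Direction index of equation A.2 at speed v, for fixed a = a(k), b = b(k)."""
    return 1.0 - np.sqrt((1.0 + (a * v - b) ** 2) / (1.0 + (a * v + b) ** 2))

a, b = 1.0, 2.0                              # illustrative values only
v = np.linspace(0.01, 10.0, 2000)
v_max = np.sqrt(1.0 + b ** 2) / a            # equation A.5
di_max = 1.0 + b - np.sqrt(1.0 + b ** 2)     # equation A.6

print("peak speed: numeric", v[np.argmax(di(v, a, b))], " analytic", v_max)
print("peak DI:    numeric", di(v, a, b).max(), " analytic", di_max)
```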
It follows that as the maximum DI is increased by increasing b(k), at constant vmax, the negative curvature at the maximum becomes larger, causing the DI to become more sharply peaked. Although a simultaneously large and broadly peaked DI could possibly be achieved using an expansive saturating nonlinear output function, cortical simple cells are well approximated as (rectified) linear mechanisms (Mclean et al., 1994). Thus iso-direction interactions alone are probably not responsible for observed cortical direction selectivity.

A.2 Parameter Values. All the figures in this article were generated using the following set of parameters:

B = directionally opponent inhibition strength = 1
C = feedforward input strength = 10
f = feedforward Gabor spatial frequency = 0.8
φ = feedforward Gabor spatial phase = 0
l = pulse width = 0.75
σx = feedforward Gabor spatial scale = 1
σw = lateral connectivity spatial scale = 1.5
τ = effective cellular time constant = 1
τr = feedforward transient input time constant = 0.0625
v0 = lateral connectivity strength = 2

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am., 2(2), 284–299.
Amir, Y., Harel, M., & Malach, R. (1993). Cortical hierarchy reflected in the organization of intrinsic connections in the Macaque monkey visual cortex. J. Comparative Neurology, 334, 19–46.
Carandini, M., & Ringach, D. L. (1997). Some properties of recurrent models of orientation selectivity. In Proceedings of the Computation and Neural System Conference. San Diego: Academic Press.
de Valois, R., & de Valois, K. (1990). Spatial vision. New York: Oxford University Press.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380(6571), 249–252.
Ferster, D. L. (1989). The synaptic inputs to simple cells of the cat visual cortex. In D. Lam & C. Gilbert (Eds.), Neural mechanisms of visual perception. Woodlands, TX: Portfolio Pub. Co.
Haynes, K. A. (1996). A model of local motion processing in primary visual cortex. Unpublished doctoral dissertation, University of California at San Diego.
Heeger, D. J., Simoncelli, E. P., & Movshon, J. A. (1996). Computational models of cortical visual processing. Proceedings of the National Academy of Sciences, 93(2), 623–627.
Kandel, E. (1991). Perception of motion, depth, and form. In E. Kandel, J. Schwarz, & T. Jessel (Eds.), Principles of neural science (Chap. 30). Amsterdam: Elsevier.
Maex, R., & Orban, G. A. (1991). Subtraction inhibition combined with a spiking threshold accounts for cortical direction selectivity. Proc. Nat. Acad. Sci., 88, 3549–3553.
Maex, R., & Orban, G. A. (1992). A model circuit for cortical temporal low-pass filtering. Neural Computation, 4(6), 923–945.
Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. Journal of Neurophysiology, 75(4), 1515–1545.
Mclean, J., Raab, S., & Palmer, L. A. (1994). Contribution of linear mechanisms to the specification of local motion by simple cells in areas 17 and 18 of the cat. Visual Neuroscience, 11(2), 271–294.
Murray, J. D. (1989). Mathematical biology. Berlin: Springer-Verlag.
Nelson, S., Toth, L., Sheth, B., & Sur, M. (1994). Orientation selectivity of cortical neurons during intracellular blockade. Science, 265(5173).
Nijhawan, R. (1997). Visual decomposition of colour through motion extrapolation. Nature, 386(6620), 66–69.
Orban, G. A., Kennedy, H., & Bullier, J. (1986). Velocity sensitivity and direction selectivity of neurons in areas V1 and V2 of the monkey. Journal of Neurophysiology, 56(2), 462–480.
Poggio, T., & Reichardt, W. (1976). Visual control of orientation behavior in the fly. Quarterly Reviews of Biophysics, 9(3), 377–438.
Qian, N., & Anderson, R. A. (1994). Transparent motion perception as detection of unbalanced motion signals II: Physiology. Journal of Neuroscience, 14(12).
Qian, N., Anderson, R. A., & Adelson, E. H. (1994). Transparent motion perception as detection of unbalanced motion signals I: Psychophysics. Journal of Neuroscience, 14(12).
Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In W. A. Rosenblith (Ed.), Sensory communication (Chap. 17). New York: Wiley.
Reid, R. C., Soodak, R. E., & Shapley, R. M. (1991). Directional selectivity and spatiotemporal structure of receptive fields of simple cells in cat striate cortex. Journal of Neurophysiology, 66(2), 505–529.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortex. Journal of Neuroscience, 15(8), 5448–5465.
Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex within the framework of the canonical microcircuit. Journal of Neuroscience, 15(10), 6700–6719.
van Santen, J. P., & Sperling, G. (1985). Elaborated Reichardt detectors. J. Opt. Soc. Am., 2(2), 300–320.
Watson, A. B., & Ahumada, A. J. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am., 2(2), 322–341.

Received August 29, 1996; accepted June 20, 1997.
Communicated by Terrence Sanger
Statistically Efficient Estimation Using Population Coding Alexandre Pouget Georgetown Institute for Computational and Cognitive Sciences, Georgetown University, Washington, DC 20007-2197, U.S.A.
Kechen Zhang Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A.
Sophie Deneve Georgetown Institute for Computational and Cognitive Sciences, Georgetown University, Washington, DC 20007-2197, U.S.A.
Peter E. Latham Department of Neurobiology, University of California at Los Angeles, Los Angeles, CA 90095-1763, U.S.A.
Coarse codes are widely used throughout the brain to encode sensory and motor variables. Methods designed to interpret these codes, such as population vector analysis, are either inefficient (the variance of the estimate is much larger than the smallest possible variance) or biologically implausible, like maximum likelihood. Moreover, these methods attempt to compute a scalar or vector estimate of the encoded variable. Neurons are faced with a similar estimation problem. They must read out the responses of the presynaptic neurons, but, by contrast, they typically encode the variable with a further population code rather than as a scalar. We show how a nonlinear recurrent network can be used to perform estimation in a near-optimal way while keeping the estimate in a coarse code format. This work suggests that lateral connections in the cortex may be involved in cleaning up uncorrelated noise among neurons representing similar variables.
1 Introduction

Many sensory and motor variables in the brain are encoded with coarse codes, that is, through the activity of large populations of neurons with broad tuning to the variables. For instance, direction of visual motion is believed to be encoded in the medial temporal (MT) visual area by the responses of a large number of cells with bell-shaped tuning to direction, as illustrated in Figure 1A (Maunsell & Van Essen, 1983).
Figure 1: (A) Idealized tuning curves for 16 direction-tuned neurons. (B) Noiseless pattern of activity (o) from 64 simulated neurons with tuning curves like the ones shown in A, when presented with a direction of 180◦ . (C) Same as in B but in the presence of gaussian noise.
In response to an object moving along a particular direction, the pattern of activity across such a population would look like a noisy hill of activity (see Figure 1C). On the basis of this activity vector, A, the best that can be done is to recover the conditional probability distribution of the direction of motion, θ, given the activity, p(θ | A) (Anderson, 1994; Zemel, Dayan, & Pouget, 1998). A slightly less ambitious goal is to come up with a good guess, or estimate, θ̂, of the direction, θ, given the activity. Because of the stochastic nature of the noise, the estimator is a random variable; that is, for the same image, θ̂ will vary from trial to trial. A good estimator should be unbiased; the conditional mean of the estimator, E[θ̂ | θ], should be equal to the true direction, θ. Furthermore, this unbiased estimator should have the smallest possible conditional variance, E[(θ̂ − θ)² | θ], because the variance determines how well two similar directions can be discriminated using this estimator (Green & Swets, 1966; Paradiso, 1988). This conditional variance is bounded below by the Cramér-Rao bound, which depends on the noise level and the tuning curves of the units (Paradiso, 1988; Papoulis, 1991). Typically, computationally simple estimators, such as the optimum linear estimator (OLE) (Baldi & Heiligenberg, 1988; Pouget, Fisher, & Sejnowski, 1993), are not efficient, in the statistical sense that their variances are several times the bound. By contrast, Bayesian or maximum likelihood (ML) estimators (which are equivalent for the case under consideration in this article) can reach this bound but require more complex calculations (Paradiso, 1988; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994). These decoding techniques are valuable for a neurophysiologist interested in reading out the population code, but they are not directly relevant for understanding how neural circuits perform estimation. In particular, they all provide the estimate in a format that is incompatible with what we know of sensory representations in the cortex. For example, cells in V4 are estimating orientation from the noisy responses of orientation-tuned V1 cells, but, unlike ML or OLE, which provide a scalar estimate, V4 neurons
retain orientation in a coarse code format, as demonstrated by the fact that V4 cells are just as broadly tuned to orientation as V1 neurons (Desimone, Schein, Moran, & Ungerleider, 1985). Such coarse codes have several computational advantages over scalar representations, and it is important to understand how they are maintained throughout the cortex (Hinton, 1992). Therefore, it seems that a theory of estimation in biological networks should have two critical characteristics: (1) it should preserve the estimate in a coarse code, and (2) it should be efficient, that is, the variance should be close to the Cramér-Rao bound. This article describes a model that satisfies these two requirements. Our model uses lateral connections in a nonlinear recurrent network of direction-tuned neurons to come up with an ML estimate of direction in a coarse code format. We also show how linear recurrent networks are related to the population vector estimator used by Georgopoulos, Kalaska, Caminiti, and Massey (1982), and we provide a performance comparison between various network architectures and classical estimation methods such as OLE and ML. In this article, we first describe how we generated the neuronal patterns of activity used in the simulations. Then we review four estimators that have been previously used in the literature to decode such patterns. Next, we consider linear and nonlinear networks with lateral connections, and we show how they can be used as estimators. We report the results of simulations in which we compared the performance of a nonlinear network to the classical methods. Finally, we show analytically the relation between the nonlinear network and maximum likelihood.

2 Model of Neuronal Responses

The simulations involve estimating the value of the direction of a moving bar based on the activity, A = {ai}, of 64 input units with bell-shaped tuning to direction corrupted by noise. The tuning function of unit i, fi(θ), which is the same as the conditional mean response, E[ai | θ], was given by:

fi(θ) = α exp(β(cos(θ − θi) − 1)) + γ.  (2.1)
This function is known as the circular normal distribution. Its profile is very similar to a gaussian, but it is periodic. α corresponds to the mean peak response, β to the width of the tuning curve, and γ to the mean spontaneous activity of each unit. Cortical neurons commonly show spontaneous activity, although the amplitude of this activity varies from one cortical area to the next. The peaks of the tuning curves, θi, were evenly spread over the interval [0°, 360°]. The activity ai depended on the noise distribution. We used two types of noise, normally distributed with fixed variance, σn²:

P(ai = a | θ) = (1/√(2πσn²)) exp(−(a − fi(θ))² / (2σn²)),
or Poisson distributed:

P(ai = k | θ) = fi(θ)^k e^{−fi(θ)} / k!.
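Under the stated assumptions, this response model is a few lines of NumPy. The sketch below is illustrative rather than the authors' code: the parameter values (α = 38, β = 7, γ = 3.8, σn = 5.8, N = 64) are the ones quoted later in the text, and the function names are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 64
theta_pref = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)  # peaks theta_i
alpha, beta, gamma = 38.0, 7.0, 3.8   # peak rate, tuning width, spontaneous rate
sigma_n = 5.8                         # std of the fixed-variance gaussian noise

def tuning(theta):
    """Circular normal tuning curves f_i(theta) of equation 2.1."""
    return alpha * np.exp(beta * (np.cos(theta - theta_pref) - 1.0)) + gamma

def response(theta, noise="gaussian"):
    """One noisy population pattern A = {a_i} for a direction theta (radians)."""
    f = tuning(theta)
    if noise == "gaussian":
        return f + sigma_n * rng.standard_normal(N)
    return rng.poisson(f).astype(float)          # Poisson alternative

A = response(np.deg2rad(180.0))                  # a noisy hill as in Figure 1C
```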
Figure 1C shows a typical pattern of activity with gaussian noise (σn² = 7). Note that the noise is in the activity level, ai, not in θ. On any given trial, θ is assumed to have a given value; i.e., the probability distribution over θ, P(θ), is assumed to be a Dirac function.

3 Classical Decoding Methods

We now review four different methods for decoding patterns of neural activity: maximum likelihood (ML), optimum linear estimator (OLE), center of mass (COM), and complex estimator (COMP). We indicate in each case how we computed the variance of these estimators. Simulation results and comparison with the recurrent network architecture are provided in the following sections.

3.1 Maximum Likelihood (ML). The ML estimate is defined as:

θ̂ML = arg max_θ P(A | θ).
With independent noise between units, finding the ML estimate reduces to curve fitting, or template matching (Paradiso, 1988; Lehky & Sejnowski, 1990; Wilson & McNaughton, 1993). One needs to find the noise-free hill that minimizes the distance from the data, where the distance metric is determined by the distribution of the noise (if the noise is gaussian, the appropriate distance is the Mahalanobis norm; Duda & Hart, 1973). This step involves a nonlinear regression, which is typically performed by moving the position of the hill until the distance from the data is minimized (see Figure 2B). The position of the peak of the final hill corresponds to the ML estimate. With a large number of units, this estimate is unbiased, and its variance is equal to the Cramér-Rao bound (Paradiso, 1988; Papoulis, 1991; Seung & Sompolinsky, 1993):

E[(θ̂ML − θ)²] = 1/I,

where

I = E[−(∂²/∂θ²) log P(A | θ)].

If we assume independent noise across units, then:

I = Σ_{i=1}^{N} E[−(∂²/∂θ²) log P(ai | θ)].  (3.1)
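For gaussian noise of fixed variance, the template-matching reading of ML amounts to sliding the noise-free hill along the direction axis until the squared distance to the data is minimized. A brute-force sketch, with a grid search standing in for the nonlinear regression and reusing tuning() and A from the previous sketch:

```python
import numpy as np

def ml_estimate(A, grid=np.deg2rad(np.arange(0.0, 360.0, 0.1))):
    """Grid-search ML for gaussian noise: pick the direction whose expected
    hill f(theta) is closest to the observed pattern A in squared distance."""
    errors = [np.sum((A - tuning(th)) ** 2) for th in grid]
    return grid[int(np.argmin(errors))]

theta_hat_ml = np.rad2deg(ml_estimate(A))
```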
Figure 2: (A) The complex estimator uses the phase of the first Fourier component of the input pattern (solid line) as an estimate of direction. It is equivalent to fitting a cosine function to the input. (B) The ML estimate is found by moving an “expected” hill of activity (dotted line) until the squared distance with the data is minimized (solid line).
For normally distributed noise with fixed variance, σn²:

I = Σ_{i=1}^{N} fi′(θ)² / σn²,  (3.2)

and for Poisson distributed noise (Seung & Sompolinsky, 1993):

I = Σ_{i=1}^{N} fi′(θ)² / fi(θ).  (3.3)
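Equations 3.2 and 3.3 translate directly into code. A sketch computing the Cramér-Rao bound on the standard deviation, 1/√I, for both noise models (numerical differentiation of the tuning curves; names as in the earlier sketches):

```python
import numpy as np

def fisher_information(theta, noise="gaussian", dtheta=1e-4):
    """Fisher information of equation 3.2 (gaussian) or 3.3 (Poisson)."""
    f_prime = (tuning(theta + dtheta) - tuning(theta - dtheta)) / (2.0 * dtheta)
    if noise == "gaussian":
        return np.sum(f_prime ** 2) / sigma_n ** 2
    return np.sum(f_prime ** 2 / tuning(theta))

theta0 = np.deg2rad(180.0)
for model in ("gaussian", "poisson"):
    bound = np.rad2deg(1.0 / np.sqrt(fisher_information(theta0, model)))
    print(model, "Cramer-Rao bound on the standard deviation (deg):", bound)
```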
3.2 Optimum Linear Estimator (OLE). The simplest possible estimator is one that is linear in the activities of the neurons, A (Pouget et al., 1993):

θ̂OLE = w^T A.

A common choice for w is the weight vector minimizing the mean square distance between the estimate, θ̂OLE, and the true direction, θ:

w = arg min_w E[(θ − θ̂OLE)²].
One can think of the linear estimator as being the response of a single output unit with weights w. Note that this estimator is poorly adapted to the estimation of a periodic variable such as direction. In our simulations, we worked around 180 degrees, staying away from the discontinuity at 0 and 360 degrees.
OLE is known to be unbiased for a large number of units, that is, E[θ̂OLE | θ] = θ (Baldi & Heiligenberg, 1988). In this case, its variance given θ is:

E[(θ̂OLE − E[θ̂OLE])² | θ] = Σ_{i=1}^{N} wi² σi²,  (3.4)
where σi² = σn² for the normally distributed noise with fixed variance σn², and σi² = fi(θ) for the Poisson distributed noise.

3.3 Center of Mass (COM). This estimator is a one-dimensional version of the population vector used by Georgopoulos et al. (1982) (see also Zohary, 1992; Snippe, 1996). It is defined as:

θ̂COM = Σ_{i=1}^{N} θi (ai − γ) / Σ_{i=1}^{N} (ai − γ).

The mean spontaneous activity, γ (see equation 2.1), is subtracted from the activities ai to prevent systematic bias. Like OLE, COM handles the discontinuity between 0 and 360 degrees poorly. We obtained an approximation of the variance of the COM estimate using computer simulations. These estimates were computed for 201 values of direction, varied systematically from 170 to 190 degrees in increments of 0.1 degree. For each direction, the mean and variance of the estimate were calculated according to:

E[θ̂COM | θ] = (1/L) Σ_{l=1}^{L} θ̂COM^l

E[(θ̂COM − E[θ̂COM | θ])² | θ] = (1/(L − 1)) Σ_{l=1}^{L} (θ̂COM^l − E[θ̂COM | θ])².
We used L = 1000 trials in all simulations.

3.4 Complex Estimator (COMP). The complex estimator (also known as the population vector; Georgopoulos et al., 1982) is defined as the phase of the first Fourier component of the input pattern (Seung & Sompolinsky, 1993):

θ̂COMP = phase(z),

where

z = Σ_{j=1}^{N} aj e^{iθj}.
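Both estimators are equally compact in code; a sketch reusing the names introduced above (COM subtracts the spontaneous rate γ before averaging, COMP takes the phase of the first Fourier mode):

```python
import numpy as np

def com_estimate(A):
    """Center of mass of the preferred directions, weighted by a_i - gamma."""
    w = A - gamma
    return np.sum(theta_pref * w) / np.sum(w)

def comp_estimate(A):
    """Complex (population vector) estimator: phase of the first Fourier mode."""
    z = np.sum(A * np.exp(1j * theta_pref))
    return np.angle(z) % (2.0 * np.pi)

print(np.rad2deg(com_estimate(A)), np.rad2deg(comp_estimate(A)))
```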
This estimator is often said to be linear (see Seung & Sompolinsky, 1993; Salinas & Abbott, 1994), but it is important to realize that only z, and not θ̂COMP, is linear in A. Recovering the phase of a complex number is a nonlinear operation. This estimator is equivalent to an ML estimator only under the assumption that the data were generated according to a cosine tuning function with period 2π corrupted by gaussian noise of fixed variance (see Figure 2A). This estimator is guaranteed to be suboptimal if the noise is nongaussian or if the data are generated by any other function, in particular the one used in our simulations (see equation 2.1). We obtained an approximation of the variance of the estimator using computer simulations as described in the previous section.

4 Recurrent Networks

All the methods described so far recover a scalar estimate of direction. We now consider network architectures in which the estimate is kept in a coarse code format. These networks have an input and an output layer of 64 units each, fully connected from the input to the output layer (feedforward connections) and within the output layer (lateral connections), using periodic boundaries and identical weight matrices for the feedforward and lateral connections (see Figure 3A). We use the notation A = {ai} for the activity of the input units as specified in equation 2.1 and Ot = {oi,t} for the activities of the output units at time t. We consider only the case of a transient input; at time zero, we set the activity of the input units to {ai}, pass it through the feedforward connections, and then remove the input and let the activities of the output units evolve according to the dynamical equation for this layer. As we will show, an appropriate choice of the weights and the activation function can ensure that the activity in the output layer, which forms a recurrent network, will evolve toward a stable state corresponding to a hill-shaped pattern of activity (see Figure 3B, which shows the activity over time for the nonlinear network described below; Zhang, 1996). We can use the final position of the hill across the neuronal array after relaxation as a coarse code estimate of the direction, θ. In the next two sections, we explore the properties of this estimator for linear and nonlinear activation functions.

4.1 Linear Network. We first consider a network with linear activation functions in the output layer, whose dynamics is governed by the following difference equation:

Ot = ((1 − λ)I + λW) Ot−1,  (4.1)
Figure 3: (A) Two-layer network for estimation using coarse code. The first layer generates the noisy activity pattern {ai } according to the tuning function { fi (θ )}. The output layer is a recurrent network that generates a hill of activity corresponding to the tuning function {gi (θ)}. (B) Activity over time in the output layer with a nonlinear activation function in response to an initial small, random pattern of activity. The activity of the units is plotted as a function of their preferred direction of motion. (C) Pattern of weights in the nonlinear recurrent network as a function of the difference in preferred direction between units. (D) Activation function, h(x), of the nonlinear recurrent network.
where λ is a number between 0 and 1, I is the identity matrix, and W is the matrix for the lateral connections. The activity at time 0, O0 , is initialized to WA, where A is an input pattern (like the one shown in Figure 1C) and W is the feedforward weight matrix, which is set to be equal to the lateral weight matrix (hence, the same notation). The dynamics of such networks is well understood (Hirsch & Smale, 1974). If each unit receives the same weight vector—if all the rows of W, which we will denote w, are translated versions of one another—a Fourier transform of equation 4.1 (not in time but over the vectors Ot , Ot−1 , and w)
leads to:

Õt = ((1 − λ)I + λW̃) Õt−1 = QÕt−1,

where Õt and Õt−1 are the Fourier transforms of Ot and Ot−1, and W̃ is a diagonal matrix with the Fourier coefficients of w along the diagonal. Since O0 = WA, we obtain:

Õt = Q^t W̃Ã.

Consequently, the network dynamics amplifies or suppresses each Fourier component of the initial input pattern, A, independently, by a factor equal to the corresponding diagonal component of Q. For example, if the first diagonal term of Q is more than one (resp., less than one), the first Fourier component of the initial pattern of activity will be amplified (resp., suppressed). Thus, we can choose W such that the network selectively amplifies the first Fourier component of the data while suppressing the others. The network would be unstable, but if we stop after a large, yet fixed, number of iterations, the activity pattern would look like a cosine function of direction with a phase corresponding to the phase of the first Fourier component of the data. If we now use the position of the peak of the hill, which is the same as the phase of the cosine, as an estimate of direction, we end up with the same value as the one provided by the COMP method. A network for orientation selectivity proposed by Ben-Yishai, Bar-Or, and Sompolinsky (1995) is closely related to this linear network. Their network is actually nonlinear, but the nonlinearity simply acts as a gain control, which prevents activity from growing to infinity. Although such networks keep the estimate in a coarse code format, they suffer from two problems: it is unclear how they could be extended to nonperiodic variables, such as disparity, and they are suboptimal since they are equivalent to the COMP estimator.

4.2 Nonlinear Network. We consider next a network with nonlinear activation functions, in which the dynamics of the output units is governed by the following difference equations:

oi,t = h(ui,t) = (a log(1 + e^{b + c ui,t}))^d  (4.2)

ui,t = (1 − λ) ui,t−1 + λ Σ_{j=1}^{N} wij oj,t−1.  (4.3)

Using vector notation, we rewrite these equations as:

Ot = h(Ut) = (a log(1 + e^{b + c Ut}))^d  (4.4)

Ut = (1 − λ) Ut−1 + λWOt−1.  (4.5)
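A sketch of these dynamics follows. It is illustrative only: the activation parameters are the values quoted below in the text, but the circulant weight matrix is a hypothetical difference-of-gaussians profile standing in for the weights actually obtained by solving equation 4.6, so the stable hill it produces only approximates the one analyzed here. It reuses N, theta_pref, A, and comp_estimate from the earlier sketches.

```python
import numpy as np

a, b, c, d = 6.3, 5.0, 10.0, 0.8      # activation parameters quoted in the text

def h(u):
    """Activation function of equations 4.2 and 4.4 (log(1 + e^x) written stably)."""
    return (a * np.logaddexp(0.0, b + c * u)) ** d

# Hypothetical weights: local excitation with broad inhibition, a stand-in for
# the solution of g = h(w * g); see Figure 3C for the shape this imitates.
delta = np.angle(np.exp(1j * (theta_pref[:, None] - theta_pref[None, :])))
W = (np.exp(-delta ** 2 / 0.3) - 0.4 * np.exp(-delta ** 2 / 2.0)) / N

def relax(A, n_iter=10, lam=0.1):
    """Iterate equations 4.4-4.5 from U0 = W A and return the final pattern."""
    U = W @ A                          # feedforward pass
    O = h(U)
    for _ in range(n_iter):
        U = (1.0 - lam) * U + lam * (W @ O)
        O = h(U)
    return O

O_stable = relax(A)
theta_hat_rn = np.rad2deg(comp_estimate(O_stable))   # COMP readout of the hill
```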
As shown in Zhang (1996), the weights, W, can be set in such a way that a hill of activity of profile g(θ), centered at any location on the network, is a stable state (see Figure 3B). These kinds of networks are known as line attractor networks, because the set of all hills defines a one-dimensional continuous stable manifold in activity space. The rows of the weight matrix must be a translated version of the same vector, w, which is found by solving:

g(θ) = h(w ∗ g(θ)),  (4.6)
where g(θ) is the desired bell-shaped profile, ∗ is the convolution, and h(·) is the activation function (this equation involves continuous functions, but it can easily be discretized to deal with a finite number of units).¹ There is no analytical solution to this equation, but a close approximation can be obtained for a wide variety of bell-shaped profiles of activity and monotonic activation functions (Zhang, 1996). Thus, the shape of the stable hill is fully specified by the weights and activation function. By contrast, the final position of the hill on the neuronal array depends only on the initial input (Zhang, 1996). Therefore, like ML, the network fits an "expected" function—the stable hill—through the noisy input pattern, A. We will use the notation g(θ) to refer to this function and gi(θ) for the corresponding tuning curves of the output units (see Figure 2A).

¹ Strictly speaking, the weights that solve equation 4.6 in the discrete case lead to a network with N stable fixed points along the one-dimensional manifold, interspersed with N unstable fixed points, where N is the number of units. Therefore, the resulting network is not truly a line attractor network; the eigenvalue, λ, of the Jacobian along the manifold near the attracting fixed point is slightly less than 1. It can be shown, however, that 1 > λ > 1 − k/N², where k is a constant independent of N. Therefore, for large N, the dynamics of convergence along the manifold is so slow that it can be ignored for all practical purposes, which is what we do in the rest of the article.

For reasons that will become clear, we selected the lateral weights, W, to minimize the L2 distance between g(θ)—the function corresponding to the stable hill—and the function f(θ) (see equation 2.1) used to generate the activity patterns, A (see Zhang, 1996, for details about this procedure, which is based on regularization theory). The resulting weights are locally excitatory with long-range inhibition, a common pattern of connectivity in models of cortical circuitry (see Figure 3C). The resulting network can be used as an estimator by first initializing the input layer to a vector A, passing the activity through the feedforward connections (which amounts to setting U0 to WA), and iterating equations 4.4 and 4.5 until a stable hill of activity is obtained. The stable hill in the output layer can be treated as a population code for the estimated direction, θ̂RN (RN, recurrent network), and a scalar value can be obtained by computing the peak position. We computed the position of the peak using a COMP operator applied to the stable pattern of activity, O∞, although any unbiased estimator would have worked. Note that this step would not be required in the brain. We have added it only to allow comparison with the other estimators. Estimates of the bias and variance of the direction estimates were obtained with the same method as that used for the COMP estimator.

The activation function, h(·), used in equation 4.2, looks like a rectified linear function (see Figure 3D). It is close to zero for negative x and grows roughly linearly past a threshold. The parameters α, β, and γ in equation 2.1 were set, respectively, to 38, 7, and 3.8, and the parameters a, b, c, and d in equation 4.2 were set, respectively, to 6.3, 5, 10, and 0.8. All of these choices were motivated by the fact that the same parameters and function were used by Zhang (1996) in a previous study; our results do not depend critically on these particular choices. The standard deviation of the gaussian noise was set to σn = 5.8, which corresponds to a signal-to-noise ratio of 6 for the most active units. By comparison, the signal-to-noise ratio of the most active units when using Poisson noise was 6.5.

5 Simulation Results

Since the preferred directions of two consecutive units in the network are more than 5 degrees apart, we first wondered whether recurrent network (RN) estimates would exhibit a systematic bias—a difference between the mean estimate and the true direction—in particular for directions between the peaks of two consecutive units. Our simulations showed no significant bias for any of the directions tested (see Figure 4). This entails that, with only 64 units, the stable hill can settle in any position, in particular between the peaks of the tuning curves of two successive units. Next, we compared the standard deviations of the estimates for four estimators—OLE, COM, COMP, and ML—to the nonlinear RN. We did not simulate the linear network since it is equivalent to the COMP method. The standard deviations for ML and OLE were obtained using equations 3.2, 3.3, and 3.4. The RN method was found to outperform the OLE, COM, and COMP estimators in both cases and to match the Cramér-Rao bound for gaussian noise (see Figure 5). For noise with Poisson distribution, the standard deviation for RN was only 6.5% above the bound (see Figure 5B). To confirm that ML and RN are similar, we looked at the coefficient of correlation between the two estimates. We obtained a value of 0.98, indicating that the two estimates are almost identical on individual trials. We also estimated numerically −∂θ̂RN/∂ai |θ=170°, the partial derivative of the RN estimate with respect to the initial activity of each of the 64 units for a direction of 170 degrees.
Figure 4: The solid line shows the mean estimated direction as a function of the true direction for normally distributed noise of fixed variance. The estimator is unbiased, that is, the mean estimate is equal to the true direction. The upper and lower dotted lines are one standard deviation away from the mean.
Figure 5: Histogram of the standard deviations of the estimate for all five methods (OLE, optimal linear estimator; COM, center of mass; COMP, complex estimator; ML, maximum likelihood; RN, recurrent network). (A) Noise with normal distribution. (B) Noise with Poisson distribution. In both cases, the value for ML is the Cramér-Rao bound. The RN method reaches this bound for gaussian noise and performs slightly worse for Poisson noise.
This derivative in the case of ML matches closely the derivative of the cell tuning curve, fi′(θ). In other words, in ML, units contribute to the estimate according to the amplitude of the derivative of the tuning curve. As shown in Figure 6A, the same is true for RN; −∂θ̂RN/∂ai |θ=170° matches closely the derivative of the units' tuning curves. In contrast, the same derivatives for the COMP estimate (dotted line) or the COM estimate (dash-dotted line) do not match the profile of fi′(θ). In particular, units with preferred direction far away from 170 degrees—units whose
Figure 6: (A) Comparison of f′(θ) (solid line) and −∂θ̂/∂ai |θ=170° for RN, COMP, and COM. All functions have been normalized to one. (B) Standard deviation as a function of time, or number of iterations of the recurrent network. The point at t = −1 is the COMP estimator applied to the input activity, A, whereas the point at t = 0 corresponds to COMP applied to WA.
activity is just noise—end up contributing to the final estimate, hindering the performance of the estimator. Reaching a stable state can take many iterations, which could make the RN method too slow for any practical purpose. Consequently, we looked at the standard deviation of the RN as a function of time—that is, the number of iterations. We found that the convergence to ML is very fast. In fact, initializing U0 to WA and O0 to h(U0) is sufficient to obtain a standard deviation very close to the bound, and 5 to 10 iterations leads to the asymptotic values (see Figure 6B). The initialization is mathematically equivalent to one network iteration with the integration constant, λ, set to one (see equation 4.5). We can therefore conclude that there is no need to wait for a perfectly stable pattern of activity to obtain minimum standard deviation and that one network iteration is sufficient to obtain performance close to ML. So far, the input units (which determine the input patterns, A) and the network units had the same tuning curves: f(θ) = g(θ). Next, we explored the effect of varying the amplitude and the width of the input tuning curves while keeping the output tuning curves constant. A comparable situation for ML would be to fit the wrong tuning curve through the data. With ML, an error in the assumed amplitude of the bump would not affect performance (the minimum of the nonlinear regression step is unaffected; see Figure 2B), whereas a mismatch between the actual and assumed width results in suboptimal performance. Our simulations revealed that both parameters affect the performance of the network estimate (see Figures 7A and 7B). Large differences in amplitude or width lead to a standard deviation much larger than the Cramér-Rao
Figure 7: Standard deviation of the RN estimator in terms of percentage above the Cramér-Rao bound, as a function of the amplitude (A) and width (B) of the input bump. (A) The data are plotted as a function of the logarithm in base 2 of the ratio of input-output amplitudes. Changing the gain of the input by a factor of 2 or less affects the performance of the network only moderately. (B) The width of the tuning curve is computed at half the peak value. The network sharpens the input tuning curves for width values above 50 degrees and widens them for smaller values. The Cramér-Rao bound is reached only when the widths of the input and output tuning curves differ by less than 10 degrees.
bound. The curves, however, are both quadratic around the minimum, indicating that the network is fairly robust with respect to these kinds of errors. In particular, changing the amplitude by a factor of two has a minimal impact on the standard deviation (see Figure 7A). Nevertheless, unlike ML, performance eventually decreases with larger-amplitude changes. Finally, Figure 8A shows the covariance matrix of the input unit activities, ai , when presented repetitively with a direction of 170 degrees. Since the noise was chosen to be independent across units, only the diagonal terms of the covariance matrix—the variances of the individual units—differ from zero. Interestingly, the covariance of the network units after a stable pattern has been reached has a different structure (see Figure 8B). Units with similar direction preferences around 150 degrees are positively correlated while being negatively correlated with units whose direction preference is around 190 degrees, and vice versa. Furthermore, units with preferred directions away from the test direction (outside the interval 170◦ ±30◦ ) have a variance and covariance close to zero. Interestingly, these correlations do not reflect the similarity of the tuning curves for units with similar preferred directions. The similarity in tuning curves introduces similarities in the mean responses. By contrast, the covariance matrices plotted in Figure 8B show correlations in the fluctuations about these mean responses. Such correlations are often considered
Figure 8: Covariance matrices of the input units (A) and network units (B) for repetitive presentations of a direction of 170 degrees. Only the central part of the covariance matrix is shown (units with preferred directions between 84 and 270 degrees). Whereas the input units are independent, the output units are correlated due to the lateral connections. (C) Correlation of unit with preferred direction 135 degrees with all the other units as a function of the difference in preferred direction. The curve has the same profile as the derivative of the tuning curve. (D) Correlation between two units (preferred directions 158 and 182 degrees) as a function of stimulus direction.
damaging because they reduce the signal-to-noise ratio (Zohary, Shadlen, & Newsome, 1994). We see here, however, that they could be the unavoidable consequence of pooling the unit activities through the lateral connections to clean up the noise in an optimum way. At first, one might think that this pattern of correlation reflects the weights of the lateral connections; for example, units with similar preferred directions are positively correlated because they are positively interconnected. It turns out, however, that these correlations are the result of fitting a hill to the data. Indeed, the activity of a unit at the end of relaxation, oi,∞ , is dependent on the activities of all the other units in a way that is specified by the profile of the stable hill. Consequently, the correlation between pairs of units is determined by the product of their tuning curve derivatives, evaluated at the current direction (170 degrees in Figure 8B). Hence, when plotting the correlation of the units with preferred direction 135 degrees with all the
388
Alexandre Pouget, Kechen Zhang, Sophie Deneve, and Peter E. Latham
other units, the resulting curve has the same profile as the derivative of the tuning curve (compare Figure 8C and f′(θ) shown in Figure 6A). This property is not specific to the RN method but would also apply to any method involving fitting a hill, such as ML or COMP. The magnitude and sign of these correlations are therefore dependent on the stimulus direction. This is illustrated in Figure 8D, which shows the correlation between units with preferred directions of 158 and 182 degrees as a function of the stimulus direction. Notice that even though the weight of the connection between these two units is negative (−0.08), the correlation can be positive or negative depending on the stimulus direction. Whether such patterns of correlations exist in the cortex is unknown. Correlations between cells have been reported in area MT (Zohary et al., 1994), but there has been no attempt to relate these correlations to the tuning curve derivatives. It is unlikely, however, that real neurons will exhibit reversals in the correlation sign as large as the one illustrated in Figure 8. Relaxation in our network is a deterministic process, whereas, by contrast, additional noise would be introduced at each iteration if we were to model our units as Poisson processes, a more realistic assumption (Shadlen & Newsome, 1994). This extra noise is likely to lead to additional correlations whose form remains to be determined. Nevertheless, we would expect the correlation to change with the stimulus direction in a way consistent with what is illustrated in Figure 8D.

6 Analysis

Our simulations demonstrate that the recurrent network can provide a coarse code estimate of direction that is almost as efficient as the ML estimate. We now prove analytically that the network estimate is indeed close to the ML estimate for small gaussian noise; that is, it is unbiased and efficient. The proof relies on a linearization of the network dynamics around the stable manifold.

6.1 Notation. We start by rewriting the dynamics of the network as follows:

Ot = h(Ut) = (6.3 log(1 + e^{5 + 10Ut}))^{0.8}  (6.1)

Ut = (1 − λ)Ut−1 + λWOt−1  (6.2)
   = (1 − λ)Ut−1 + λWh(Ut−1)  (6.3)
   = e(Ut−1).  (6.4)
As we have done so far, we will use the notation fi (θ ) to refer to the function corresponding to the tuning curve of the input units with preferred direction θi —the mean activity in response to θ —and gi (θ ) the equivalent function for the output units.
In response to a direction θ0, the mean activity vector for the input units is given by {fi(θ0)}_{i=1}^{N}, and we will use boldface fonts, f(θ0), to refer to this vector. The same convention will be applied to all the other functions used in the proof. The functions f(θ0) and g(θ0) are defined with respect to the variable Ot. There exist two corresponding functions for the activity variable Ut, which we will denote f^u(θ0) and g^u(θ0), where f(θ0) = h(f^u(θ0)) and g(θ0) = h(g^u(θ0)), h(·) being the network activation function. f(θ0), g(θ0), f^u(θ0), and g^u(θ0) refer to the corresponding vectors of activity. In the simulations, we initialized U0 to WA and O0 to h(U0) to simulate the propagation of activity through the feedforward connections. To simplify notations in the proof, we will consider instead that O0 is initialized to A and U0 to h⁻¹(A). This modification does not affect the proof because the initialization used in the simulations is equivalent to one iteration of the output network with the integration constant, λ, set to 1, and it turns out that the eigenvectors of the Jacobian for the output network are independent of λ. We will look at the case where A, and therefore O0, is distributed according to a normal distribution N(⟨A⟩, Σ0) with ⟨A⟩ = f(θ0) and Σ0 diagonal with all the diagonal terms equal to σn².

6.2 Linearization. We consider the case in which the functions f and g (and f^u and g^u) are identical. Since A is a random variable, we can think of this system as being a random process that generates a temporal sequence of random variables, {O0, O1, . . . , Ot, . . . , O∞}, where O0 = A, and {U0, U1, . . . , Ut, . . . , U∞}. We first note that our network is globally stable since the dynamics minimizes a Lyapunov function of the form (Cohen & Grossberg, 1983):

L = −(1/2) Σ_{i,j} wij h(ui) h(uj) + Σ_i ∫_0^{ui} z h′(z) dz.

Since the weights were chosen to solve g = h(Wg), we know that a hill of profile g(θ0), peaking at any location of the neuronal array, is a fixed point. In terms of the variable Ut the stable activity profile is given by the function g^u(θ0). Since we consider the case where f^u(θ0) = g^u(θ0), f^u(θ0) is a stable state. Moreover, for small enough noise, most initial patterns, U0, are less than ε away from the stable manifold; that is, the Euclidean distance between U0 and the nearest point on the manifold is less than ε, where ε is a small number. Consequently, we can study the behavior of our network by linearizing equation 6.4 around ⟨U0⟩ = g^u(θ0). Let J^T be the Jacobian of the function e(·) (see equation 6.4) evaluated at ⟨U0⟩ (we use J^T instead of J to simplify notation later on):

Ut = e(Ut−1)  (6.5)
   ≈ e(⟨U0⟩) + J^T (Ut−1 − ⟨U0⟩).  (6.6)
Combining equation 6.6 and the fact that e(⟨U0⟩) = ⟨U0⟩ (the mean ⟨U0⟩ is a stable state), we find that:

Ũt ≈ J^T Ũt−1,  (6.7)

where Ũt = Ut − ⟨U0⟩. The transpose of the Jacobian, J^T, is of the form:

J^T = (1 − λ)I + λWH′,  (6.8)

where H′ is a diagonal matrix whose diagonal terms are equal to h′(g^u_i(θ0)). We can obtain a similar linear equation for the variable Õt. Indeed, linearizing equation 6.1 yields:

Õt ≈ H′Ũt.

If we substitute equation 6.8 in equation 6.7 and multiply both sides by H′, we obtain:

H′Ũt ≈ H′((1 − λ)I + λWH′)Ũt−1
Õt ≈ (1 − λ)Õt−1 + λH′WÕt−1.

Since H′ is diagonal and W is symmetric, H′W = (WH′)^T, which entails:

Õt ≈ JÕt−1.

Therefore, the Jacobian for Õt is J. Iterating this equation leads to:

Õt ≈ J^t Õ0.

Õ0 is distributed according to N(0, Σ0), where 0 is a vector of N zeroes. Since Õt is related to Õ0 by a linear relationship, Õt is distributed according to N(0, Σt), where:

Σt = J^t Σ0 (J^t)^T.

Let us define

J∞ = lim_{t→∞} J^t

Õ∞ = J∞ Õ0.

The existence of a bounded Lyapunov function ensures that all the eigenvalues of J are less than or equal to one, and therefore J∞ exists. At equilibrium, we have:

Σ∞ = J∞ Σ0 J∞^T.  (6.9)
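The linearization can also be probed numerically. Under the same hypothetical weights as in the earlier network sketch (so the checks are only approximate), one can build J = (1 − λ)I + λH′W at the relaxed state (the transpose of equation 6.8), raise it to a high power to approximate J∞, and test the idempotence and near-rank-1 structure derived below:

```python
import numpy as np

def h_prime(u, eps=1e-6):
    """Numerical derivative h'(u) of the activation function."""
    return (h(u + eps) - h(u - eps)) / (2.0 * eps)

lam = 0.1
U_stable = W @ O_stable               # at a fixed point, U = W h(U) = W O
Hp = np.diag(h_prime(U_stable))
J = (1.0 - lam) * np.eye(N) + lam * Hp @ W   # Jacobian for O-tilde

J_inf = np.linalg.matrix_power(J, 2000)      # J^t for large t approximates J-infinity
print("idempotent:", np.allclose(J_inf @ J_inf, J_inf, atol=1e-3))
print("approximate rank:", np.linalg.matrix_rank(J_inf, tol=1e-3))
```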
Figure 9: During relaxation, the initial activity A = O0 is projected onto the tangent, g′(θ0), of the stable manifold defined by the function g(θ), along directions orthogonal to g^u′(θ0). As a result, the initial distribution of activity (shown as a density plot indicated by the gray circles) is collapsed onto the axis defined by g′(θ0).
6.3 Characterizing the Transformation J∞. We now show that J∞ is a projection on a line pointing in the direction of g′(θ0)—the derivative of g with respect to θ evaluated at θ0—along the directions orthogonal to g^u′(θ0) (see Figure 9).

6.3.1 Projection onto g′(θ0). First, we note that:

J∞ Õ∞ = Õ∞,

and furthermore:

J∞ J∞ Õ0 = J∞ Õ∞ = Õ∞ = J∞ Õ0.

This is true for arbitrary Õ0; thus:

J∞ J∞ = J∞,

which is the definition of a projection. Therefore, J∞ is a projection onto the subspace spanned by Õ∞. Next, we show that this subspace is a line pointing in the direction of g′(θ0). The projecting space is spanned by the vectors, Õ, which are solutions to:

J∞ Õ = Õ.  (6.10)

The activity patterns that satisfy equation 6.10 correspond to stable states. Therefore Õ = Õ∞ = O∞ − ⟨O0⟩, where O∞ and ⟨O0⟩ are of the form:

O∞ = g(θ0 + δθ)
⟨O0⟩ = g(θ0);

hence,

Õ = O∞ − ⟨O0⟩  (6.11)
  = g(θ0 + δθ) − g(θ0)  (6.12)
  ≈ δθ g′(θ0).  (6.13)

Therefore, J∞ is a projection onto g′(θ0). A similar analysis would show that J∞^T is a projection onto g^u′(θ0).

6.3.2 Projection Along the Directions Orthogonal to g^u′(θ0). To demonstrate that the projection is along the directions orthogonal to g^u′(θ0), we need to show that for any vector, g^u′(θ0)^⊥, orthogonal to g^u′(θ0) (i.e., g^u′(θ0)^⊥T g^u′(θ0) = 0), we have J∞ g^u′(θ0)^⊥ = 0. We start from the fact that J∞ is a projection onto g′(θ0) and therefore:

J∞ g^u′(θ0)^⊥ = α g′(θ0)
g^u′(θ0)^T J∞ g^u′(θ0)^⊥ = α g^u′(θ0)^T g′(θ0)
(J∞^T g^u′(θ0))^T g^u′(θ0)^⊥ = α g^u′(θ0)^T g′(θ0)
g^u′(θ0)^T g^u′(θ0)^⊥ = α g^u′(θ0)^T g′(θ0)
0 = α g^u′(θ0)^T g′(θ0).

Since, in general (and in our simulations), g^u′(θ0) and g′(θ0) are not orthogonal, we can conclude that:

α = 0.

In other words, any vector orthogonal to g^u′(θ0) is an eigenvector of J∞ whose eigenvalue is zero. Therefore, J∞ is a projection on g′(θ0) along the directions orthogonal to g^u′(θ0) (see Figure 9A). Next, we show that the resulting estimator is unbiased and has a variance close to the Cramér-Rao bound.
6.4 Properties of the Network Estimate.

6.4.1 Unbiased Estimator. Since Õ∞ is distributed according to N(0, Σ∞), we have ⟨O∞⟩ = ⟨O0⟩ = g(θ0), and ⟨O∞⟩ = f(θ0) when the functions f and g are identical. This entails that the final activity, O∞, is an unbiased estimate of the initial activity O0. The network estimate θ̂RN is obtained by applying a complex estimator to O∞. The complex estimator is an unbiased estimate of direction when applied to O0. Since O∞ is an unbiased estimate of O0, the complex estimator applied to O∞, that is, θ̂RN, is unbiased.

6.4.2 Variance of the Network Estimate. Let σCR² be the variance corresponding to the Cramér-Rao bound. If the activity of the units, oi,0, is independent and normally distributed according to N(fi(θ0), σn²), we have (from equation 3.2):

σCR² = σn² / ‖f′(θ0)‖².

We now show that the variance of the network estimate, σ²_θ̂RN, is close to σCR². At the end of relaxation, all the patterns, Õ∞, are confined to the axis defined by g′(θ0). Therefore, the covariance matrix is of the form:

Σ∞ = σ∞² g′(θ0)g′(θ0)^T / ‖g′(θ0)‖²,  (6.14)

where σ∞² is the variance of the norm of Õ∞ along the axis g′(θ0). Different patterns correspond to the stable hill placed at different locations. Using equation 6.13, we can now show that σ∞² is related to the variance of the network estimate, σ²_θ̂RN, through the following relationship:

σ∞² = ⟨‖O∞ − ⟨O∞⟩‖²⟩ ≈ ‖g′(θ0)‖² σ²_θ̂RN.

Therefore:

σ²_θ̂RN ≈ σ∞² / ‖g′(θ0)‖².  (6.15)

Combining equations 6.9, 6.14, and 6.15, we get:

σ²_θ̂RN g′(θ0)g′(θ0)^T ≈ J∞ Σ0 J∞^T.
We now multiply both sides of this equation by g^u′(θ0)^T on the left and g^u′(θ0) on the right:

σ²_θ̂RN g^u′(θ0)^T g′(θ0)g′(θ0)^T g^u′(θ0) ≈ g^u′(θ0)^T J∞ Σ0 J∞^T g^u′(θ0).

Since J∞^T g^u′(θ0) = g^u′(θ0) (from the fact that g^u′(θ0) is stable):

σ²_θ̂RN (g′(θ0)^T g^u′(θ0))² ≈ g^u′(θ0)^T Σ0 g^u′(θ0)

σ²_θ̂RN ≈ g^u′(θ0)^T Σ0 g^u′(θ0) / (g′(θ0)^T g^u′(θ0))².

If f = g and Σ0 = σn² I:

σ²_θ̂RN ≈ σn² ‖f^u′(θ0)‖² / (f^u′(θ0)^T f′(θ0))²
        ≈ σn² ‖f^u′(θ0)‖² / (‖f^u′(θ0)‖² ‖f′(θ0)‖² cos²µ)
        ≈ σn² / (‖f′(θ0)‖² cos²µ),

where µ is the angle between the vectors f′(θ0) and f^u′(θ0). Therefore, σ²_θ̂RN differs from σCR² by a factor inversely proportional to cos²µ when f = g and Σ0 = σn² I. In general, the angle µ will be small if the activation function, h, is close to linear within the network's dynamical range. With the tuning curves and activation function we used, the cos²µ term makes σ²_θ̂RN 2% larger than σCR². Given the small influence of this term, we will ignore it in the rest of the article. This amounts to treating J∞ as an orthogonal projection onto g′(θ0). Projecting the initial activity orthogonally onto g′(θ0) amounts to finding the stable state that minimizes the square distance with O0. In the presence of independent gaussian noise of equal variance, the ML estimate is also the peak position of the stable state that minimizes the square distance with the initial activity.

6.5 Nonoptimal Cases.

6.5.1 Nonequal Variance. For arbitrary gaussian noise with covariance matrix Σ0, the ML estimate is the direction that minimizes the Mahalanobis distance between O0 and f(θ):

θML = arg min_θ (O0 − f(θ))^T Σ0⁻¹ (O0 − f(θ)).
Since our network minimizes the square distance, it will be suboptimal whenever this Mahalanobis distance differs from the Euclidean distance. This is the case, in particular, when some neurons are noisier than others, that is, when the variance of the noise is not the same for all units.

6.5.2 Correlations. In general, our method is also suboptimal when the activity of the units is correlated. However, it is still optimum for certain types of correlations. For gaussian noise with arbitrary covariance matrices, Σ0, the variance of the Cramér-Rao bound (obtained from equation 3.1) and the variance of the network estimate (ignoring the difference between f′(θ0) and f^u′(θ0), and under the assumption that f = g) are given by:

σ²_θ̂RN ≈ f′(θ0)^T Σ0 f′(θ0) / (f′(θ0)^T f′(θ0))²

σCR² = 1 / (f′(θ0)^T Σ0⁻¹ f′(θ0)).
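This eigenvector condition is easy to verify numerically. A sketch, assuming tuning() and sigma_n from the earlier sketches, that builds a covariance matrix for which f′(θ0) is an eigenvector and confirms that the two variances coincide:

```python
import numpy as np

def f_prime_vec(theta, dtheta=1e-4):
    return (tuning(theta + dtheta) - tuning(theta - dtheta)) / (2.0 * dtheta)

fp = f_prime_vec(np.deg2rad(170.0))
u = fp / np.linalg.norm(fp)
Sigma0 = sigma_n ** 2 * np.eye(N) + 3.0 * np.outer(u, u)   # f' is an eigenvector

var_rn = fp @ Sigma0 @ fp / (fp @ fp) ** 2                 # network variance
var_cr = 1.0 / (fp @ np.linalg.solve(Sigma0, fp))          # Cramer-Rao bound
print(np.isclose(var_rn, var_cr))                          # True for this Sigma0
```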
These two quantities are equal if and only if f′(θ0) is an eigenvector of Σ0. This is the case in particular for the covariance matrix of the stable state, Σ∞. Indeed, all the variance in this case is along the axis defined by f′(θ0). It would be easy to show that this is also the case for any of the covariance matrices Σt. In other words, covariance introduced by iterating the network does not affect performance, which is precisely why we reach the Cramér-Rao bound at the end of relaxation.

6.5.3 Large Noise. The size of the domain in which our linear approximation works depends on the amplitude of the second and higher derivatives of the activation function, h. The activation function we have used is flat for negative inputs and rises almost linearly after a threshold (see Figure 3D). Except for the fast transition from flat to linear rise, the high-order derivatives are all small. This predicts that the network should be able to handle a fairly large amount of noise and still provide optimal performance. Another factor allows the network to be robust with respect to noise. In our simulations, U0 is initialized to WA. This first linear averaging step increases the signal-to-noise ratio by a factor proportional to √N, where N is the number of input units (since wij ∼ 1/N; Zhang, 1996). Therefore, in the simulations, the size of the domain in which our approximation applies is proportional to ε for U0 but √N ε for A. Our simulations confirm that our network can indeed handle a fairly large amount of noise without a significant decrease in performance. Hence, we have found that a signal-to-noise ratio (the ratio α/σn; see equation 2.1) as low as 3 leads to a standard deviation within 5% of the Cramér-Rao bound.
6.5.4 Nongaussian Distributions. Nongaussian noise distributions are a problem only in the first one or two iterations. The central limit theorem states that the average of a large number of random variables converges to a normal distribution. Since U₀ is initialized to WA, U₀ will be normally distributed in the limit of a large number of units. Even if U₀ is not close to a normal distribution, U₁ or U₂ will be, since the averaging process is repeated on each iteration. How much information will be lost in the first iterations cannot be determined in general and depends on the noise distribution. In the case of a Poisson distribution, the convergence to a normal distribution is likely to be fast, since such a distribution is similar to a normal distribution with variance equal to the mean. Our network is no longer optimal, but our simulations confirm that performance remains close to maximum likelihood.

6.5.5 Different Input and Output Functions. When the input and output functions, f and g, differ, the performance of the network is difficult to predict in the general case. For small differences, however, the linear approximation leads to:

$$\tilde{\mathbf{O}}_\infty = \mathbf{J}_\infty(\mathbf{O}_0 - \langle \mathbf{O}_0\rangle) = \mathbf{J}_\infty(\mathbf{O}_0 - \mathbf{f}(\theta_0)) = \mathbf{J}_\infty(\mathbf{O}_0 - \mathbf{f}(\theta_0) + \mathbf{g}(\theta_0) - \mathbf{g}(\theta_0))$$
$$= \mathbf{J}_\infty(\mathbf{O}_0 - \mathbf{g}(\theta_0)) - \mathbf{J}_\infty(\mathbf{g}(\theta_0) - \mathbf{f}(\theta_0)).$$

As long as g(θ₀) − f(θ₀) is orthogonal to g̃′(θ₀), J∞(g(θ₀) − f(θ₀)) = 0, and the network behaves as if f and g were identical. This is the case, in particular, when f and g differ in their width or amplitude. Indeed, f and g are even functions, whereas g̃′(θ₀) is an odd function. This explains why performance is minimally affected by such changes, as shown in the simulations (see Figure 7).
Three-layer networks are required when more than one reference direction is used (Pouget & Thorpe, 1991; Mato & Sompolinsky, 1996), and mixture-of-experts architectures can work for any arbitrary direction, but a large number of hidden units and a gating network are necessary for optimal performance (Mato & Sompolinsky, 1996). Our network provides an alternative method that does not require a dedicated hidden layer. The iterative process converges onto a linear operator J∞, which is the optimal linear operator for the reference direction θ₀.

7 Discussion

Our results demonstrate that it is possible to perform efficient, unbiased estimation with coarse coding using a neurally plausible architecture. This shows that one of the advantages of coarse codes is to provide a representation that simplifies the problem of cleaning up uncorrelated noise within a neuronal population. Our model relies on lateral connections to implement a prior expectation on the profile of the activity patterns. As a consequence, units determine their activation according to their own input and the activity of their neighbors. When the noise is small enough, this lateral pooling results in a near-orthogonal projection of the initial activity onto the tangent to the stable manifold; the stable hill is the one minimizing the square distance with the initial activity. Consequently, the network is very close to ML when the noise is normally distributed with equal variance in each unit.

Cleaning up noise efficiently does not entail that the lateral connections increase the signal-to-noise ratio, or the information content, of the representation. It is a well-known result in information theory that data processing cannot increase information content (Cover & Thomas, 1991; this result holds for Shannon information, but the generalization to Fisher information is straightforward). The fact that we are within 2% of the Cramér-Rao bound when applying a complex estimator to the stable hill of the network demonstrates that our procedure preserves Fisher information almost completely. Our network, however, does not simply preserve Fisher information; it also changes the format of the information to make it easily decodable. Whereas ML is the only way to decode the input pattern efficiently, a complex estimator, or even a linear estimator, is sufficient to decode the stable hill while reaching the Cramér-Rao bound (see Figure 10). One can therefore think of the relaxation of activity in the nonlinear recurrent network in two ways: as a clean-up mechanism or as a processing step that makes information easier to decode.

If spike trains are the result of a Poisson process (Shadlen & Newsome, 1994), cleaning up noise efficiently is a critical problem for the cortex. As information is transmitted from one area to the next, noise increases, leading to wider and wider activity distributions. Eventually activities are bound to fall outside the neurons' bandwidth, resulting in information loss.
[Figure 10 appears here: three panels of activity versus preferred direction (deg). (A) Recurrent network (θ_RN) followed by a complex estimator: I(θ)₃ = I(θ)₁. (B) Maximum likelihood estimator (θ_ML): I(θ)₂ = I(θ)₁. (C) Complex estimator (θ_COMP) applied to the initial input: I(θ)₂ < I(θ)₁.]
Figure 10: The COMP estimate preserves Fisher information, I(θ)1 , when applied to the stable hill of the recurrent network (A)—as ML does (B)—but not when applied to the initial input (C). Therefore, the network dynamics changes the format of information such that a simple estimator can read out the activity optimally. I(θ )1 , I(θ )2 , and I(θ )3 refer to the Fisher information about direction at various stages in the estimation process.
Our procedure can prevent this problem by keeping the activities within a limited bandwidth while preserving the information content.

Unlike OLE, COM, and COMP, the RN estimate is not the result of a process in which units vote for their preferred direction, θᵢ. Instead, units contribute according to the derivatives of their tuning curves, fᵢ′(θ), as in the case of ML. This feature allows the network to ignore background noise, that is, responses due to factors other than the variable of interest. This property also predicts that discrimination of directions around the vertical (90 degrees) would be most affected by shutting off the units tuned to 60 and 120 degrees (assuming that the half-width of the tuning curves is around 30 degrees). This prediction is consistent with psychophysical experiments showing that discrimination around the vertical in humans is affected by prior adaptation to orientations displaced from the vertical by ±30 degrees (Regan & Beverley, 1985).

As we have shown, the cleaning-up process is optimal only if the output and input units have the same tuning curves. It is worth mentioning that learning the weights of the lateral connections with a simple delta rule, a biologically plausible rule, would actually lead to an output pattern matching the input (Zhang, 1996). It is therefore possible that the match occurs
naturally in the cortex as the result of a self-organizing process. The fact that optimal performance is obtained for matched input and output tuning curves has some interesting implications for orientation selectivity and the role of lateral connections in cortical processing generally. It argues that the pooled input to cortical neurons should have the same mean tuning as the output of the cells, a proposal in line with Hubel and Wiesel's (1962) model of orientation selectivity and recent experimental data by Ferster, Chung, and Wheat (1996). By contrast, several groups have proposed that lateral connections are used to sharpen tuning curves (Sillito, 1975; Heggelund, 1981; Wehmeier, Dong, Koch, & Van Essen, 1989; Wörgötter & Koch, 1991; Somers, Nelson, & Sur, 1995). Our work suggests that this sharpening process can only degrade the representation and that the role of lateral connections may be better described in terms of cleaning up noise, or changing the format of information, rather than sharpening tuning curves (Pouget & Zhang, 1996).

These considerations must be tempered by the fact that our attractor network is a poor model of cortical circuitry in V1. The model is neurally plausible in the same way Hopfield networks are: its style of computation and the representations used are similar to those used in the cortex. Several aspects of the model, however, are clearly implausible. V1 circuits are not stable in the awake state (V1 neurons do not keep firing when the stimulus is extinguished), and inputs are typically not transient. We believe, however, that the required modifications will not affect these conclusions, and we intend to explore this issue further.

Our approach can be readily extended to any other periodic sensory or motor variable. For nonperiodic variables, such as the disparity of a line in an image, our network needs to be adapted, since it currently relies on circularly symmetric weights. Simply unfolding the network will be sufficient to deal with values around the center of the interval under consideration, but more work is needed to deal with boundary values. We can also generalize this work to arbitrary mappings between two coarse codes for variables x and y, where y is a function of x. Indeed, a coarse code for x provides a set of radial basis functions of x that can be used subsequently to approximate arbitrary functions. It is even conceivable that a similar approach can be used for one-to-many mappings, a common situation in vision and robotics, by adapting our network so that several hills can coexist simultaneously. We are currently exploring such architectures.

Acknowledgments

This research was supported in part by a training grant from the McDonnell-Pew Center for Cognitive Neuroscience and a Department of Defense grant (DAMD17-93-V-3018) (A. P.). We thank Peter Dayan, Laurenz Wiskott, Terry Sanger, Richard Zemel, and an anonymous reviewer for their valuable comments and insightful suggestions.
References

Anderson, C. H. (1994). Basic elements of biological computational systems. International Journal of Modern Physics C, 5(2), 135–137.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59(4–5), 313–318.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proceedings of the National Academy of Sciences USA, 92, 3844–3848.
Cohen, M., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815–826.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Desimone, R., Schein, S. J., Moran, J., & Ungerleider, L. G. (1985). Contour, color and shape analysis beyond the striate cortex. Vision Research, 25(3), 441–452.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. Journal of Neuroscience, 2(11), 1527–1537.
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley.
Heggelund, P. (1981). Receptive field organization of simple cells in cat striate cortex. Experimental Brain Research, 42, 89–98.
Hinton, G. E. (1992). How neural networks learn from experience. Scientific American, 267(3), 145–151.
Hirsch, M., & Smale, S. (1974). Differential equations, dynamical systems and linear algebra. New York: Academic Press.
Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology (London), 160, 106–154.
Lehky, S. R., & Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. Journal of Neuroscience, 10(7), 2281–2299.
Mato, G., & Sompolinsky, H. (1996). Neural network models of perceptual learning of angle discrimination. Neural Computation, 8, 270–299.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. Journal of Neurophysiology, 49(5), 1127–1147.
Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Paradiso, M. A. (1988). A theory of the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., Fisher, S. A., & Sejnowski, T. J. (1993). Egocentric spatial representation in early vision. Journal of Cognitive Neuroscience, 5, 150–161.
Pouget, A., & Thorpe, S. J. (1991). Connectionist models of orientation identification. Connection Science, 3(2), 127–142.
Pouget, A., & Zhang, K. (1996). A statistical perspective on orientation selectivity in primary visual cortex. In Society for Neuroscience Abstracts, vol. 22.
Regan, D. M., & Beverley, K. I. (1985). Post-adaptation orientation discrimination. Journal of the Optical Society of America A, 2, 147–155.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rate. Journal of Computational Neuroscience, 1, 89–108.
Seung, H. S., & Sompolinsky, H. (1993). Simple model for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Sillito, A. M. (1975). The contribution of inhibitory mechanisms to the receptive field properties of neurones in the striate cortex of the cat. Journal of Physiology (London), 250, 305–329.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8(3), 511–530.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. Journal of Neuroscience, 15(8), 5448–5465.
Wehmeier, U., Dong, D., Koch, C., & Van Essen, D. (1989). Modelling the visual system. In C. Koch & I. Segev (Eds.), Methods in neural modelling (pp. 335–359). Cambridge, MA: MIT Press.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. Journal of Neuroscience, 11, 1959–1979.
Zemel, R. S., Dayan, P., & Pouget, A. (1997). Population code representations of probability density functions. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. Journal of Neuroscience, 16(6), 2112–2126.
Zohary, E. (1992). Population coding of visual stimuli by cortical neurons tuned to more than one dimension. Biological Cybernetics, 66, 265–272.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implication for psychophysical performance. Nature, 370, 140–143.

Received September 23, 1996; accepted June 20, 1997.
Communicated by Terrence Sanger
Probabilistic Interpretation of Population Codes Richard S. Zemel Departments of Psychology and Computer Science, University of Arizona, Tucson, AZ 85721, U.S.A.
Peter Dayan Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A.
Alexandre Pouget Georgetown Institute for Cognitive and Computational Sciences, Georgetown University, Washington, DC 20007-2197, U.S.A.
We present a general encoding-decoding framework for interpreting the activity of a population of units. A standard population code interpretation method, the Poisson model, starts from a description as to how a single value of an underlying quantity can generate the activities of each unit in the population. In casting it in the encoding-decoding framework, we find that this model is too restrictive to describe fully the activities of units in population codes in higher processing areas, such as the medial temporal area. Under a more powerful model, the population activity can convey information not only about a single value of some quantity but also about its whole distribution, including its variance, and perhaps even the certainty the system has in the actual presence in the world of the entity generating this quantity. We propose a novel method for forming such probabilistic interpretations of population codes and compare it to the existing method. 1 Introduction Population codes, where information is represented in the activities of whole populations of units, are ubiquitous in the brain. There has been substantial work on how animals should or actually do extract information about the underlying encoded quantity (Georgopoulos, Schwartz, & Kettner, 1986; Baldi & Heiligenberg, 1988; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994; 1995; Snippe, 1996; Sanger, 1996). With the exception of Anderson (1994), most of this work has concentrated on the case of extracting a single value for this quantity. In this article, we study ways of characterizing the joint activity of a population as coding a whole probability distribution over the underlying quantity. We will use two main motivating examples throughout this article, both Neural Computation 10, 403–430 (1998)
© 1998 Massachusetts Institute of Technology
of them well-studied, classic examples of population codes. The first is place cells in the hippocampus of freely moving rats (O'Keefe & Dostrovsky, 1971), which tend to fire when the animal is at a particular part of an environment. The second example is that of motion-selective cells in the medial temporal (MT) area of monkeys that report aspects of the motion in a stochastic stimulus made up of dots moving in various directions (Newsome, Britten, & Movshon, 1989). MT cells are selective for particular directions of motion and are well activated by such random displays provided that some of the dots are moving in the directions they prefer.

In these cases, treating the activity of such populations of cells as reporting a single value of the variables they code (e.g., direction of motion) is inadequate. Instead, these are examples of two general situations in which the population must be interpreted as coding probability distributions over these variables:

1. Insufficient information exists to define a single value with certainty. The rat may not have enough information from visual or vestibular cues to know exactly where it is. This article discusses how to make statistical sense of the obvious notion that if the animal is confused about whether it is in place x₁ or place x₂, then the place cells that prefer both places should be activated.1 Hinton (1992) pointed out that one should be able to use patterns of activity over such population codes to report not only a single place in the world, but also variance and uncertainty about that place and other aspects of a probability distribution.

2. Multiple values underlie the input. The monkey may have to report the direction of coherent motion embedded in a field of random motion noise. MT cells cannot be characterized as only reporting on the direction of coherent motion, since cells that prefer directions opposite to this are activated by the noise. The population must therefore be characterized as reporting something about the entire distribution of inputs.

In this article, we first provide a general statistical framework that can be used to understand how the activity of a population of neurons can be considered as encoding information about the world and, concomitantly, the way that this information can be decoded from the activity. We illustrate the framework by casting the standard model for population codes in its terms, use it to show why this model is inadequate for representing probability distributions even though it does contain some distributional information, and describe an existing model for probabilistic interpretation (Anderson & Van Essen, 1994; Anderson, 1994) in terms of the framework.

1 As far as we are aware, there are as yet no data on how place cells actually fire in such ambiguous cases.
In section 2 we define the encoding and decoding framework and examine existing models for population codes in its terms. In section 3 we propose a novel method for encoding and decoding probability distributions from population codes. In section 4 we present empirical comparisons between the alternative methods.

2 Population Code Interpretations

2.1 The Encoding-Decoding Framework. The starting point for almost all work on neural population codes is the neurophysiological finding that many neurons respond to a particular variable underlying a stimulus (such as the orientation of a visually presented bar) according to a unimodal tuning function. This function is read out as the mean firing rate of the cell and is often reasonably well characterized as a gaussian. The value of the underlying variable at which the peak of the tuning function occurs (the mean of the gaussian) is called the preferred value for the cell. This form of response characterizes not only cells near the sensory periphery, but also cells that report the results of more complex processing, including cells receiving information from groups of cells that themselves have these tuning properties (in MT, for instance).

A major caveat with almost all work on population codes, including that in this article, is that the responses of cells vary with many quantities in the world other than the particular one that is usually studied. For example, many MT cells are also selective for disparity and can be affected by spatial frequency, making it difficult to ascribe weak firing to nonpreferred motion, or an incorrect disparity, or something else. In our theory, we assume that we know the (collection of) variables with respect to which a cell's response is systematically determined; all other variables are treated as noise, so if the response depends on some unknown variable, interpretation will be hampered.

By definition, for each of these populations, the activities of the cells can be interpreted as conveying information about some underlying low-dimensional space. Interpreting population codes thus involves two spaces. First, there is an explicit space that consists of the activities r = {rᵢ} of the cells in the population. Second, these activities are described in terms of an implicit space (Zemel & Hinton, 1995), which contains the small number of underlying dimensions (e.g., direction in the MT example above) that the activities of the cells encode and in which they are described as having tuning functions. The implicit space plays no explicit role, even though any processing on the basis of the activities r has to be referred to the implicit space. For instance, if the activities across the population are combined in some particular way (as in generating a single value), then the implicit space can be used to work out how much information is lost, and consequently to work out the best method of combination.

This notion of explicit and implicit spaces underlies our framework, which is depicted in Figure 1 in the context of a single experiment. At the top are the measured activities of a population of cells. There are two key
[Figure 1 appears here: a schematic of the encoding operations (A: x → f(x); B: sampling r from P[r | x*]) and the decoding operations (C: forming P[x | r]; D: extracting an estimate of x).]
Figure 1: (Left) Standard models of population codes, such as the Poisson model, assume an underlying encoding model such as the one illustrated here. The output of the encoding process is shown at the top: the activities of units, corresponding to the explicit, observable representation. These are assumed to have been generated by the independent application of each cell’s tuning function and additive noise to the implicit representation shown at the bottom, which in this case is a single value x∗ in the space of underlying variables. (Right) Bayesian decoding models describe how to extract P[x | r] from the observed activities, through knowledge of the tuning functions f (x). An estimate of the true value of x can then be formed according to some loss function. Uppercase letters label operations.
questions to ask about this firing:

1. What is the relationship between the activities r of the cells and the underlying quantity x in the world that is represented? (encoding)

2. What information about the quantity x can be extracted from the activities? (decoding)

Although it is of active interest (Pouget, Zhang, Deneve, & Latham, 1998), we do not consider constraints that come from the neural instantiation of the decoding algorithms and pose the decoding question as an abstract
problem. Since neurons are generally noisy, it is often convenient to characterize encoding (see Figure 1, operations A and B) in a probabilistic way, by specifying:

$$P[\mathbf{r} \mid x]. \qquad (2.1)$$
The simplest models make a further assumption of conditional independence of the different units given the underlying quantity,

$$P[\mathbf{r} \mid x] = \prod_i P[r_i \mid x], \qquad (2.2)$$
although others characterize the degree of correlation between the units (Zohary, Shadlen, & Newsome, 1994). If the encoding model in equation 2.1 is true, then a Bayesian decoding model specifies that the information that r carries about x can be characterized precisely as

$$P[x \mid \mathbf{r}] \propto P[\mathbf{r} \mid x]\, P[x], \qquad (2.3)$$

where P[x] is the prior distribution over x and the constant of proportionality is set so that

$$\int_x P[x \mid \mathbf{r}]\, dx = 1.$$
Note that starting with a deterministic quantity x in the world, encoding it in the firing rates r, and decoding it (operation C) using equation 2.3 results in a probability distribution over x. This uncertainty arises from the stochasticity represented by P[r | x]. Given a loss function, we could then go on to extract a single value from this distribution (operation D).

For most real cases of population codes, encoding cannot be described so crisply. This article describes the inadequacy of one particularly pervasive assumption: that the underlying quantity is a single value, for instance, the single position of a rat in an environment, or the single coherent direction of motion of a set of dots in a direction discrimination task. The assumption is pervasive since this is how one typically works out what a population of cells is encoding and how each cell responds to some particular x. It is inadequate because it cannot capture the subtleties of other experiments, such as those in which rats can be made to be uncertain about their position (Cheng & Gallistel, 1984), or in which one direction of motion predominates yet there are several simultaneous motion directions (Newsome et al., 1989). In many cases, the natural characterization is actually a whole probability distribution P[x | ω] over the value of the variable x, where ω represents all the available information. For instance, for the rat,
this distribution might be the distribution over its possible locations in the environment. The rat can clearly expect to be in exactly one position at any one time, and it therefore makes sense to consider the distribution of uncertainty P [x | ω] as to that position. This is not quite true for the monkey; it could be observing many different directions of motion simultaneously. In this article, we characterize this multiplicity by considering a separation between the direction of motion of a single randomly chosen dot (which gives the equivalent of P [x | ω]) and the actual number of dots present at any one time. Following Hinton (1992) we consider the sum total activity over the population as reporting the latter and the distribution of that activity across the population as reporting the former. This imposes the important constraint that there is some independent standard for how much activity there should be (with which to work out the number of dots) and is clearly not the only possibility.2 Note also that there is no requirement that the animal perform decoding as in equation 2.3, or, indeed, that it explicitly perform decoding at all. That Wilson and McNaughton (1993) can extract the (x, y) coordinates of a rat in a room on the basis of the activities of 25 of its place cells does not mean that the stages of rodent processing subsequent to the hippocampus actually do this. We can now cast two existing classes of proposals for population codes in terms of this framework. 2.2 The Poisson Model. Under the Poisson encoding model, the quantity encoded is indeed one particular value, and the activities of the individual units are independent, with the terms in equation 2.2 specified as
$$P[r_i \mid x] = e^{-f_i(x)}\, \frac{f_i(x)^{r_i}}{r_i!}. \qquad (2.4)$$

The activity rᵢ could, for example, be the number of spikes the cell emits in a fixed time interval following the stimulus onset. A typical form for the tuning function fᵢ(x) is gaussian,

$$f_i(x) \propto e^{-(x - x_i)^2 / 2\sigma^2}, \qquad (2.5)$$
about a preferred value xᵢ for cell i. In terms of Figure 1, turning the quantity x into a set of mean firing rates fᵢ(x) for the units is operation A; sampling the activities rᵢ from these means according to a Poisson distribution is operation B. These operations are descriptive models; they capture the essence of the results of a collection of experiments rather than being based on a biophysical understanding of how x actually causes the activities r.

2 We propose that the magnitude of firing can be used to suggest the multiplicity of inputs as well as their properties (i.e., doubling the firing rate could indicate that perhaps there are two stimuli present). Then one could interpret the remaining pattern of activity as implying multiple distributions, one for each possible stimulus. It is likely, however, that nonlinear processes affect the firing rate under multiple stimuli. For example, Snowden (1989) showed that an MT cell's response to motion in its preferred direction can be suppressed by adding motion in an orthogonal direction.

Several authors have examined maximum likelihood (ML) decoding under the Poisson encoding model (Seung & Sompolinsky, 1993; Salinas & Abbott, 1994, 1995; Snippe, 1996) and analyzed the performance of other decoding methods relative to ML. These methods all focus on extracting a single value for the underlying parameter. The full probability distribution over the quantity x under this Poisson model is given by (Sanger, 1996; Földiák, 1993):
$$P[x \mid \mathbf{r}] \propto P[x] \prod_i e^{-f_i(x)}\, \frac{f_i(x)^{r_i}}{r_i!}. \qquad (2.6)$$
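As a concrete illustration, the following sketch (ours; the peak rate, unit count, and grid are assumptions, not values from the article) samples Poisson activities from gaussian tuning curves as in equations 2.4 and 2.5 and evaluates the posterior of equation 2.6 on a grid under a uniform prior.

```python
# Hypothetical sketch: Poisson encoding of a single value and grid-based
# Bayesian decoding (eq. 2.6) under a uniform prior.
import numpy as np

rng = np.random.default_rng(3)
xi = np.linspace(-10, 10, 50)                    # preferred values
sigma = 0.3                                      # tuning width
peak = 20.0                                      # peak rate (assumed)

x_true = 1.5
r = rng.poisson(peak * np.exp(-(x_true - xi) ** 2 / (2 * sigma ** 2)))

grid = np.linspace(-10, 10, 2001)
F = peak * np.exp(-(grid[:, None] - xi) ** 2 / (2 * sigma ** 2))
log_post = (-F + r * np.log(F + 1e-300)).sum(axis=1)   # r_i! drops out
post = np.exp(log_post - log_post.max())
post /= post.sum()                               # discrete normalization
print(grid[np.argmax(post)])                     # close to x_true
```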
Although the Poisson model is simple and straightforward, it suffers from the assumption criticized above: that there is just a single value x. If the rat is really uncertain about whether it is at location x₁ or location x₂ in the world, then the standard Poisson model has no formal way of turning that uncertainty into activities. In this case, we argued that the natural characterization of the quantity in the world that the activities of the cells encode is now P[x|ω]. We describe below a method of encoding that takes exactly this approach. However, one might argue that even though there is no formal way of encoding uncertainty in the activities, there is a formal way of decoding activities to produce a probability distribution over x. Perhaps one could form P[x|r] in equation 2.6 as a decoding of r to model a diffuse or multimodal P[x|ω]. We now show that this approach turns out to be inadequate.

Consider a one-dimensional case with tuning functions as in equation 2.5. Imagine that activities r are specified directly in some manner on the basis of a whole probability distribution P[x|ω] over x. The goal is to decode the activities r using equation 2.6 and actually represent P[x|ω] adequately. From now on, we will use P̂_r(x) as the decoding distribution over x that is specified by r. This is no longer P[x|r] unless it is a true Bayesian inverse. From equation 2.6, and assuming a uniform prior over x, we have:

$$\log \hat{P}_{\mathbf{r}}(x) = K - \sum_i f_i(x) - \frac{1}{2\sigma^2} \sum_i r_i (x - x_i)^2 \qquad (2.7)$$

$$= K' - \frac{1}{2} \left( \frac{\sum_i r_i}{\sigma^2} \right) \left( x - \frac{\sum_i r_i x_i}{\sum_i r_i} \right)^2 \qquad (2.8)$$
by completing the square, provided that, as in most reasonable cases, there are sufficiently many units that Σᵢ fᵢ(x) is constant in x. But the distribution in equation 2.8
is then gaussian, with mean µ and variance ξ², where

$$\mu = \frac{\sum_i r_i x_i}{\sum_i r_i}, \qquad \xi^2 = \frac{\sigma^2}{\sum_i r_i}.$$
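Both moments are trivial to evaluate from a spike-count vector; in the sketch below (the counts are invented for illustration), the decoded width is already far below the tuning width σ for a modest total count.

```python
# Hypothetical check of the decoded mean and variance formulas above.
import numpy as np

sigma = 0.3
xi = np.linspace(-2, 2, 9)                   # preferred values
r = np.array([0, 0, 3, 7, 12, 6, 2, 0, 0])  # example spike counts (invented)
mu = (r * xi).sum() / r.sum()                # center-of-gravity mean
xi2 = sigma ** 2 / r.sum()                   # decoded variance
print(mu, np.sqrt(xi2), sigma)               # sqrt(xi2) ~ 0.05 << sigma
```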
If we extract a single value from this decoded distribution by simply taking its mean, then this value matches the center-of-gravity estimate (Snippe, 1996). However, if we are interested in the entire distribution, then this standard model cannot capture the range of input distributions P[x|ω] under consideration. First, it will clearly not be possible to match a multimodal P[x|ω] with any fidelity, since this decoded gaussian is unimodal. Second, remembering that the rᵢ ≥ 0 should reasonably be integers, unless they are all 0 the variance of P̂_r(x) is bound to be less than σ², so there is no setting of r that will match a P[x|ω] whose width is greater than that of the tuning functions themselves. Thus this Poisson model is incapable of representing distributions that are broader than the tuning functions. In fact, even though it is not actually being used to specify the activities of the cells, the underlying assumption in the encoding of the Poisson model (as embodied in equation 2.4) is that there is just one value of x that results in the activities of the cells, and thus the Poisson model has trouble encoding anything other than a delta function P[x|ω].

This analysis also applies to the gaussian encoding model. It is not strictly true if the tuning functions are not gaussian or the units have some baseline activity (Snippe, 1996; Pouget et al., 1998). We see later (in Figure 9) a case in which allowing baseline activities permits the Poisson decoding model to produce a multimodal distribution. However, each of the modes is much too narrow. To reinforce this point: if the firing rate were stationary for long enough that one could collect arbitrary numbers of spikes, then one could estimate the true mean activities fᵢ(x) for the cells. Decoding using the Poisson model when one knows the means will almost always lead to a delta function estimate.

2.3 The KDE Model. Anderson (1994) and Anderson and Van Essen (1994) defined a new interpretation for population codes in which the notion of representing probability distributions over x rather than just a single value is essential. This method represents the distribution P̂_r(x) in terms of kernel density estimates (KDEs), forming a linear combination of simple kernel or basis functions ψᵢ(x) associated with each cell, weighted by a normalized function of its activity rᵢ:
$$\hat{P}_{\mathbf{r}}(x) = \sum_i r_i'\, \psi_i(x). \qquad (2.9)$$
Here the r′ᵢ are normalized such that P̂_r(x) is a probability distribution. If the ψᵢ(x) are probability distributions themselves and the rᵢ are all positive, a natural choice is

$$r_i' = \frac{r_i}{\sum_j r_j}. \qquad (2.10)$$
Note that the kernel functions ψᵢ(x) are not the tuning functions fᵢ(x) of the cells that would commonly be measured in an experiment. They need have no neural instantiation; instead, they form part of the interpretive structure for the population code. If the ψᵢ(x) are probability distributions, and so are positive, then the range of spatial frequencies in P[x|ω] that they can reproduce in P̂_r(x) is likely to be severely limited.

In terms of our framework, whereas the Poisson model makes decoding a corollary of (that is, the Bayesian inverse of) the encoding model, the KDE model makes encoding a corollary of the decoding model. Evaluating the KDE model requires us to consider encoding: taking a probability distribution P[x | ω] over x and producing a set of firing rates {rᵢ} such that P̂_r(x) in equation 2.9 approximates P[x | ω] closely. It is the encoding process that standard experiments probe. Presenting single, unambiguous stimuli (the experimental procedure that led to the characterization in the Poisson model) amounts to asking the system to encode delta function distributions P[x | ω] = δ(x − x*) for various x*. The measured responses are then the mean resulting activities fᵢ(x*) ∼ ⟨rᵢ⟩.

One natural way to do encoding is to use the Kullback-Leibler divergence as a measure of the discrepancy between P[x | ω] and Σᵢ r′ᵢψᵢ(x) and use the expectation-maximization (EM) algorithm to fit the {r′ᵢ}, treating them as mixing proportions in a mixture model (Dempster, Laird, & Rubin, 1987). This relies on the {ψᵢ(x)} being probability distributions themselves. The projection method (Anderson, 1994) is an alternative encoding scheme that does not require the iterations of EM but instead uses the L2 distance. This allows rᵢ to be computed as a projection of P[x | ω] onto the tuning functions:

$$r_i = \int_x P[x \mid \omega]\, f_i(x)\, dx. \qquad (2.11)$$
The kernel functions are again assumed to be a fixed implicit property of the cells in this formulation, and the optimal L2 tuning functions fᵢ(x) are derived as

$$f_i(x) = \sum_j \left(A^{-1}\right)_{ij} \psi_j(x); \qquad A_{ij} = \int_x \psi_i(x)\, \psi_j(x)\, dx. \qquad (2.12)$$
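As a concrete illustration of equations 2.9 through 2.12, here is a sketch (ours; the grid discretization of the integrals, the tiny ridge added to A, and all parameter values are assumptions) of projection encoding followed by linear decoding.

```python
# Hypothetical sketch of projection encoding (eqs. 2.11-2.12) and linear
# decoding (eqs. 2.9-2.10); integrals are approximated on a grid.
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
xi = np.linspace(-10, 10, 50)                    # preferred values
sigma = 0.3                                      # kernel width

psi = np.exp(-(x[None, :] - xi[:, None]) ** 2 / (2 * sigma ** 2))
psi /= sigma * np.sqrt(2 * np.pi)                # gaussian kernel densities
A = psi @ psi.T * dx                             # overlap matrix A_ij
# Tiny ridge: A is nearly singular when kernels overlap strongly.
f_opt = np.linalg.solve(A + 1e-9 * np.eye(len(xi)), psi)   # eq. 2.12

P = np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)     # implicit distribution N(0,1)
r = f_opt @ P * dx                               # encode, eq. 2.11
P_hat = (r / r.sum()) @ psi                      # decode, eqs. 2.9-2.10
print(np.sum((P_hat - P) ** 2) * dx)             # small reconstruction error
```

Note that A becomes badly conditioned when the kernels overlap substantially, which is one practical reason for the regularization discussed next.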
These tuning functions are likely to need smoothing or regularizing (Anderson, 1994), particularly if the ψi (x) overlap substantially. In this case, with purely linear encoding and decoding operations, the overall scheme
is a particular sort of linear filter, and many of its properties can be derived from this perspective.

There is a further aspect of P[x | ω] that we might wish to represent in a population code: certainty. Consider reducing the contrast of the moving random dots to near threshold. Then the absolute activity levels of MT cells might represent the certainty that there is actually a stimulus at all. In this case, one might characterize P[x|ω] as a mixture model, with one mixing proportion for the absence of a stimulus and one for a distribution over directions when a stimulus is present. The normalization step in equation 2.10 prevents the KDE from representing this form of certainty, since Σᵢ r′ᵢ = 1. However, certainty could easily be captured. For instance, if there is a maximum value Rmax for the summed actual activities Σᵢ rᵢ, then one could have an indicator variable z ∈ {0, 1} representing the presence (1) or absence (0) of the underlying object and:
$$P[z = 1 \mid \{r_i\}] = \sum_i r_i / R_{max},$$

$$\hat{P}(x; \{r_i\}, z) = \begin{cases} P[x] & \text{if } z = 0, \\[2pt] \sum_i \dfrac{r_i}{\sum_j r_j}\, \psi_i(x) & \text{if } z = 1, \end{cases}$$

where P[x] is the prior distribution over x.
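A minimal sketch of this readout (ours; the kernel count and activity values are invented for illustration) mixes the decoded kernel density with the prior according to the summed activity:

```python
# Hypothetical sketch of the presence/absence readout described above.
import numpy as np

x = np.linspace(-10, 10, 2001)
xi = np.linspace(-10, 10, 10)                    # 10 kernels (assumed)
r = np.array([0., 1, 3, 5, 3, 1, 0, 0, 0, 0])    # weak summed activity
Rmax = 50.0

psi = np.exp(-(x[None, :] - xi[:, None]) ** 2 / (2 * 0.3 ** 2))
psi /= psi.sum(axis=1, keepdims=True)            # discrete kernel densities
prior = np.full_like(x, 1.0 / len(x))            # flat prior P[x]

p_present = r.sum() / Rmax                       # P[z = 1 | r]
P_hat = p_present * (r / r.sum()) @ psi + (1 - p_present) * prior
print(p_present, P_hat.sum())                    # mixture still sums to 1
```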
Note that under this formulation, as the probability of the underlying object's presence approaches zero, the marginalized estimate of the original distribution approaches the prior:

$$\hat{P}_{\mathbf{r}}(x) = \hat{P}(x; \{r_i\}, z = 1)\, P[z = 1 \mid \{r_i\}] + \hat{P}(x; \{r_i\}, z = 0)\left(1 - P[z = 1 \mid \{r_i\}]\right).$$

Of course, as P[z = 1|{rᵢ}] tends to 0, the question of the true underlying distribution becomes moot.

3 The Extended Poisson Model

Given its linear decoding method and a limited number of decoding kernel functions ψᵢ(x), we might expect the KDE model to have difficulty capturing in P̂_r(x) probability distributions P[x|ω] that have high frequencies, such as delta functions. We also saw that the standard Poisson model has the problem of decoding almost any pattern of activities r into something that rapidly approaches a delta function as the activities increase. Is there any middle ground?

We argued that the problem for the standard Poisson model comes from its encoding model (see equation 2.4), which is based on there being a single underlying value x. We can extend this encoding model to allow the
recorded activities r to depend explicitly on a general P[x | ω]. The extended Poisson model is based on an encoding model in which the activity of cell i is Poisson about a mean which, in the continuous version, is

$$\langle r_i \rangle = \int_x P[x \mid \omega]\, f_i(x)\, dx. \qquad (3.1)$$

Note that this equation is identical to the encoding model for the kernel density model (see equation 2.11), except that here the firing rates are stochastic. In the kernel density model, there is no variability in the activities {rᵢ} that encode a particular P[x | ω]. Under the extended Poisson model, the activity of each cell is a filtered, sampled version of the underlying probability distribution, and this implies strong constraints on the potential quality of reconstructions. This model is the most straightforward extension of the conventional Poisson model and makes roughly similar predictions about observable activities when P[x|ω] is a delta function. However, it offers a much more powerful model for representing P[x|ω] that are not delta functions.

Given {rᵢ} generated using Poisson noise from equation 3.1, how should one infer P[x|ω]? Recall that for the standard Poisson model, encoding a single location in a population (P[rᵢ|x]) leads, through decoding, to a probability distribution over possible locations (P[x|r]). Here the encoding model takes a whole function (P[x|ω]) and stochastically produces a set of numbers ({rᵢ}) that provide partial information about that function. The general Bayesian inverse of this is a model that takes the numbers and produces a probability distribution over the functions that could have generated them. In our case, this means that decoding should really produce a probability distribution over probability distributions over the implicit variable x, that is, P[P[x|ω] | r]. Rather than do this, we choose to summarize this distribution over distributions by an approximation to its most likely member; we perform an approximate form of maximum a posteriori (MAP) decoding, not in the value of x but in distributions over x. Figure 2 illustrates the entire encoding and decoding process for the case of implicit probability distributions.

We approximate P[x | ω] by a piecewise constant histogram that takes the value φ̂ⱼ in (xⱼ, xⱼ₊₁], and fᵢ(x) by a piecewise constant histogram that takes the values fᵢⱼ in (xⱼ, xⱼ₊₁]. Then we model the activities {rᵢ} as independent Poisson random variables whose means are (see equation 3.1)

$$\langle r_i \rangle = \sum_j \hat{\phi}_j f_{ij}.$$
Note that this equation is identical to the encoding model for the kernel density model (see equation 2.11), except that here the firing rates are stochastic. In the kernel density model, there is no variability in the activities {ri } that encode a particular P [x | ω]. Under the extended Poisson model, the activity of each cell is a filtered, sampled version of the underlying probability distribution and this implies strong constraints on the potential quality of reconstructions. This model is the most straightforward extension of the conventional Poisson model and makes roughly similar predictions about observable activities when P [x|ω] is a delta function. However, it offers a much more powerful model for representing P [x|ω] that are not delta functions. Given {ri } generated using Poisson noise from equation 3.1, how should one infer P [x|ω]? Recall that for the standard Poisson model, encoding a single location in a population (P [ri |x]) leads, through decoding, to a probability distribution over possible locations (P [x|r]). Here the encoding model takes a whole function (P [x|ω]) and stochastically produces a set of numbers ({ri }) that provide partial information about that function. The general Bayesian inverse to this is a model that takes the numbers and produces a probability distribution over the functions that could have generated them. In our case, this means that decoding should really produce a probability distribution over probability distributions over the implicit variable x, that is, P [P [x|ω] | r]. Rather than do this, we choose to summarize this distribution over distributions by an approximation to its most likely member; we perform an approximate form of maximum a posteriori (MAP) decoding, not in the value of x but in distributions over x. Figure 2 illustrates the entire encoding and decoding process for the case of implicit probability distributions. We approximate P [x | ω] as a piece-wise constant histogram that takes the value φˆj in (xj , xj+1 ], and fi (x) by a piece-wise constant histogram that take the values fij in (xj , xj+1 ]. Then we model activities {ri } as being independent Poisson random variables whose means are (see equation 3.1) X φˆj fij . hri i = j
Then the true inverse distribution is: h
i
h
i
P {φˆj } | {ri } ∝ P {φˆj } e
−
P
φˆ f ij j ij
ri Y X φˆj fij , i
where P [{φˆj }] is the prior over the {φˆj }.
j
(3.2)
[Figure 2 appears here: a schematic in which a whole implicit distribution P(x|ω) is encoded into activities r via P[r | P(x|ω)] and decoded into a distribution over distributions P[P(x|ω) | {r}].]

Figure 2: A set of firing rates may also be interpreted as encoding a probability distribution in implicit space. Decoding the rates now involves forming a probability distribution over possible implicit distributions, P[P[x|ω] | r]. The decoding distribution P̂_r(x) may be formed from this distribution over distributions by integrating or through a maximum a posteriori computation. The extended Poisson model forms P̂_r(x) using an approximate form of ML in distributions over x.
If the system is translation invariant, then Σᵢ fᵢⱼ = f is constant for all j. Because Σⱼ φ̂ⱼ = 1 (it represents a probability distribution), e^{−Σᵢⱼ φ̂ⱼfᵢⱼ} = e^{−f}, which is constant. Taking logs,

$$\log P[\{\hat{\phi}_j\} \mid \{r_i\}] = K + \log P[\{\hat{\phi}_j\}] + \sum_i r_i \log \sum_j \hat{\phi}_j f_{ij}, \qquad (3.3)$$
where K is a constant. Maximum a posteriori decoding in this context requires finding the set of {φ̂ⱼ} that sum to 1 and maximize this expression. If log P[{φ̂ⱼ}] is dominated by a smoothness prior such as

$$\sum_j \left( \hat{\phi}_j - \hat{\phi}_{j+1} \right)^2, \qquad (3.4)$$
then we are left with the following expression:

$$\mathrm{MAP}(\{\hat{\phi}_j\}) = \sum_i r_i \log \sum_j \hat{\phi}_j f_{ij} \;-\; \epsilon \sum_j \left( \hat{\phi}_j - \hat{\phi}_{j+1} \right)^2, \qquad (3.5)$$
where ε is a weighting coefficient on the smoothness prior.

Thus, the extended Poisson model creates a decoding distribution P̂_r(x) that approximates MAP inference from the distribution over distributions P[P[x|ω] | {rᵢ}]. The values of this decoding distribution can be found in a number of ways. One simple method adjusts {φ̂ⱼ} via gradient ascent on MAP({φ̂ⱼ}). Alternatively, one can use a version of EM to find the MAP values. For this, one interprets Σⱼ φ̂ⱼ fᵢⱼ as a mixture model for case i, where the {φ̂ⱼ} are the mixing proportions and the {fᵢⱼ} are the fixed values of the underlying distributions. rᵢ is then the weighting for case i, and the resulting expression in equation 3.3 therefore acts just like a likelihood. In practice, a smoothness prior such as equation 3.4 is required for regularization. For the experiments described below, we implemented a crude approximation by averaging adjacent φ̂ⱼ after each EM iteration.

With this method of decoding in mind, we can now see how the extended Poisson model competes with the KDE model as a way of representing probability distributions. By comparison with the linear equation 2.9, equation 3.5 offers a nonlinear way of combining a set of activities {rᵢ} to give a probability distribution P̂_r(x) over the underlying variable x. The computational complexity of equation 3.5 is irrelevant, since decoding is only an implicit operation that the system need never actually perform. Another way of looking at the difference between the models is that the extended Poisson model is a Bayesian decoding method, and thus involves a multiplication of tuning functions (assuming the units are independent), while the KDE model is a basis function method, in which decoding entails a summation of kernels. We will see the consequences of this difference in the simulations below.

Finally, just as for the KDE model, this form of decoding does not capture well the certainty in the presence of a stimulus. A natural extension to the model in equation 3.5 is to estimate φ̂ⱼ just as before, but to set the approximation to P̂_r(x) to be φ̂ × φ̂ⱼ in the interval (xⱼ, xⱼ₊₁], where

$$\hat{\phi} = \min\!\left[1,\; \frac{\sum_i r_i}{\sum_{i,j} f_{ij}}\right]. \qquad (3.6)$$
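The EM decoder just described is compact to implement. In the sketch below (ours; the 1-2-1 averaging kernel is one way to realize the "averaging adjacent φ̂ⱼ" step, and all parameters are assumptions), it recovers a bimodal implicit distribution from Poisson activities generated as in equation 3.1.

```python
# Hypothetical sketch of extended Poisson decoding: EM on the mixing
# proportions phi_j of eq. 3.3, with crude smoothing between iterations.
import numpy as np

rng = np.random.default_rng(4)
xi = np.linspace(-10, 10, 50)                    # preferred values
sigma = 0.3
xj = np.linspace(-10, 10, 500)                   # histogram bin centers
F = np.exp(-(xi[:, None] - xj[None, :]) ** 2 / (2 * sigma ** 2))  # f_ij

# Encode a bimodal implicit distribution (eq. 3.1), with Rmax = 50.
P = np.exp(-(xj - 2) ** 2 / 2) + np.exp(-(xj + 2) ** 2 / 2)
P /= P.sum()
r = rng.poisson(50 * F @ P)

phi = np.full(len(xj), 1.0 / len(xj))            # uniform initialization
for _ in range(200):
    p = phi * F                                  # E-step: responsibilities
    p /= p.sum(axis=1, keepdims=True) + 1e-300
    phi = (r[:, None] * p).sum(axis=0) / r.sum() # M-step: mixing proportions
    phi[1:-1] = (phi[:-2] + 2 * phi[1:-1] + phi[2:]) / 4   # smooth
    phi /= phi.sum()
print(xj[np.argmax(phi)])                        # near one of the modes, +-2
# Plotting phi against xj shows both bumps of the bimodal distribution.
```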
4 Comparing the Models

We now compare the ability of these interpretation methods to extract the probabilistic information contained in a population code. The primary question of interest is whether we can find a set of deterministic activities {rᵢ} that make P̂_r(x) a close approximation to an interesting implicit distribution P[x|ω]. We take three instances of implicit distributions intended to model the primary cases in which interpreting a population code as a probability distribution is essential:

1. Some uncertainty exists about the location of the target. The aim is to approximate correctly the mean and the uncertainty about this mean.

2. The target could be in one of two locations. A bimodal distribution could arise due to insufficient information, as in the case of the rat, or due to the actual existence of multiple values.

3. Uncertainty exists about the presence of the target. Here the magnitude of the integral under the implicit distribution is intended to capture the degree of certainty in the presence of the target.

Finally, we also examine the noise robustness of the methods.

In each of these cases, we consider a simplified situation in which the location of the object varies along only one dimension. This eases visualization of network activity and decoded distributions, but each model could readily be extended to include other dimensions. For each model, the population code consisted of 50 units, with preferred values xᵢ spaced uniformly in the range x = [−10, 10]. Associated with each unit was a gaussian distribution, N(xᵢ, σ = 0.3). In the KDE model, these 50 gaussians were the kernels ψᵢ(x), while in the Poisson and extended Poisson models, they were the tuning functions fᵢ(x).

For the KDE model, we used two methods to find the set of activities that provided the best fit between the true implicit distribution P[x|ω] and the estimate P̂_r(x) (see equation 2.9). The projection method computes the rates {rᵢ} as a projection of P[x|ω] onto the tuning functions, where the optimal tuning functions are derived from the fixed kernels (see equations 2.11 and 2.12). The EM method adapts {rᵢ} to minimize the Kullback-Leibler divergence between the estimated and true distributions. For the Poisson and extended Poisson models, we again computed the rates {rᵢ} as a projection of P[x|ω] onto the tuning functions, where the tuning functions are now the fixed {fᵢ(x)}. In the Poisson model, we decode directly into the ML values of equation 2.8, while in the extended Poisson model, we decode using a version of EM to find the MAP values of equation 3.5.

In all of these experiments, we limited the activities {rᵢ} to integer values. Note that in the case of the KDE model, these activities are then normalized as part of the decoding process (see equation 2.10). Finally, in
[Figure 3 appears here: three panels comparing the true P[x|ω] (gaussian, τ = 1.0) with the KDE (Proj.), KDE (EM), and E-Poisson reconstructions.]
Figure 3: All three methods can reconstruct the original implicit gaussian distribution over a range of values of τ , the standard deviation of this gaussian. Here τ = 1.0.
all simulations presented here, Rmax = 50, and the number of histogram bins in the extended Poisson model was 500. Table 1 summarizes these methods.

We also require some quantitative way of comparing the reconstructions of the various models. Although the Kullback-Leibler divergence (the implicit metric for both the extended Poisson and KDE-EM methods) is the natural measure of the difference between two probability distributions, it cannot be used here, since the reconstructions from the KDE-projection method are not strict distributions (they are not guaranteed to be nonnegative). We therefore used the somewhat less informative squared error measure:

$$E = \sum_j \left[ \hat{P}_{\mathbf{r}}(x_j) - P[x_j \mid \omega] \right]^2. \qquad (4.1)$$
4.1 Uncertainty in Target Location. For these simulations, the implicit distribution is a gaussian:
P[x|ω] = N(0, τ). All three methods form good estimates for a range of values of τ, the width of the true implicit distribution (for example, see Figure 3). However, as predicted, both KDE methods are unable to represent narrow distributions accurately; for example, τ = 0.2 (see Figure 4). In general, the critical variable in the fidelity of the KDE estimate is the ratio of the width σ of the decoding kernels to the width τ of the true implicit distribution. The extended Poisson method is able to capture both narrow and wide implicit distributions, so it can form accurate interpretations of distributions with a relatively high variance, as well as delta functions.
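This width effect is easy to reproduce. The sketch below (ours; it uses noiseless, unrounded rates and our own discretization, so it isolates the representational limit rather than reproducing the full simulations) computes the squared error of equation 4.1 for the KDE projection method at τ = 1.0 and τ = 0.2.

```python
# Hypothetical reproduction of the tau-vs-sigma fidelity effect for the
# KDE projection method (error as in eq. 4.1, on our grid).
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]
xi = np.linspace(-10, 10, 50)
sigma = 0.3

psi = np.exp(-(x[None, :] - xi[:, None]) ** 2 / (2 * sigma ** 2))
psi /= sigma * np.sqrt(2 * np.pi)                # kernel densities
A = psi @ psi.T * dx
f_opt = np.linalg.solve(A + 1e-9 * np.eye(len(xi)), psi)   # eq. 2.12

for tau in (1.0, 0.2):
    P = np.exp(-x ** 2 / (2 * tau ** 2)) / (tau * np.sqrt(2 * np.pi))
    r = f_opt @ P * dx                           # projection encoding
    P_hat = (r / r.sum()) @ psi
    print(tau, np.sum((P_hat - P) ** 2) * dx)    # error jumps for tau = 0.2
```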
Table 1: Summary of the Key Operations of the Three Interpretation Methods.

Extended Poisson:
- Encode: ⟨rᵢ⟩ = h[ ∫ₓ P[x|ω] fᵢ(x) dx ], with fᵢ(x) = Rmax N(xᵢ, σ)
- Decode: P̂_r(x) chosen to maximize L, with r̂ᵢ = ∫ₓ P̂_r(x) fᵢ(x) dx ≈ Σⱼ φ̂ⱼ fᵢⱼ
- Likelihood: L = log P[{φ̂ⱼ} | {rᵢ}] ≈ Σᵢ rᵢ log r̂ᵢ
- Error: G = Σᵢ rᵢ log(rᵢ / r̂ᵢ)

KDE (EM):
- Encode: ⟨rᵢ⟩ = h[ Rmax r′ᵢ ], with the r′ᵢ chosen to maximize L
- Decode: P̂_r(x) = Σᵢ r′ᵢ ψᵢ(x), with r′ᵢ = rᵢ / Σⱼ rⱼ
- Likelihood: L = ∫ₓ P[x|ω] log P̂_r(x) dx
- Error: G = ∫ₓ P[x|ω] log( P[x|ω] / P̂_r(x) ) dx

KDE (Projection):
- Encode: ⟨rᵢ⟩ = h[ Rmax ∫ₓ P[x|ω] fᵢ(x) dx ], with fᵢ(x) = Σⱼ (A⁻¹)ᵢⱼ ψⱼ(x) and Aᵢⱼ = ∫ₓ ψᵢ(x) ψⱼ(x) dx
- Decode: P̂_r(x) = Σᵢ r′ᵢ ψᵢ(x), with r′ᵢ = rᵢ / Σⱼ rⱼ
- Error: E = ∫ₓ [ P̂_r(x) − P[x|ω] ]² dx

Note: h[·] is a rounding operator to ensure integer firing rates, and ψᵢ(x) = N(xᵢ, σ) are the kernel functions for the KDE methods. The certainty calculation is not included in this table.
[Figure 4 appears here: three panels comparing the true P[x|ω] (gaussian, τ = 0.2) with the KDE (Proj.), KDE (EM), and E-Poisson reconstructions.]
Figure 4: The KDE method has difficulty capturing the high-frequency information in the implicit gaussian distribution, when its standard deviation, τ = 0.2, is smaller than the kernel widths (σ = 0.3).
4.2 Multiple Locations. A multimodal implicit distribution can be described as a mixture of two gaussians:

$$P[x \mid \omega] = 0.5\, N(x_1, \tau) + 0.5\, N(x_2, \tau).$$

Here we can model the situation in which the rat is uncertain whether the target is at location x₁ or x₂. The variable τ describes an additional degree of uncertainty about the exact location of each potential target. For these simulations, we let x₁ = 2 and x₂ = −2. To get a sense of the different encoding models, the expected activities of the 50 units in the population code are plotted in Figure 5. For τ = 1.0, the KDE-EM method has converged to a local minimum, which is why one of the firing rates looks odd. However, because the distribution being modeled is so broad, this makes very little difference to the net quality of the reconstruction.

Applying the respective decoding models to these expected firing rates, we find results similar to those of the previous section. Both KDE methods can capture a bimodal implicit distribution whose width exceeds that of the kernels (see Figure 6), yet they cannot accurately reconstruct narrow distributions (see Figure 7). The extended Poisson model matches both types of implicit distributions with high fidelity.

For the sake of comparison, Figure 8 shows the decoded distribution formed by the simple Poisson model (see equation 2.8). Here we use the same encoding model as in the extended Poisson method, in order to allow the unit activities to convey information about the entire implicit distribution. Nonetheless, the simple decoding model always produces a delta function estimate. An estimated delta function fits the case of the narrow unimodal implicit distribution shown in Figure 4, but it cannot match the bimodal implicit distribution here.
[Figure 5 appears here: expected firing rates for the KDE-Proj, KDE-EM, and E-Poisson encodings of the bimodal distribution, for τ = 0.2 (left column) and τ = 1.0 (right column), with the true P[x|ω] shown below.]
Figure 5: Each method’s encoding model specifies how the expected firing rates of units in the population are based on the implicit distribution. The implicit distribution here is a bimodal gaussian with standard deviation τ . The firing rates on the left are for τ = 0.2 and on the right for τ = 1.0. In both cases, Rmax = 50.
[Figure 6 appears here: three panels comparing the true bimodal P[x|ω] (τ = 1.0) with the KDE (Proj.), KDE (EM), and E-Poisson reconstructions.]
Figure 6: All three methods provide a good fit to the bimodal gaussian distribution when its variance is sufficiently large (τ = 1.0).
[Figure 7 appears here: three panels comparing the true bimodal P[x|ω] (τ = 0.2) with the KDE (Proj.), KDE (EM), and E-Poisson reconstructions.]
Figure 7: The KDE method again has difficulty capturing the high-frequency information in the implicit bimodal gaussian distribution when its standard deviation, τ = 0.2, is smaller than the kernel widths (σ = 0.3).
Figure 8: The simple Poisson decoding model leads to a delta function for the estimated implicit distribution. Here the implicit distribution is the bimodal gaussian, τ = 1.0. The unit activities r are the same as for the extended Poisson model, and the decoding is done using equation 2.8.
This result is predicted based on the analysis presented in section 2.2. This analysis, however, applied to the case of gaussian tuning functions in the absence of any baseline activity. Using simulations, we examine what happens when the units have some baseline activity. Here it is necessary to change the encoding model so that the unit's expected firing rate is the sum
Figure 9: When we extend the simple Poisson decoding model to include baseline firing rates (Rb = 5.0 here), the estimated implicit distribution can be multimodal. Yet this decoded distribution does not contain any variance about these values. The implicit distribution is again the bimodal gaussian, τ = 1.0, and the unit activities r are the same as for the extended Poisson model.
of a gaussian tuning function and a baseline firing rate, Rb:

fi(x) ∝ exp(−(x − xi)²/2σ²) + Rb.    (4.2)
Again we form the decoding distribution as the Bayesian inverse of this new encoding model:

P[x|r] ∝ P[x] ∏i e^(−fi(x)) fi(x)^ri / ri!
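This inversion can be evaluated numerically on a grid. A minimal sketch follows, under our own assumptions: a flat prior P[x], illustrative parameter values, and work in log space for numerical safety; the ri! factor is independent of x and is dropped.

```python
import numpy as np

SIGMA, R_B = 0.3, 5.0
xs = np.linspace(-10, 10, 401)        # grid over the implicit variable
centers = np.linspace(-10, 10, 50)    # tuning centers, assumed evenly spaced

# f_i(x): gaussian tuning plus baseline, cf. equation 4.2.
f = np.exp(-(xs[None, :] - centers[:, None]) ** 2 / (2 * SIGMA ** 2)) + R_B

def decode(r):
    """Posterior over x given integer spike counts r, assuming a flat prior.

    log P[x|r] = sum_i (r_i log f_i(x) - f_i(x)) + const; the r_i! term
    does not depend on x, so it is omitted.
    """
    log_post = (r[:, None] * np.log(f) - f).sum(axis=0)
    log_post -= log_post.max()        # avoid overflow before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```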
For the simulations, we again use the encoding model of the extended Poisson model, but now use this modified simple Poisson model for decoding. An example of the results is shown in Figure 9. The decoded distribution is able to take on multiple modes, yet it is always highly peaked, due to the likelihood being a product of Poisson probabilities. Thus, this decoding method can recover multiple implicit values but cannot capture any uncertainty in these values.

4.3 Uncertainty in Object Presence. The next set of simulations addresses the issue of object presence. We use a dampened gaussian to model
Figure 10: This plot shows the squared error (see equation 4.1) for different values of c, the total integral under the implicit distribution, for both the extended Poisson and KDE projection methods. In these simulations, Rmax = 50.
the situation in which uncertainty exists in the actual presence of the target: cN(0, τ), 0 ≤ c ≤ 1. Ideally, c would not be restricted to be less than 1. Instead, it could take on any positive value and thereby describe the actual number of instances of the implicit variable. Here we consider the simpler situation in which only one instance of the object is represented. In these simulations, we compare the KDE model using the projection method to the extended Poisson model. We set τ to be 1.0, because both methods are able to match this distribution with high fidelity when c = 1.0. The primary empirical result is that while both methods have difficulty as the presence c gets close to 0, both are able to recover a variety of nonnormalized gaussian implicit distributions, as shown in Figure 10. The main reason for the poor performance as c decreases is that the firing rates are forced to be integers. Clearly, increasing Rmax would extend the range of good reconstructions.

4.4 Noise Robustness. In the previous simulations, the activities of the units in the population code were equal to their expected responses according to the respective encoding models. We now examine the robustness of the interpretation methods by treating the unit activities as Poisson random variables. Each simulation involves a stochastic sampling of the unit
Figure 11: The expected error ⟨E⟩ in the decoded distribution, averaged over 50 samples from the firing rate distributions, is plotted against different values of τ, the width of the bimodal gaussian implicit distribution. The results for only one of the KDE methods are shown because both KDE methods involve the same decoding operation.
responses. The robustness of a method is estimated by computing the expected error in the decoding distribution with respect to the true implicit distribution, that is, averaging the squared-error metric (see equation 4.1) over 50 stochastic trials. The results of this set of simulations again match the predictions based on the contrast between the methods, as shown in Figure 11. Because the decoding in the extended Poisson model is nonlinear, we predict that this model will be more robust to noise than the KDE model, in which decoding is linear. Figure 11 makes it appear that both methods produce perfect reconstructions for larger values of τ . This is in fact not true and is largely due to the magnitude of the expected error for small values of τ . For the extended Poisson model, the regularization removes one component of the error: inaccuracies in the shape of the reconstructed distribution. This process is more effective as τ gets larger. However, the smoothing does not remove the second component: the bias that is present if the centers of the two gaussians in the reconstructed distribution are incorrect. This bias component of the error is slightly higher for low values of τ but relatively constant and nonzero for τ > 0.4.
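The robustness protocol just described can be summarized in a few lines. In the sketch below, `expected_rates`, `decode`, and `p_true` are placeholders (our names, not the authors') for the encoding model's mean rates, a decoder such as the one above, and the true implicit distribution on the same grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_error(expected_rates, decode, p_true, n_trials=50):
    """Average squared error of the decoded distribution over stochastic
    trials, with unit activities drawn as Poisson random variables."""
    errors = []
    for _ in range(n_trials):
        r = rng.poisson(expected_rates)              # noisy unit responses
        p_hat = decode(r)                            # decoded distribution
        errors.append(np.sum((p_hat - p_true) ** 2)) # squared-error metric
    return np.mean(errors)
```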
5 Discussion

We have presented a theoretical framework for understanding population codes that generalizes naturally to the important case in which the population provides information about a whole probability distribution over an underlying quantity rather than just a single value. We used the framework to analyze two existing models and to suggest and evaluate a third model for encoding such probability distributions.

More informally, we have tried to examine the consequences of the seemingly obvious step of saying that if a rat, for instance, is uncertain about whether it is at one of two places, then place cells representing both places could be activated. The complications come because the structure of the interpretation changes; for instance, one can no longer think of ML methods to extract a single value from the code directly.

We are not suggesting that the uncertainty is generated at any particular step in the processing system. Rather, it is a type of information that is potentially contained in the population activity, about which inferences can be made from one level to the next. So the rat need not be aware that it is uncertain; we are not positing anything about "conscious" uncertainty. Instead, different firing patterns corresponding to different distributions over the underlying implicit variables are all that is needed to infer the uncertainty at the level above. This leads to the prediction that place cells corresponding to multiple locations will be active when the rat is uncertain as to its location. This uncertainty may be present even at the sensor level. A population of orientation-selective cells should have a different pattern of firing to a bar at one orientation and a blurred image of the same bar in which the orientation is "uncertain." This fuzzier bar (as well as the sharper bar) can be represented in terms of a probability distribution over orientation.

One main result of our framework is a method for encoding and decoding probability distributions that is the natural extension of the standard Poisson model for encoding and decoding single values. We also showed that this standard model is inadequate for coding probability distributions. Under the new encoding model, the activity of a cell has Poisson statistics about a mean that is dependent on the integral of the whole encoded probability distribution, weighted by the tuning function of the cell. The behavior of this encoding model is appropriate in circumstances such as the positional uncertainty of the rat. We suggested a particular decoding model, based on an approximation to ML decoding of a discretized version of the whole probability distribution. The resulting calculations require a form of the EM algorithm, regularized by a smoothing operation. We showed that this nonlinear decoding method works well in a variety of circumstances, reconstructing broad, narrow, and multimodal distributions more accurately than either the standard Poisson model or the kernel density model. Stochasticity is built into our method, since the units are supposed to have Poisson statistics, and it is therefore also quite robust to noise.
Various aspects of this scheme merit discussion. First, we readily acknowledge that the decoding model is quite nonbiological, involving an implausible iterative computation. The point of our particular decoding model was to show explicitly a lower bound to the veracity with which a set of activities can code a distribution. One might expect the subsequent stages of processing in the brain to do one of two things with the information in the population:

1. Integrate it with information represented in other population codes to form a combined population code (e.g., combining uncertain information about the relative position of two landmarks to generate the activity of a population code formed of place cells).

2. Extract a single value from it to control behavior (e.g., pull a lever to report the best-supported direction for the motion of the dots or choose where to explore for food in a maze).

In both cases, the extraction is presumably performed through standard neural operations such as taking nonlinear weighted sums and, possibly, products of the activities. We are interested in how much information is preserved by such operations, as measured against the nonbiological standard of our decoding method.

The first issue—how to integrate two or more population codes to generate the output in the form of another population code—was stressed by Hinton (1992), who noted that it directly relates to Ballard's (1981) notion of generalized Hough transforms. This question is particularly important because of the apparent ubiquity of population coding in the brain. It is not at all obvious how simple and local combination methods could be capable of preserving and combining probabilistic information in the population, and we are studying this question, using the EM-based decoder to generate targets and using local learning rules. One interesting theoretical concern is that the population code output of such a combination might not have exactly the same form as the population code inputs. For instance, it might not be completely accurate to characterize the cells as having Poisson statistics based on a gaussian tuning function. In this case, one could formally calculate the true statistical interpretation of the combined code. However, in the brain, there does not appear to be a great difference between the population codes near to the input and the population codes in deeper areas, such as MT. This actually places a strong constraint on the method of combination.

One special concern for combination is how to understand noise. For instance, the visual system can be behaviorally extraordinarily sensitive, detecting just a handful of photons. However, the outputs of real cells at various stages in the system are quite noisy, with apparent Poisson statistics. If noise is added at every stage of processing and combination, then the final population code will not be very faithful to the input. There is much current
and confusing research on the issue of the creation and elimination of noise in cortical synapses and neurons (Shadlen & Newsome, 1994; Mainen & Sejnowski, 1995; Bair & Koch, 1996). Correlated noise presents extra and different concerns.

A further concern for combination is the basis function strategy apparently adopted in parietal cortex, in which the position of an object in space is reported by neurons that have a retinotopic visual receptive field and multiplicative modulation from the position of the eyes in their orbits (Andersen, Essick, & Siegel, 1985; Zipser & Andersen, 1988; Pouget & Sejnowski, 1995; Salinas & Abbott, 1996). Multiplicative modulation based on the locus of attention has also been found in V4 (Connor, Gallant, Preddie, & Van Essen, 1996), and it has been suggested as being a general computational strategy (Poggio, 1990; Pouget & Sejnowski, 1997). The statistical effects of the multiplicative modulation remain to be investigated.

The second issue—extracting a single value from the population—is also important, particularly at the interface into the motor system. Some empirical data about how this is accomplished come from the work on biasing the choices of the monkeys as to the directions of motion of the random dot stimuli through electrical microstimulation in MT (Salzman & Newsome, 1994). If the dots were moving in one direction but the electrical stimulation favored a different direction, then the monkeys would typically choose one or other of the two directions, rather than something like the mean direction. In our framework, we would regard the simultaneous activity of the MT cells that prefer the two directions as encoding a distribution and postulate an output extraction process (thought to be in the lateral intraparietal area; Shadlen & Newsome, 1996) that chooses a single value on the basis of this distribution. Short of knowing exactly the effects of the electrical stimulation on the activity of the MT cells, it is hard to use the experiment to confirm or reject this hypothesis.

A final issue that we have addressed is that of certainty or magnitude. Hinton's (1992) idea of using the sum total activity of a population to code the certainty in the existence of the quantity they represent is attractive, provided that there is some independent way of knowing what the scale is for this total. We used this scaling idea for both the kernel density and the extended Poisson models. In fact, we can go one stage further and interpret greater activity still as representing information about the existence of multiple objects or multiple motions. However, this treatment seems less appropriate for the place cell system; the rat presumably is always certain that it is somewhere. There, it has been suggested that the absolute level of activity could be coding something different, such as the familiarity of a location.

An entire collection of cells is a terrible thing to waste on representing just a single value of some quantity. Representing a whole probability distribution, at least with some fidelity, is not more difficult, provided that the
interpretation of the encoding and decoding is clear. Here we have suggested some steps in this direction.

Acknowledgments

We thank Charlie Anderson, Terry Sanger, and Larry Abbott for helpful discussions and an anonymous reviewer for useful comments. R. Z. was supported by the McDonnell Foundation, grant JSMF 95-1; P. D. by NIMH R29 MH 55541-01 and the Surdna Foundation; and A. P. by a grant from the DOD, grant DAMD17-93-V-3018.

References

Andersen, R. A., Essick, G. K., & Siegel, R. M. (1985). Encoding of spatial location in posterior parietal neurons. Science, 230, 456–458.
Anderson, C. H. (1994). Basic elements of biological computational systems. International Journal of Modern Physics C, 5(2), 135–137.
Anderson, C. H., & Van Essen, D. C. (1994). Neurobiological computational systems. In J. M. Zureda, R. J. Marks, & C. J. Robinson (Eds.), Computational Intelligence Imitating Life (pp. 213–222). New York: IEEE Press.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving Macaque monkey. Neural Computation, 8(6), 1185–1202.
Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.
Ballard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13(2), 111–122.
Cheng, K., & Gallistel, C. R. (1984). Testing the geometric power of an animal's spatial representation. In W. L. Roitblat (Ed.), Animal Cognition (pp. 409–423). Hillsdale, NJ: Erlbaum.
Connor, C. E., Gallant, J. L., Preddie, D. C., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. Journal of Neurophysiology, 75, 1306–1308.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1–38.
Földiák, P. (1993). The "ideal homunculus": Statistical inference from neural population responses. In F. H. Eeckman & J. Bower (Eds.), Computation and Neural Systems (pp. 55–60). Norwell, MA: Kluwer Academic Publishers.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 243, 1416–1419.
Hinton, G. E. (1992). How neural networks learn from experience. Scientific American, 267(3), 145–151.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268(5216), 1503–1506.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 53–54.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely-moving rat. Brain Research, 34, 171–175.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposium on Quantitative Biology, 55, 899–910.
Pouget, A., & Sejnowski, T. J. (1995). Spatial representations in the parietal cortex may use basis functions. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in Neural Information Processing Systems 7 (pp. 157–164). Cambridge, MA: MIT Press.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population codes. Neural Computation, 10(2), 373–401.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.
Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. Journal of Neuroscience, 15(10), 6461–6474.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proceedings of the National Academy of Sciences of the United States of America, 93, 11956–11961.
Salzman, C. D., & Newsome, W. T. (1994). Neural mechanisms for forming a perceptual decision. Science, 264, 231–237.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. Journal of Neurophysiology, 76(4), 2790–2793.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1996). Motion perception: Seeing and deciding. Proceedings of the National Academy of Sciences, USA, 93, 628–633.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8(3), 511–530.
Snowden, R. (1989). Motions in orthogonal directions are mutually suppressive. Journal of the Optical Society of America, 6(7), 1096–1101.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Zemel, R. S., & Hinton, G. E. (1995). Developing population codes by minimizing description length. Neural Computation, 7(3), 549–564.
Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature, 331, 679–684.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.

Received January 6, 1997; accepted June 25, 1997.
Communicated by Mikhail Tsodyks
Stable and Rapid Recurrent Processing in Realistic Autoassociative Memories
Francesco P. Battaglia
Alessandro Treves
SISSA - Cognitive Neuroscience, Trieste, Italy
It is shown that in those autoassociative memories that learn by storing multiple patterns of activity on their recurrent collateral connections, there is a fundamental conflict between dynamical stability and storage capacity. It is then found that the network can nevertheless retrieve many different memory patterns, as predicted by nondynamical analyses, if its firing is regulated by inhibition that is sufficiently multiplicative in nature. Simulations of a model network with integrate-and-fire units confirm that this is a realistic solution to the conflict. The simulations also confirm the earlier analytical result that cue-elicited memory retrieval, which follows an exponential time course, occurs in a time linearly related to the time constant for synaptic conductance inactivation and relatively independent of neuronal time constants and firing levels.

1 Introduction

Autoassociative networks, or free simple memories in David Marr's terms (Marr, 1971), have been considered one of the fundamental building blocks of brain function (Little, 1974; Kohonen, 1977; Hopfield, 1982). In an autoassociative memory, all the components of a distributed representation of a memory item are associated together by Hebbian (Hebb, 1949) synaptic plasticity, enabling the modified synaptic matrix to retrieve the whole representation when some of the components are presented later as a cue. The difference with heteroassociative memories (Willshaw, Buneman, & Longuet-Higgins, 1969) is that in the latter, a representation of item X, or part of it, is used as a cue to retrieve the representation of a different item, Y. The representation of X is distributed over the input lines to a group of units, the output lines of which encode Y. In autoassociative memories, the item is the same, but the degree to which there is a differentiation between input and output may vary across possible autoassociative architectures. Two limiting cases are useful to illustrate the spectrum of possibilities (Treves & Rolls, 1991). In a purely recurrent network, the output lines from a group of units have recurrent collateral branches that provide inputs to the same units. One can think of each memory item as having a unique representation, defined

Neural Computation 10, 431–450 (1998)
© 1998 Massachusetts Institute of Technology
as the pattern of activity distributed across the units. A representation is retrieved by repeatedly circulating activity through the recurrent loops until a steady state is reached. The existence, stability, and characteristics of steady states can be analyzed mathematically with self-consistent statistics (Amit, Gutfreund, & Sompolinsky, 1987). A different architecture consists of several (L) groups of units in cascade, with purely feedforward connections from one group to the next one. The retrieval operation is, as it were, unwrapped along the L stages, with the ongoing pattern of activity at each stage increasingly approaching the target representation. Independent representations of the same memory item exist at each stage. Since the system is feedforward, simpler statistics are sufficient to analyze its operation (Domany, Kinzel, & Meir, 1989). Many intermediate possibilities exist, of course, but already at a very abstract level, one can point to three advantages that favor architectures closer to the recurrent than to the feedforward limit: (1) the number of units and connections required is reduced by a factor L; (2) if feedback reverberation is sufficient, it can sustain the activation of a representation over time, endowing the network with an additional capability for short-term memory (Amit, 1995); and (3) it is easier to store each item in the memory by forming the one required representation rather than L representations over L groups of units. A disadvantage of architectures dominated by feedback is that they suffer more from crosstalk, or interference, when many items are stored on the same connections; this disadvantage disappears if the coding becomes sparse, as revealed by analyses of the storage capacity (Tsodyks & Feigelman, 1988; Treves & Rolls, 1991).

These characterizations of the advantages of different architectures, which have been derived using simple formal models, are important in understanding the autoassociative function that may be served by real networks in the brain. When considering autoassociative memory as implemented in the brain, two additional aspects need to be studied that cannot be fully understood using models that are too simple.

The first aspect is the time required for the retrieval of a representation. In the simpler models, time is typically discretized into time steps. If one then contrasts a recurrent network, in which a representation is retrieved (to a required degree of accuracy) after L sweeps through the single group of units, with a multilayer feedforward network consisting of L stages activated in succession, the time required for retrieval is apparently the same. Obviously such a description has little to do with the dynamics of real neurons, and trying to construct a biophysical equivalent of the time step (e.g., the membrane time constant, or the typical interspike interval) does not lead to any real understanding. What is needed at the least is a study of formal models based on a description of real neurons as integrate-and-fire units (Lapique, 1907; Eccles, 1957) and of real synaptic transmission as conductance changes (Eccles, 1964). An analysis of the dynamics of an autoassociative recurrent network model built with such components has yielded part of the answer
as to the time scales for retrieval (Treves, 1993). The partial answer is an analytical formula for the time constants of the exponentially decaying transient modes, through which firing activity in the network approaches the firing at steady state. There are many different modes, each of which has a time constant with a real part, describing the rate of decay of the mode, and an imaginary part, specifying the frequency of the oscillations accompanying the decay. An important family of transients has the real part of the time constant determined by the rate of inactivation of the synaptic conductances opened by activity on the recurrent collaterals. Since such a rate of inactivation in the brain is typically short (10–20 msec, even when taking into account the dendritic spread not included explicitly in the integrate-and-fire description; Hestrin, Nicoll, Perkel, & Sah, 1990; Colquhoun, Jonas, & Sakmann, 1992; McBain & Dingledine, 1992), a prediction arising from the analysis is that the contribution of recurrent collaterals to the retrieval of a memory representation may take place in a relatively short time, over a few tens of milliseconds, independent of the prevailing firing rates and the membrane time constants, however defined, of the neurons in the population (Treves, Rolls, & Tovee, 1996). The analysis, however, has remained incomplete, because it describes only the modes close to steady state and not the full dynamics from an arbitrary initial state and because it is unable to tell to what extent each individual mode will be activated when the activity evolves from any initial state. These limitations can be overcome by computer simulations of the same network model considered by the analytical treatment.

A second aspect that has to be addressed by models that aim to be applicable to the real brain is that of the stability of the steady states that are taken to correspond to memory retrieval. As with any other steady state in the dynamics of a system of many units, there are many possible sources of instability. One example is the instability of the steady states in which the firing of different units is asynchronous, to synchronization among groups of units (Tsodyks, Mitcov, & Sompolinsky, 1993; Deppisch et al., 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995). A more basic potential instability arises out of the fact that the Hebbian modifiable connections that are thought to mediate associative memory in the brain are those between pyramidal excitatory cells. Therefore, a recurrent autoassociative memory is in itself a positive feedback circuit, and unless its activity can be tightly controlled by appropriate inhibition, it will tend to explode. Although the stability of realistic networks of excitatory and inhibitory units has been studied (Abbott, 1991), it was not in the context of autoassociative memories.

In this article, we show that in such networks there is a fundamental conflict between stability to excitatory explosion and storage capacity. In the next section, we show that the conflict can be avoided by inhibition that is predominantly multiplicative in nature. Then we return to the issue of the time scales for retrieval, with simulations that support and qualify the analytical predictions. The last section discusses
the implications of these results for the operations of associative memories in the brain. A brief report of this work appeared in Battaglia and Treves (1996).

2 The Stability-Capacity Conflict

A full analysis of the stability of asynchronous steady firing states must be carried out using appropriately detailed models, but the requirements for stability against excitatory runaway reverberations can be discussed using a simple two-variable model. In such a model, two variables, νE and νI, describe, respectively, the average firing rates of excitatory and inhibitory units, which approach their steady-state values with time constants τE and τI. The steady-state values are determined by these average firing rates and the level of afferent inputs. If we assume that, above threshold, the dependence is approximately linear, the dynamical system can be written (Wilson & Cowan, 1972):

τE ν̇E = −νE + JEE νE − JEI νI + νE^aff    (2.1)
τI ν̇I = −νI + JIE νE − JII νI + νI^aff,    (2.2)
where the J's are the dimensionless effective couplings (signs are chosen so that they are all positive in value) between the dynamical variables, as they emerge, essentially, from averaging synaptic strengths across pairs of active units, and νE^aff, νI^aff are constant terms, which depend on the afferent input activity and are proportional to fixed-point rates. They ensure that equilibrium rates are not zero, even if the network does not receive any input, reflecting the capability of the network to self-sustain its activity. If this system of equations has a fixed point, its stability requires that

Tr = (JEE − 1)/τE − (JII + 1)/τI < 0    (2.3)
Det = −(JEE − 1)(JII + 1) + JIE JEI > 0.    (2.4)
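These two conditions are easy to check numerically. The following minimal sketch evaluates them for one parameter setting; the coupling and time-constant values are our own illustrative choices, not values taken from the article.

```python
# Stability check for the two-variable model (conditions 2.3 and 2.4).
tauE, tauI = 10.0, 5.0                               # time constants (msec)
J = {"EE": 1.2, "EI": 2.0, "IE": 2.5, "II": 0.5}     # effective couplings

Tr = (J["EE"] - 1) / tauE - (J["II"] + 1) / tauI     # condition 2.3
Det = -(J["EE"] - 1) * (J["II"] + 1) + J["IE"] * J["EI"]  # condition 2.4

stable = (Tr < 0) and (Det > 0)
print(f"Tr = {Tr:.3f}, Det = {Det:.3f}, stable fixed point: {stable}")
```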
Both inequalities can be satisfied for arbitrary values of the mean excitatory-excitatory coupling among active units, JEE, provided inhibitory couplings are strong enough to control excitation. If, on the basis of this simple two-variable model, we want to ensure the stability of a real autoassociator, both inequalities must be satisfied with ample margins. The reason is that exactly which units are active will be highly variable, and therefore the effective value of JEE at any moment in time will fluctuate substantially. It is easy to realize, then, that for values of the four mean couplings much larger than 1, the determinant appearing in the second condition will be of the order of this large value squared. Now, the fixed-point firing rates are

νE^fp = [(JII + 1) νE^aff − JEI νI^aff] / Det    (2.5)
νI^fp = [JIE νE^aff − (JEE − 1) νI^aff] / Det,    (2.6)
which means that if the couplings are large, under conditions of robust stability the mean excitatory firing rate at the fixed point will be much lower than the one determined by afferent inputs alone, νE^fp ≪ νE^aff. However, this is incompatible with the effective operation of the network as content-addressable memory, since it makes recurrent processing minor with respect to the feedforward relay of the cue. In fact, when we tried to simulate memory retrieval with large couplings and at the same time insisted on the condition that local intrinsic inputs dominate over external afferent inputs (a condition intended to mimic the observed cortical anatomy; Abeles, 1991), we always ran into large oscillations (Simmen, Treves, & Rolls, 1996), due to even transient imbalances between local excitation and inhibition, which resulted in large fluctuations in the effective couplings, and prevented the network from reaching a steady retrieval state. Only by using as a cue the nearly complete memory pattern could we effect proper retrieval, but then recurrent connections played only a minor role. Therefore, to obtain robust, stable fixed points, we had to restrict ourselves to smaller effective couplings, in particular to values of JEE not much above 1. In that case, since the excitatory self-coupling always appears in the combination (JEE − 1), its potentially devastating influence on the stability of the fixed point will be reduced, and at the same time conditions will exist under which even small cues will be sufficient to initiate retrieval.

Keeping the excitatory self-coupling low, however, conflicts with ensuring a large storage capacity. Consider a simple autoassociator in which the weights of the connections among the units are determined by a linear sum of Hebbian-modification terms, as, for example, in the Hopfield model (Hopfield, 1982). If the units represent excitatory cells and the weights ultimately correspond to conductances, one may assume that such a memory structure is superimposed on a baseline connection weight that is large enough to keep positive even the individual weights that happen to undergo more negative modifications. (This assumption is made for the sake of clarity. In the simulations that follow, we use an equivalent formulation, although it is less transparent to the analysis.) Therefore, one may write for the weight between units i and j,

wij = w^0 + (1/C) Σ_{µ=1}^{p} (ηi^µ/⟨η⟩ − 1)(ηj^µ/⟨η⟩ − 1),    (2.7)

where ηi^µ is the firing rate of unit i in the µth memory pattern, ⟨η⟩ is the average firing rate, the network stores p patterns with equal strength, and C is the number of inputs per unit. The specific (covariance) form of the Hebbian term and the normalization factor are inessential to the argument that follows and were chosen for consistency with previous analyses (Treves & Rolls, 1991). The minimum connection weights will be those between pairs in which the pre- and postsynaptic unit happen to be anticorrelated across all patterns; that is, whenever one of the two is firing, for example, at a typical elevated rate η∗, the other is quiescent. Then the condition that ensures that the underlying conductance remains positive, even in such cases, reads

w^0 ≥ (p/C) (η∗/⟨η⟩).    (2.8)
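A minimal sketch of this weight construction follows, assuming binary patterns (rates 0 or η∗) drawn independently at sparseness a, with w0 set to the lower bound of condition 2.8; these statistics and sizes are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

N, C, p, a = 500, 500, 5, 0.1   # units, inputs per unit, patterns, sparseness
eta_star = 40.0                  # elevated rate in a pattern (illustrative, Hz)

# Binary patterns: each unit fires at eta_star with probability a.
eta = eta_star * rng.binomial(1, a, size=(p, N)).astype(float)
eta_mean = a * eta_star          # <eta> for these binary statistics

w0 = (p / C) * (eta_star / eta_mean)   # baseline from condition 2.8
hebb = eta / eta_mean - 1.0            # covariance factors of equation 2.7
W = w0 + (hebb.T @ hebb) / C           # equation 2.7, fully connected here

assert W.min() >= 0.0                  # underlying conductances stay positive
```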
On the other hand, the effective excitatory self-coupling—that is, the effect that the average excitatory firing rate exerts on each excitatory unit—is given by summing conductances across input lines and multiplying by the gain γ characterizing the unit's input-output transform in a linear range above firing threshold,

JEE = γ C w^0.    (2.9)
Note that the Hebbian terms average to zero when summing across the C inputs. Previous analyses (Treves, 1990; Treves & Rolls, 1991) have shown that for the network to be able to retrieve memory patterns, the gain has to be sufficiently strong, as expressed by the condition

γ ≥ a/(1 − a),    (2.10)

where 0 < a < 1 is the sparseness of the firing patterns, defined as a = ⟨η⟩²/⟨η²⟩ (Treves, 1990). Putting together now the condition that the effective excitatory self-coupling be at most of order 1 with the last three equations, one realizes why stability conflicts with storage capacity:

O(1) ≈ JEE = γ C w^0 ≥ p (η∗/⟨η⟩) · a/(1 − a);    (2.11)
that is, in this case, to be stable at retrieval, the network must not store more than a number of memory patterns,

pmax ≃ (⟨η⟩/η∗) · (1 − a)/a = O(1)!,    (2.12)
that is, more than a handful of patterns. In simulations that followed these very specifications, we found it difficult to obtain retrieval in nets storing more than two or three patterns, whatever their size. The conflict arises out of requiring simultaneously dynamical stability and effective retrieval
ability and biological plausibility (in that the memory is stored on the connections between excitatory units and in that each conductance must be a positive quantity). It does not arise in storage capacity analyses based on simplified formal models (Amit et al., 1987; Treves & Rolls, 1991) if one treats connection weights as real variables that can have either sign and can change in sign as more memories are stored.

Recurrent autoassociative memory models based on an alternative simple "learning rule"—the so-called Willshaw models (Willshaw et al., 1969)—although assuming only positive (or zero) weights among excitatory units, still suffer from similar limitations. That class of models, however, is more difficult to treat analytically (Golomb, Rubin, & Sompolinsky, 1990) and does not lend itself to such a simple discussion of the conflict; moreover, what is limited is not simply p, the number of memories that can be stored (which can be well above two or three; Amit & Brunel, 1997), but the total amount of information that can be stored and retrieved, which is proportional to p but also decreases the sparser the memory patterns are (and the more information need be provided with the cue).

3 Realistic Inhibition May Avoid the Conflict

A seemingly innocuous assumption that was made in writing equations 2.1 and 2.2 is that excitatory firing rates depend linearly not just on themselves but also, through a separate linear term, on inhibitory rates. This is equivalent to considering what is sometimes called subtractive inhibition. Purely subtractive inhibition is a convenient model for GABAB inhibition, which acts through K+ channels of limited total conductance, primarily by hyperpolarizing the receiving cell (Connors, Malenka, & Silva, 1988). If colocated on dendrites along with excitatory inputs, GABAB can be thought of as providing an additional term that is negative in sign and hence subtractive, and occurs on a slower time scale (Hablitz & Thalmann, 1987). GABAA inhibition, which is responsible for fast inhibitory control of the activity level of recurrent networks (Miles & Wong, 1987), is sometimes referred to as multiplicative (or, rather, divisive) in nature. This is because it acts via Cl− channels of relatively large total conductance (Connors et al., 1988) and an inversion potential not far below the resting potential; hence, its effect is more shunting than hyperpolarizing. If located on proximal dendritic branches or on the soma (Andersen, Eccles, & Loyning, 1964), it can be modeled to a first approximation as producing a division of the current resulting from more distal inputs (Abbott, 1991). Purely multiplicative inhibition acting on excitatory cells would lead one to replace equation 2.1 with

τE ν̇E = −νE + JEE(νI) νE + νE^aff,    (3.1)

that is, the excitatory self-coupling is now a function of the average firing
rate of inhibitory units (the second part of the equation can be modified as well, but this is irrelevant for the present discussion). To the extent that afferent inputs are absent or negligible, at the fixed point the self-coupling takes the value 1, thereby automatically ensuring stability, at least in the sense of equations 2.3 and 2.4 (since the terms in JEE − 1 disappear from the inequalities). Real inhibition, of course, is not purely multiplicative; however, the situation holding in this limit clarifies that under appropriate conditions (if inhibition is multiplicative to a sufficient degree), the stability of recurrent networks against runaway excitation is automatically guaranteed.

As for the upper limit on storage capacity, we have checked, by repeating previous analyses (Treves & Rolls, 1991) of recurrent associative memories of threshold-linear units with a gain γ now dependent on the average inhibitory rate, that exactly the same equations determine the storage capacity. Such a result stems from the fact that by acting on the gain, inhibition now keeps the effective JEE entering the stability analysis close to 1, but it leaves the capacity equations identical, as the analytical treatment shows. This confirms that the form of inhibition used has no effect on this absolute limit (a limit that with subtractive inhibition was far beyond what could be achieved in practice).

We have also carried out simulations of a simple network model with 3000 to 5000 threshold-linear units as used in the analytical calculation, at several sparseness values. We estimated storage capacity from the simulations by progressively increasing the memory load and determining the critical level at which no retrieval of any stored pattern was possible. Results are shown in Figure 1 and confirm the analytical prediction, which is the exact reproduction of previous analyses with subtractive inhibition (Treves & Rolls, 1991). Note that a value of the storage parameter α = 0.3, for example, corresponds to 900 stored patterns.

We have carried out simulations of a more detailed network model with spiking units and conductance-based synaptic action, to understand whether realistic inhibition still allows retrieval of more than two or three patterns (the limit we had on similar simulations with purely subtractive inhibition) and, having disposed of this limitation, to address anew, in a realistic context, the issue of the time scales for recurrent memory retrieval.
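For concreteness, one update sweep of a threshold-linear autoassociator with a divisively controlled gain might look as follows. The specific gain-control law (division by a term growing with overall activity, used as a stand-in for the average inhibitory rate) and all parameter names are our own assumptions, not the exact form used in the analytical calculation.

```python
import numpy as np

def sweep(v, W, g0=1.0, theta=0.0, k=0.05):
    """One synchronous update of threshold-linear rates v given weights W.

    The gain is divided by a term that grows with total activity, so that
    inhibition acts multiplicatively (divisively) rather than subtractively.
    """
    h = W @ v                              # recurrent input to each unit
    gain = g0 / (1.0 + k * v.sum())        # divisive control of the gain
    return np.maximum(gain * (h - theta), 0.0)   # threshold-linear transform
```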
4 Simulations Show Stability and Fast Retrieval

The simulated network consisted of Nex = 800 excitatory units and Nin = 200 inhibitory ones. Each integrate-and-fire unit represents a neuron as a single-branch, compartmented dendrite through which the cell receives all its input, and a pointlike soma, where spikes are generated. Though very simple, the compartmental model is still computationally demanding and severely limits the size of the network that we could implement on a Linux workstation. The current flowing from each compartment to the external
Figure 1: Simulation results for the capacity of a network of 3000 threshold-linear neurons (5000 for a = 0.05) are compared with the theoretical prediction (solid line) at different values of the sparseness a. The prediction arises from equations identical to those found by Treves (1990).
medium is written as

I(t) = gleak (V(t) − V^0) + Σ_j gj(t) (V(t) − Vj),    (4.1)

where gleak is a constant, passive leakage conductance, V^0 the membrane resting potential, gj(t) the value of the jth synapse conductance at time t, and Vj the reversal potential of the jth synapse. V(t) is the potential in the compartment at time t. Synaptic conductances have an exponentially decaying time behavior, obeying the equation

dgj/dt = −gj/τj + Δgj Σ_k δ(t − tk^j),    (4.2)

where τj is the synaptic decay time constant and Δgj is the amount the conductance is increased when the presynaptic unit fires a spike. Δgj thus represents the (unidirectional) coupling strength between the presynaptic and the postsynaptic cell. tk^j is the time at which the presynaptic unit fires its kth spike. For each time step of 1 ms, the cable equation for the dendrite is integrated (MacGregor, 1987) with a finer time resolution of 0.1 ms, and the somatic potential is compared with the spiking threshold V^thr. When this is exceeded, postsynaptic conductances are updated, and the somatic potential is reset to the after-hyperpolarization value V^ahp throughout the neuron.
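The conductance dynamics of equation 4.2 reduce to an exponential decay punctuated by jumps at presynaptic spikes. The sketch below integrates them with a simple forward-Euler step; the compartmental cable integration of the full model is omitted, and the time step is illustrative.

```python
import numpy as np

DT = 0.1e-3   # 0.1 msec integration step, as in the simulations (seconds)

def step_conductances(g, tau, dg, spiked):
    """One Euler step of equation 4.2 for an array of synapses.

    g, tau, dg : arrays over synapses (conductance, decay constant, jump);
    spiked : boolean array marking presynaptic units that fired this step.
    """
    g = g + DT * (-g / tau)     # exponential decay toward zero
    g = g + dg * spiked         # instantaneous increment Delta-g on a spike
    return g

def compartment_current(V, g_leak, V_rest, g, V_rev):
    # Equation 4.1: leakage plus synaptic driving forces in one compartment.
    return g_leak * (V - V_rest) + np.sum(g * (V - V_rev))
```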
Connections from excitatory to inhibitory, from inhibitory to excitatory, and between inhibitory units are taken to be homogeneous, that is, all of the same strength. Synaptic parameters depend only on the type of presynaptic and postsynaptic unit. The connectivity level is 0.25 between populations and 0.5 within the inhibitory population; that is, each unit synapses onto a fraction of the units of the receiving population, chosen at random. The excitatory units, in contrast, are all connected to each other. This very high connectivity, out of the actual anatomical range, is necessary, because of the small size of the simulated network, to produce sufficient statistical averaging in the synaptic input to each unit.

Excitatory-to-excitatory connections encode in their strength p memorized patterns of activity ηi^µ, consisting of binary words with sparseness (in this simple binary case, the fraction of 1s, or active cells in the pattern) a = 0.1. Encoding is implemented through a modified Hebb rule. In contrast with equation 2.7, which includes a baseline weight, all conductances are initially set to zero, and then, for each pattern, the synapse from the ith to the jth unit is modified by a covariance term,

Δg = (gEE/CEE) (ηi^µ/a − 1)(ηj^µ/a − 1).    (4.3)

If the conductance becomes negative, it is reset to zero. Memories are therefore stored through a "random walk with one reflecting barrier" procedure. The barrier acts as a "forgetting" mechanism (Parisi, 1986); whenever the conductance value bumps into the barrier, it loses memory of the previously presented patterns. Because there is no upper boundary, the average value of excitatory connection strengths grows with the number of memory items learned. The network is tested at low memory loading (p = 10). A systematic study of the storage capacity of the net would not be very meaningful because of the small size of the network.

The excitatory synapses impinge on the distal end compartment of the postsynaptic dendrite, and they have a positive reversal potential (referred to the resting membrane potential). Inhibitory synapses are distributed uniformly along the dendritic body, and they have a reversal potential equal to the resting membrane potential (except for the simulations in Figure 2). Inhibition is therefore predominantly shunting, with a geometry very similar to the one considered in Abbott (1991), leading to a mainly multiplicative effect on the postsynaptic firing rate. Table 1 summarizes the parameters used for the simulations.

Once the connection matrix is constructed, a test of the retrieval dynamics is performed according to the following protocol. The network is activated by injecting a current in a random fraction a = 0.1 of the units (see Figure 2A). The excitatory and the inhibitory population become diffusely active. Notice that units active in the memory pattern being tested are on average slightly more active than the other units.
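The "random walk with one reflecting barrier" storage of equation 4.3 can be sketched in a few lines. Patterns are assumed to be binary arrays; gEE, CEE, and a follow Table 1, and the per-pattern clipping at zero implements the reflecting barrier.

```python
import numpy as np

def store_patterns(patterns, g_EE=5e-8, C_EE=1.0, a=0.1):
    """Build excitatory-excitatory conductances per equation 4.3.

    patterns : array (p, n) of binary words with sparseness a.
    Conductances start at zero; after each pattern's covariance update,
    negative values are reset to zero (the reflecting barrier).
    """
    n = patterns.shape[1]
    g = np.zeros((n, n))
    for eta in patterns:                        # one memory at a time
        h = eta / a - 1.0
        g += (g_EE / C_EE) * np.outer(h, h)     # covariance term, eq. 4.3
        np.maximum(g, 0.0, out=g)               # clip at the barrier
    return g
```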
Figure 2: Firing rates computed with a time window of 30 msec are plotted for excitatory units for different geometries and reversal potential VI. Units are divided between the 1 population (upper trace), active in the recalled memory, and the 0 population (lower trace), that was silent in the recalled memory. (A) VI = 0 mV with respect to membrane equilibrium potential and inhibitory synapses are distributed along the dendritic body. In this condition, inhibition acts to some extent multiplicatively on the firing rate. Efficient retrieval of the memory is shown by sustained activity in the 1 population and complete activity suppression in the 0 population after the cue has been removed. (B–F) Inhibitory synapses are located on the edge of the dendritic cable. Reversal potential VI is 0 mV (with respect to equilibrium) (B), −10 mV (C), −20 mV (D), −30 mV (E), and −40 mV (F). Whatever the reversal potential, the two populations are never satisfactorily discriminated.
Table 1: Parameters Used for Integrate-and-Fire Simulations.

Quantity                                                Symbol     Value
Number of excitatory cells                              NE         800
Number of inhibitory cells                              NI         200
Corruption level                                        ρ          0.3
Random activity period                                  tinit      100 (msec)
Cue period                                              tcue       300 (msec)
Retrieval period                                        tretr      200 (msec)
Sampling time window                                    twin       30 (msec)
Number of dendritic compartments                        Ncmp       10
Dendritic compartment leakage conductance               Gd0        6.28 × 10−12 (S)
Somatic compartment leakage conductance                 Gs0        5 × 10−9 (S)
Dendritic-dendritic axial conductance                   Gdd        2.25 × 10−7 (S)
Excitatory somatic capacitance                          Csoma,E    0.5–4 × 10−10 (F)
Inhibitory somatic capacitance                          Csoma,I    5 × 10−12 (F)
Cue current                                             Icue       0.25 (nA)
Firing threshold potential (excitatory)                 ΘE         32 (mV)
Firing threshold potential (inhibitory)                 ΘI         25 (mV)
After-spike hyperpolarization potential                 Vahp       −15 (mV)
Excitatory-excitatory connectivity level                CEE        1
Excitatory-inhibitory connectivity level                CEI        0.25
Inhibitory-excitatory connectivity level                CIE        0.25
Inhibitory-inhibitory connectivity level                CII        0.5
"Unitary" excitatory-excitatory synaptic
  conductance (see equation 4.3)                        gEE        5 × 10−8 (S)*
Excitatory-inhibitory synaptic conductance              gEI        4 × 10−9 (S)
Inhibitory-excitatory synaptic conductance              gIE        2 × 10−8 (S)
Inhibitory-inhibitory synaptic conductance              gII        9 × 10−10 (S)
Excitatory synaptic equilibrium (reversal) potential    VE         65 (mV)
Inhibitory synaptic equilibrium (reversal) potential    VI         0 (mV)
Excitatory-excitatory synaptic time constant            τEE        5–40 (msec)
Excitatory-inhibitory synaptic time constant            τEI        1 (msec)
Inhibitory-excitatory synaptic time constant            τIE        1 (msec)
Inhibitory-inhibitory synaptic time constant            τII        1 (msec)

Note: Ranges are indicated for quantities that varied within runs. Potential values are referred to membrane resting potential.
* Scaled when the synaptic time constant is varied, to preserve the total charge transmitted during a synaptic event (see text). The value given is the one used for τEE = 20 (msec). Simulations in Figure 2 are an exception as concerns inhibitory reversal potential (see the figure caption) and tretr, which is set at 500 (msec).
This is explained by the fact that they have on average a slightly stronger excitatory input, because the memory being tested contributes a positive term in the random walk construction of the connection strengths. Since p is not too large, even a single term makes a difference (Amit & Brunel, 1997).

After 100 msec, the random current is replaced with a cue current, injected in a fraction a + ρ(1 − a) of the units active in the pattern being tested and in a fraction a(1 − ρ) of the units inactive in the pattern. In this way, the cue is again a binary word with sparseness a = 0.1, and ρ is the average correlation between pattern and cue, which in the runs shown in the figures was set at ρ = 0.3. The cue current lasts for 300 msec. The average firing rate for the 1 units is much higher than for the 0 ones. When the cue current is removed, the 1 units sag briefly but then recover and stay steadily active, while activity in the others decays to zero or to a very low firing level. The memory pattern has therefore been successfully retrieved.

To test the specific effect produced by the type of inhibition, we performed the stepwise manipulation shown in Figure 2. First (see panel B), all inhibitory connections to excitatory cells were moved to the end of the dendritic tree, colocalized with excitatory inputs. This made them somewhat less "multiplicative," and also weaker. The result is that inhibition becomes unable to suppress the firing of excitatory units that should be quiescent, and the network fails to retrieve correctly (the residual difference between 1 and 0 units being due to the finite-p effect). To make inhibition stronger again while maintaining its subtractive character, the equilibrium potential of inhibitory synapses was lowered in panels C through F in steps of 10 mV. The result is that inhibition tends to suppress activity across excitatory units, without ever allowing the retrieval state to reemerge after removing the cue. This manipulation then indicates that altering the form of inhibition makes the network cross its capacity limit. Since even the first form, with the inputs spread along the dendritic tree, is far from being purely multiplicative, this capacity limit is well below the upper limit predicted by nondynamical calculations.

The simulations were repeated varying the neural and synaptic parameters—the excitatory synaptic time constant (changing at the same time the synaptic conductance to keep the strength of the connection unchanged) and the somatic capacitance—in order to vary the firing rate. The inhibitory synaptic time constant was kept smaller than the excitatory time constant in order to speed up the stabilizing effect of recurrent inhibition.

To assess the quality of retrieval, we have taken the same information-theoretical measure used when recording from behaving animals (as opposed to immobilized ones, for example) (Rolls, Treves, & Tovee, 1997; Treves, Skaggs, & Barnes, 1996). The retrieval protocol is repeated for up to 30 trials for each stored memory. Ten randomly selected excitatory units are "recorded," that is, sampled for the number of spikes they fire in a time window of 30 msec. The window slides with a step of 5 msec, spanning the entire simulated time course.
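The cue construction described earlier in this protocol amounts to resampling each unit's cue bit with the two fractions above; one can verify that the expected sparseness of the cue equals a. A minimal sketch (function name and seed are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def make_cue(pattern, a=0.1, rho=0.3):
    """Binary cue with sparseness a and average correlation rho with pattern.

    Units active in the pattern receive the cue with probability
    a + rho(1 - a); inactive units with probability a(1 - rho), so the
    overall fraction of cued units is a on average.
    """
    cue = np.zeros_like(pattern)
    on = pattern == 1
    cue[on] = rng.random(on.sum()) < a + rho * (1 - a)
    cue[~on] = rng.random((~on).sum()) < a * (1 - rho)
    return cue.astype(int)
```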
Figure 3: Information time course for different values of the synaptic time constant. The transient corresponding to cue onset is well fitted by an exponential function. The rise is faster with shorter synaptic time constants.
The firing rate vector thus constructed at any time step of each trial is then decoded. This is done (Rolls et al., 1997) by matching it with the p = 10 mean firing rate vectors produced at the same time step when testing the retrieval of each of the memories and finding the closest match. The result of decoding all the trials is a probability table P(s′|s) containing the likelihood that when testing for memory s, the activity of the sample of units was decoded as matching the average vector from pattern s′. The mutual information between the actual and decoded pattern,

I(s, s′) = Σ_s (1/p) Σ_{s′} P(s′|s) log2 [P(s′|s)/P(s′)],    (4.4)
was calculated and then corrected for limited sampling (Treves & Panzeri, 1995; Panzeri & Treves, 1996). To reduce fluctuations, results were averaged at each time step over a number of samples of recorded units from the same run. The resulting quantity is a sensitive measure of how well the activity of the network in the time window can be used to discriminate which cue was presented and, unlike simpler measures (such as the correlation of the firing vector with the underlying memory pattern), can be used with identical procedures in simulations and recording experiments.
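A minimal sketch of this decoding-and-information computation follows. Decoding by nearest mean firing-rate vector uses squared Euclidean distance as an assumed matching criterion, a uniform prior over memories is assumed, and the limited-sampling correction cited above is omitted.

```python
import numpy as np

def decoded_information(trial_vectors, mean_vectors):
    """Mutual information (equation 4.4) between actual and decoded pattern.

    trial_vectors : list of arrays (n_trials, n_units), one per tested memory;
    mean_vectors  : array (p, n_units) of mean rate vectors per memory.
    """
    p = mean_vectors.shape[0]
    conf = np.zeros((p, p))                 # rows: actual s, cols: decoded s'
    for s, trials in enumerate(trial_vectors):
        d = ((trials[:, None, :] - mean_vectors[None]) ** 2).sum(-1)
        for s_dec in np.argmin(d, axis=1):  # closest mean vector per trial
            conf[s, s_dec] += 1
    P_cond = conf / conf.sum(axis=1, keepdims=True)   # P(s'|s)
    P_dec = P_cond.mean(axis=0)                        # P(s') with P(s) = 1/p
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = P_cond * np.log2(P_cond / P_dec)
    return np.nansum(terms) / p                        # equation 4.4
```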
Figure 4: Transient time constant plotted against excitatory synaptic time constants. The firing rates were the same in each case, since conductance values were rescaled in order to equalize the charge entering into the cell through the synapse. The best linear fit line is shown. The slope of the fitted line is 2.538.
rises steadily to an equilibrium value, which depends on the correlation between the cue and the pattern, with a time course well fitted by a saturating exponential. This appears to be consistent with the linearized analysis of transients (Treves, 1993) and indicates that the transient modes activated in this condition belong to a single family; they share the same real part of the time constant. The time constant from the exponential fit is in a close-to-linear relationship with the synaptic (inactivation) time constant, as shown in Figure 4, with a best-fit proportionality coefficient of 2.538. Varying the firing rate does not appear to have a comparable effect on the transient time constant. Figure 5 plots the transient time constant for different values of somatic capacitance, corresponding to firing rates ranging from ∼15 to ∼100 Hz. When the cue is removed, the information rises again very rapidly to a higher equilibrium value, as the network is no longer constrained by the noisy cue, indicating that the network acts as an "error corrector" during this later phase. The second transient is very rapid indeed, and it is in fact masked by an artifact induced by the finite size of the time window used to measure information (during the window, the measure actually reflects a weighted sum of the lower value before cue removal and the higher value that is reached in a very short time). In fact, if one shrinks the sample window size, this linear rise shortens correspondingly (not shown).
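The saturating-exponential fit used to extract the transient time constant can be sketched as follows; the fit function mirrors the description in the text, while the SciPy routine, the initial guesses, and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def saturating_exponential(t, i_inf, tau_tr, t0):
    """I(t) = i_inf * (1 - exp(-(t - t0)/tau_tr)), clipped to zero before t0."""
    return i_inf * np.clip(1.0 - np.exp(-(t - t0) / tau_tr), 0.0, None)

def transient_time_constant(t_ms, info_bits, cue_onset_ms=100.0):
    """Fit the post-onset rise of the information curve and return tau_tr,
    to be compared with ~2.5 times the synaptic time constant."""
    mask = t_ms >= cue_onset_ms
    p0 = (info_bits.max(), 20.0, cue_onset_ms)       # rough initial guesses
    params, _ = curve_fit(saturating_exponential,
                          t_ms[mask], info_bits[mask], p0=p0)
    return params[1]
```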
Figure 5: Transient time constant plotted for different values of somatic capacitance. Firing rates during the cue phase ranged correspondingly from 15 to 100 Hz. No clear dependence of the information time course is apparent when firing rates are varied in this way.
Although the actual time structure of this transient is still to be clarified, it seems clear that it follows a very different mode in this path to equilibrium. The final approach to the retrieval attractor is thus essentially immediate. Finally, in Figure 6 we show the information behavior of the network when the excitatory collaterals are made informationless, or memoryless, by giving them all the same strength. A finite, small amount of information is seen in the cue phase only, at a much lower level than for the structured network, and it falls to zero as the cue is removed. This demonstrates that selective activity, and in particular the capability of this network to retrieve memory patterns, depends crucially on the information encoded in its collaterals.

5 Implications for Recurrent Processing in the Brain

The more effective control that shunting inhibition may exert on runaway recurrent excitation, compared with subtractive inhibition, is an intuitive principle that has informed direct experimental studies (Miles & Wong, 1987). What has been shown here is how shunting inhibition may help avoid a specific conflict between stability and extensive memory storage
Figure 6: Information time course plotted for the structureless network compared with the time course for the network structured as in previous figures. During the cue phase, information reaches just a fraction of the steady-state value in the structured case. After the cue is removed, information decays to zero, reflecting the absence of self-sustained activity.
that would otherwise prevent the applicability of the abstract concept of a recurrent autoassociator to actual recurrent networks in the brain. An attempt to demonstrate the large conductance changes that may underlie shunting inhibition (Douglas & Martin, 1991) has not confirmed the expectation; however, it is unclear to what extent the preparation used (the striate cortex of anesthetized cats) is relevant to the conditions we considered, of massively reverberating excitation. Having ensured the possibility of stable, asynchronous firing attractor states, simulations of a model network with spiking units and synaptic conductances have been used to confirm and extend earlier analytical results on the time required for memory retrieval mediated by recurrent processing. The time course of the initial approach to the attractor state is, as in the analytical treatment, a saturating exponential, or a mixture of exponentially relaxing transient modes with a similar (real part of the) time constant. This retrieval time constant is a linear function of the time constant for the inactivation of excitatory synaptic conductances and depends only mildly on prevailing firing rates or on neuronal time scales (as determined, for example, by membrane capacitance). In practice, the contribution of recurrent processing, in this particular instance of an autoassociator, can be dominant within a few tens of milliseconds (with the parameters of Figure 3, within 2.5 times the synaptic time
constant, which can be thought of as being in the 10 msec range; Colquhoun et al., 1992). This leads to the conclusion that at least local recurrent processing can be fast and that it is wrong to exclude its relevance in cases in which neuronal activity is found to acquire its selectivity within a few tens of milliseconds of its onset (Thorpe & Imbert, 1989; Treves, Rolls, & Tovee, 1996). This result lends credibility to the hypothesis that recurrent autoassociation may be a ubiquitous function of local recurrent circuits throughout neocortex, as well as possibly the main function of recurrent connections in the hippocampal CA3 region (Treves & Rolls, 1991, 1994). At the same time, it raises the possibility of directly manipulating the time for such a function to be executed by acting on the inactivation kinetics of synaptic AMPA channels.

Acknowledgments

We are grateful to Mayank Mehta, Lorenzo Cangiano, and Martin Simmen, who participated in early phases of this project, and to Edmund Rolls, Stefano Panzeri, Carl van Vreeswijk, and Mike Hasselmo for extensive discussions. Partial support came from EC HCM contract CHRXCT930245 and CNR contribution 96.00287.CT02.

References

Abbott, L. F. (1991). Realistic synaptic input for model neural networks. Network, 2, 245–258.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amit, D. J. (1995). The Hebbian paradigm reintegrated: Local reverberation as internal representation. Behav. Brain Sci., 18, 617–657.
Amit, D. J., & Brunel, N. (1997). Global spontaneous activity and local structured (learned) delay period activity in cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys. (N.Y.), 173, 30–67.
Andersen, P., Eccles, J. C., & Loyning, Y. (1964). Location of postsynaptic inhibitory synapses on hippocampal pyramids. J. Neurophysiol., 27, 592–607.
Battaglia, F. P., & Treves, A. (1996). Information dynamics in associative memories with spiking neurons. Society for Neuroscience Abstracts, 22, 1124.
Colquhoun, D., Jonas, P., & Sakmann, B. (1992). Action of brief pulses of glutamate on AMPA/kainate receptors in patches from different neurones of rat hippocampal slices. J. Physiol., 458, 261–287.
Connors, B. W., Malenka, R. C., & Silva, L. R. (1988). Two inhibitory post-synaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol., 406, 443–468.
Deppisch, J., Bauer, H.-U., Schillen, T., König, P., Pawelzik, K., & Geisel, T. (1993).
Alternating oscillators and stochastic states in a network of spiking neurons. Network, 4, 243–257.
Domany, E., Kinzel, W., & Meir, R. (1989). Layered neural networks. J. Phys. A, 22, 2081–2102.
Douglas, R. J., & Martin, K. A. C. (1991). A functional microcircuit for cat visual cortex. J. Physiol. (London), 440, 735–769.
Eccles, J. C. (1957). The physiology of nerve cells. Baltimore: Johns Hopkins University Press.
Eccles, J. C. (1964). The physiology of synapses. New York: Academic Press.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Hablitz, J. J., & Thalmann, R. H. (1987). Conductance changes underlying a late synaptic hyperpolarization in hippocampal CA3 neurons. J. Neurophysiol., 58, 160–179.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neur. Comp., 7, 307–337.
Hebb, D. O. (1949). The organization of behaviour. New York: Wiley.
Hestrin, S., Nicoll, R. A., Perkel, D. J., & Sah, P. (1990). Analysis of excitatory synaptic action in pyramidal cells using whole-cell recording from rat hippocampal slices. J. Physiol., 422, 203–225.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Kohonen, T. (1977). Associative memory. Berlin: Springer-Verlag.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635.
Little, W. A. (1974). The existence of persistent states in the brain. Math. Biosci., 19, 101–120.
MacGregor, R. J. (1987). Neural and brain modeling. San Diego: Academic Press.
Marr, D. (1971). Simple memory: A theory for the archicortex. Phil. Trans. Roy. Soc. (London) B, 262, 24–81.
McBain, C., & Dingledine, R. (1992). Dual-component miniature excitatory synaptic currents in rat hippocampal CA3 pyramidal neurons. J. Neurophysiol., 68, 16–27.
Miles, R., & Wong, R. K. S. (1987). Inhibitory control of local excitatory circuits in the guinea-pig hippocampus. J. Physiol., 338, 611–629.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Parisi, G. (1986). A memory which forgets. J. Phys. A, 19, L617.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Exp. Brain Res., 114, 149–162.
Simmen, M. W., Treves, A., & Rolls, E. T. (1996). On the dynamics of a network of spiking neurons. In F. H. Eekman & J. M. Bower (Eds.), Computations and neuronal systems: Proceedings of CNS95. Boston: Kluwer.
Thorpe, S. J., & Imbert, M. (1989). Biological constraints on connectionist models. In R. Pfeifer, Z. Schreter, & F. Fogelman-Soulié (Eds.), Connectionism in perspective (pp. 63–92). Amsterdam: Elsevier.
Treves, A. (1990). Graded-response neurons and information encodings in autoassociative memories. Phys. Rev. A, 42, 2418–2430.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neur. Comp., 7, 399–407.
Treves, A., & Rolls, E. T. (1991). What determines the capacity of auto-associative memories in the brain? Network, 2, 371–397.
Treves, A., & Rolls, E. T. (1994). Computational analysis of the role of the hippocampus in memory. Hippocampus, 4, 374–391.
Treves, A., Rolls, E. T., & Tovee, M. J. (1996). In V. Torre & F. Conti (Eds.), Neurobiology: Proceedings of the International School of Biophysics, XXIII Course, May 1995 (pp. 371–382). New York: Plenum Press.
Treves, A., Skaggs, W. E., & Barnes, C. A. (1996). How much of the hippocampus can be explained by functional constraints? Hippocampus, 6, 666–674.
Tsodyks, M. V., & Feigelman, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhys. Lett., 46, 101.
Tsodyks, M. V., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
van Vreeswijk, C. A., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Nonholographic associative memory. Nature, 222, 960–962.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1.

Received January 3, 1997; accepted June 25, 1997.
Communicated by James Reggia
Synaptic Runaway in Associative Networks and the Pathogenesis of Schizophrenia Asnat Greenstein-Messica School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, Tel Aviv 69978, Israel
Eytan Ruppin Departments of Computer Science and Physiology, Faculties of Exact Sciences and Medicine, Tel-Aviv University, Tel Aviv 69978, Israel
Synaptic runaway denotes the formation of erroneous synapses and premature functional decline accompanying activity-dependent learning in neural networks. This work studies synaptic runaway both analytically and numerically in binary-firing associative memory networks. It turns out that synaptic runaway is of fairly moderate magnitude in these networks under normal, baseline conditions. However, it may become extensive if the threshold for Hebbian learning is reduced. These findings are combined with recent evidence for arrested N-methyl-D-aspartate (NMDA) maturation in schizophrenics, to formulate a new hypothesis concerning the pathogenesis of schizophrenic psychotic symptoms in neural terms.

1 Introduction

Learning in associative neural networks has generated a considerable amount of interest as a possible model of memory and learning in the brain (Dong & Hopfield, 1992; Hasselmo, Anderson, & Bower, 1992; Amit & Fusi, 1992; Hasselmo, 1993; Amit & Fusi, 1994). During learning, previously stored patterns may interfere with the learning of newly introduced ones. This may lead to the formation of erroneous synaptic connections and a marked degradation of memory retrieval performance, termed synaptic runaway (SR). Hasselmo and his coworkers, who coined the term synaptic runaway and were the first to point to its possible importance, have suggested that it is a significant problem in the normal functioning state of associative memory networks (Hasselmo, 1993). Searching for a mechanism that may counteract the occurrence of SR, they observed that acetylcholine suppresses synaptic transmission in intrinsic, intramodular fibers (i.e., connections between neurons belonging to the same cortical subnetwork), while having little effect on synaptic potentials elicited by extramodular fibers (long-range cortical
Neural Computation 10, 451–465 (1998)
© 1998 Massachusetts Institute of Technology
connections) (Hasselmo et al., 1992). This selective suppression may prevent the interference between memories and inhibit the resulting synaptic runaway (Hasselmo, 1993; Hasselmo et al., 1992). Moreover, they suggested that the disruption of this normally occurring synaptic suppression (due to cholinergic pathological alterations) or other alterations that may overtly increase the plasticity of internal synapses may play an important role in the initiation and progression of Alzheimer’s disease (Hasselmo, 1994). This article presents a new hypothesis concerning the role of synaptic runaway in the pathogenesis of schizophrenia. This hypothesis is based on two main findings. First, in previous work (Ruppin, Reggia, & Horn, 1996), we have studied associative memory networks that store new memory patterns via activity-dependent synaptic changes. We have shown that an overt increase in synaptic plasticity may eventually lead to the loss of the network’s memory retrieval capacities and to the formation of a pathological, global mixed-state attractor, a possible neural correlate of schizophrenic symptoms (Ruppin et al., 1996; Hoffman & McGlashan, 1993). Second, molecular studies have identified several NMDA receptor subtypes in the brain that are likely to regulate neural learning and plasticity (Bliss & Collingridge, 1993). During normal development, NMDA receptors undergo a process of maturation, where earlier subtypes, which provide greater plasticity, are replaced by less malleable forms, thereby stabilizing the mature synaptic connections (Johnston, 1995). A recent study measuring the messenger RNA (mRNA) levels of NMDA receptor subtypes in schizophrenic and normal brains (Akbarian et al., 1996) has found a 50% increase in the levels of mRNA for the immature NR2D receptor subtype in the prefrontal cortex of schizophrenic brains compared with controls. This finding strongly raises the possibility that NMDA receptor–delayed–maturation (NRDM) occurs in schizophrenia, but leaves open its actual causal pathogenic role. Bearing these results in mind, we put forward a new hypothesis about the pathogenesis of schizophrenia: Delayed maturation of NMDA receptors results in synaptic runaway; the latter causes the formation of pathological attractors that underlie some schizophrenic symptoms. To substantiate our hypothesis, this article presents a rigorous analysis of SR in associative memory networks and estimates its dependence on various factors. We find that the magnitude of SR is very sensitive to a synaptic modification threshold that governs synaptic plasticity, supporting the idea that immature NMDA receptors may indeed play a primary role in the pathogenesis of schizophrenia. The rest of this article is organized as follows. The models studied are presented in section 2. The development of synaptic runaway is analyzed in section 3 and studied numerically in section 4. Our results and their relevance to schizophrenia are discussed in section 5. A detailed presentation of the analysis is provided in the appendix.
2 The Model In the strict sense, the term synaptic runaway refers to the growth of erroneous synapses that may accompany activity-dependent memory storage. In a broader sense, we are interested in the functional implications of SR—that is, in quantifying its effects on the network’s memory performance. To this end, we distinguish between explicit and activity-dependent memory capacities. Explicit memory capacity is the network’s memory capacity limit while memories are stored without any interference (e.g., if the synaptic memory matrix is explicitly constructed from the stored patterns in accordance with a Hebbian law). Activity-dependent memory capacity is the network memory capacity if activity-dependent learning is employed and, consequently, SR may take place. Our goal is to compare the two, study the difference between them as a functional measure of SR extent, and investigate the factors that influence SR. We consider an associative neural network composed of N binary firing neurons, characterized by low coding level f (the fraction of firing neurons in the memorized patterns), and global inhibition H. We first study the case of the simple Willshaw model of associative memory with binary synapses (Willshaw, Buneman, & Longuet-Higgins, 1969; Golomb, Rubin, & Sompolinsky, 1989), and then the case of associative memory networks with unbounded Hebbian synapses (Tsodyks, 1989), where synaptic runaway is manifested by both the formation and the continuing growth of erroneous synapses. This enables us to cover the scope of generic binary-firing models and learn about the sensitivity of bounded versus unbounded synaptic models to the development of SR. 2.1 Associative Memory with Binary Synapses. The network state S evolves according to
$$S_i(t+1) = \Theta\!\left( h_i(t) - \frac{H}{Nf} \sum_{j \neq i}^{N} S_j(t) + I_i(t) \right), \tag{2.1}$$

$$h_i(t) = \frac{1}{Nf} \sum_{j \neq i}^{N} J_{ij}(t-1)\, S_j(t), \tag{2.2}$$
where S_i stands for the firing state of neuron i (0 or 1), Θ is Heaviside's step function, and h_i(t) is its local field (membrane potential). Here J_ij(t) is the synaptic weight projecting from the jth to the ith neuron, and I_i(t) = a · η_i^µ is an external stimulus projecting on the ith neuron, presenting memory pattern η^µ. To study explicit memory capacity, we sequentially stored a set of M patterns in the network in an explicit-learning manner. In each iteration, a
memory η^µ is stored by explicitly updating the synaptic matrix in accordance with

$$J_{ij}(t) = \Theta\!\left[ J_{ij}(t-1) + \eta_i^{\mu} \eta_j^{\mu} \right]. \tag{2.3}$$

Retrieval performance is measured by initializing the network with its stored patterns and examining their stability. At some memory storage level, the limit of network capacity is reached, and its performance degrades markedly. To study the extent and functional effects of synaptic runaway, the same procedure is repeated, but the synaptic matrix is updated in an activity-dependent learning manner, that is, replacing equation 2.3 with

$$J_{ij}(t) = \Theta\!\left[ J_{ij}(t-1) + \Delta J_{ij}(t) \right], \tag{2.4}$$

where

$$\Delta J_{ij}(t) = \epsilon \cdot \Theta\!\left( h_i(t) - \frac{L}{Nf} \sum_{j \neq i}^{N} S_j(t) + I_i(t) \right) S_j(t). \tag{2.5}$$
Equations 2.4 and 2.5 model in a simplified manner Hebbian long-term potentiation carried out by voltage-dependent NMDA activation. In accordance, L is the postsynaptic threshold of synaptic modification, and ε determines the learning strength. The extent of synaptic runaway is defined as the fraction of erroneously generated synapses, determined by comparing the synaptic matrix obtained by activity-dependent learning to the reference synaptic matrix obtained via explicit learning. Its functional consequences are determined by measuring the difference between explicit and activity-dependent memory capacities. This difference, normalized to the explicit memory capacity, is termed the capacity degradation measure.

2.2 Associative Memory with Unbounded Hebbian Synapses. We also studied synaptic runaway in an associative neural network with unbounded Hebbian weights (Tsodyks, 1989). The model equations are identical to those in the bounded case, except that now

$$J_{ij}(t) = J_{ij}(t-1) + \eta_i^{\mu} \eta_j^{\mu} \tag{2.6}$$

in the explicit-learning scenario, and

$$J_{ij}(t) = J_{ij}(t-1) + \Delta J_{ij}(t), \tag{2.7}$$

with

$$\Delta J_{ij}(t) = \epsilon \cdot \Theta\!\left( h_i(t) - \frac{L + Mf^2}{Nf} \sum_{j \neq i}^{N} S_j(t) + I_i(t) \right) S_j(t), \tag{2.8}$$

in the activity-dependent learning scenario.
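To make the update rules concrete, here is a minimal Python sketch of equations 2.1 through 2.5 for the binary-synapse network. The vectorized form, the function names, and the convention of keeping a zero diagonal (implementing the j ≠ i sums) are illustrative assumptions.

```python
import numpy as np

def local_fields(J, S, f):
    """h_i of equation 2.2; J is assumed to carry a zero diagonal."""
    N = len(S)
    return (J @ S) / (N * f)

def update_state(J, S, I_ext, f, H):
    """One synchronous step of equation 2.1 (Theta = Heaviside function)."""
    N = len(S)
    inhib = H * (S.sum() - S) / (N * f)          # global inhibition, j != i
    return (local_fields(J, S, f) - inhib + I_ext > 0).astype(float)

def activity_dependent_step(J, S, I_ext, f, L, eps=1.0):
    """Equations 2.4-2.5: J_ij is potentiated when the postsynaptic
    'learning gate' (threshold L instead of H) opens and unit j fires;
    the outer Theta of equation 2.4 keeps the weights binary."""
    N = len(S)
    inhib = L * (S.sum() - S) / (N * f)
    gate = (local_fields(J, S, f) - inhib + I_ext > 0).astype(float)
    J_new = ((J + eps * np.outer(gate, S)) > 0).astype(float)
    np.fill_diagonal(J_new, 0.0)
    return J_new
```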
3 Analysis

P_err, the probability for erroneous activation of a background neuron, is the main source of noise in the network, leading to both memory deterioration and synaptic runaway. In the binary synapses case, conventional signal-to-noise analysis yields

$$P_{err} = \frac{1-f}{2} \left[ 1 - \operatorname{erf}\!\left( \frac{H - \langle R \rangle}{\sqrt{2}\,\sigma_R} \right) \right], \tag{3.1}$$

where

$$\langle R \rangle = 1 - \gamma \quad \text{and} \quad \sigma_R^2 = \frac{1-\gamma}{Nf} + (\delta - \gamma^2)$$

are the mean and variance of the local field of a background neuron,

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-u^2)\, du$$

denotes the error function, γ is the probability that a synapse remains zero after storing M memories, and δ is the probability that two different synapses of the same neuron both remain zero. The values for γ and δ are different in the explicit-learning and activity-dependent learning scenarios. In explicit learning,

$$\gamma = (1 - f^2)^M \tag{3.2}$$

and

$$\delta = (1 - 2f^2 + f^3)^M, \tag{3.3}$$

whereas in activity-dependent learning

$$\gamma = \prod_{\mu=1}^{M} \left( 1 - \left[ f^2 + (1-f)^2 P_{err}(\mu) P_l(\mu) + (1-f) f \left( P_{err}(\mu) + P_l(\mu) \right) \right] \right) \tag{3.4}$$

and

$$\delta = \prod_{\mu=1}^{M} \left( (1-f)\left[1 - P_l(\mu)\right] + f (1-f)^2 \left[1 - P_{err}(\mu)\right]^2 \right). \tag{3.5}$$
Note that in activity-dependent learning, we must also consider P_l, the probability that the input field of the neuron will be high enough to allow erroneous synaptic modification. P_l is obtained using equation 3.1, replacing
Figure 1: The fraction of erroneously firing neurons (P_err) and the fraction of erroneous synapses are plotted versus M, the number of memories stored in the network. (a) Analytical results. (b) Simulation results. In both parts of the figure, the dot-dashed and dashed lines describe P_err for the explicit-learning and activity-dependent learning scenarios, respectively. The dotted line describes the fraction of erroneous synapses formed as a result of synaptic runaway. The horizontal line denotes the error criterion for determining the network's capacity (P_err = 0.05).
the inhibition strength H with the learning threshold L. P_err and P_l depend on γ and δ (see equation 3.1), while γ and δ in turn depend on P_err and P_l (see equations 3.4 and 3.5). These four coupled equations should hence be solved iteratively, starting from M = 1 and increasing M gradually. The derivation of equations 3.2 through 3.5 is given in the appendix. The analysis of the unbounded Hebbian case follows a similar pathway to the binary synaptic case and is omitted.

4 Results

Numerical simulations were performed for a 1000-neuron neural network, described by equations 2.1 through 2.5, with coding level f = 0.05, inhibition strength H = 0.9 (near-optimal capacity), and learning threshold equal to the firing threshold (L = H). Figure 1 depicts how SR develops as we gradually store more memories in the network (starting from a tabula rasa synaptic matrix). In each trial, for both the explicit-learning and activity-dependent learning scenarios, the fraction of erroneously firing neurons
Figure 2: Capacity degradation due to synaptic runaway as a function of coding level and inhibition strength. (a) Binary weights neural network. (b) Hebbian weights neural network. As in Figure 1, the error criterion is P_err = 0.05.
(equivalent to P_err calculated analytically) and the fraction of erroneous (runaway) synapses were determined after 10 steps of synchronous updating. Defining a level of P_err = 0.05 as an arbitrarily chosen error criterion determining the network's capacity, the capacity degradation measure is 0.2; that is, the memory capacity in the activity-dependent learning case is about 20% less than the capacity in the explicit-learning case, and the degradation is thus fairly mild. The dependence of capacity degradation on the coding level and inhibition values is depicted in Figure 2. In the binary synapses case in Figure 2a, capacity degradation is increased in the regime of low coding level. Even there, maximum capacity degradation does not exceed 20%. In the unbounded model in Figure 2b, the influence of synaptic runaway is more significant, but maximal capacity degradation does not exceed 40%. As before, capacity degradation is higher in the regime of low coding level. The analytical calculations were also used to estimate the dependence of capacity degradation on network size, plotted in Figure 3a. Although capacity degradation decreases with network size in the binary case, it mildly increases in the unbounded case.
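The iterative solution of the coupled equations 3.1 and 3.4-3.5 described in section 3 can be sketched in a few lines of Python; the numerical guard on the variance and all names are illustrative assumptions.

```python
import numpy as np
from scipy.special import erf

def p_cross(thresh, gamma, delta, N, f):
    """Equation 3.1: probability that a background neuron exceeds `thresh`
    (H gives P_err; L gives P_l).  The variance is floored to avoid a
    division by zero in the empty-matrix limit gamma = delta = 1."""
    mean = 1.0 - gamma
    var = max((1.0 - gamma) / (N * f) + (delta - gamma ** 2), 1e-12)
    return 0.5 * (1.0 - f) * (1.0 - erf((thresh - mean) / np.sqrt(2.0 * var)))

def runaway_curve(M_max, N=1000, f=0.05, H=0.9, L=0.9):
    """Store memories one at a time, updating the running products of
    equations 3.4-3.5 with the current P_err and P_l; returns P_err(M)."""
    gamma, delta, perr_list = 1.0, 1.0, []
    for _ in range(M_max):
        perr = p_cross(H, gamma, delta, N, f)
        pl = p_cross(L, gamma, delta, N, f)
        gamma *= 1.0 - (f**2 + (1 - f)**2 * perr * pl
                        + (1 - f) * f * (perr + pl))
        delta *= (1 - f) * (1 - pl) + f * (1 - f)**2 * (1 - perr)**2
        perr_list.append(perr)
    return np.array(perr_list)
```

As a consistency check, with P_err = P_l = 0 the per-pattern factors reduce to 1 − f² and 1 − 2f² + f³, recovering the explicit-learning expressions of equations 3.2 and 3.3.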
Figure 3: (a) Capacity degradation as a function of network size. f = 0.05 and H = 0.9. The dashed line denotes the binary synaptic case and the solid line the Hebbian case. (b) Capacity degradation as a function of the normalized difference between firing and learning thresholds. In both figures, the error criterion is P_err = 0.05.
Our initial choice of equal learning and firing thresholds is arbitrary. In fact, biological evidence indicates that under normal conditions the learning threshold is higher than the firing threshold, as the membrane potential required for the opening of NMDA channels is above the neuron's firing threshold (Stein, Daw, & Fox, 1993). Holding the firing threshold fixed, we studied the effect of varying the learning threshold on synaptic runaway. Figure 3b depicts analytical and numerical results of capacity degradation in a binary firing network as a function of the difference between firing and learning thresholds, normalized to the firing threshold ((H − L)/H). As observed, the capacity degradation level is very sensitive to the learning threshold value. When the learning threshold is lower than the firing threshold (i.e., positive threshold difference values), synaptic runaway and memory capacity degradation become very significant.¹ A similar pattern is observed in the Hebbian, unbounded case.

¹ There is capacity degradation even when the learning threshold is higher than the firing threshold because only the postsynaptic neuron is affected by the learning threshold, and erroneous synapses may still be formed even if the postsynaptic neuron is correctly activated while the presynaptic neuron is activated erroneously.
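A rough scan over the normalized threshold difference, in the spirit of Figure 3b, can be built on the iteration sketched above (runaway_curve); the grid and the way capacity is read off are illustrative assumptions.

```python
import numpy as np

H, N, f = 0.9, 1000, 0.05

def capacity(perr_curve, criterion=0.05):
    """Memory load at which P_err first exceeds the error criterion."""
    above = np.nonzero(perr_curve > criterion)[0]
    return int(above[0]) + 1 if above.size else len(perr_curve)

for rel_diff in np.linspace(-0.2, 0.2, 9):
    L = H * (1.0 - rel_diff)          # positive (H - L)/H means a lowered L
    M_act = capacity(runaway_curve(300, N, f, H, L))
    print(f"(H-L)/H = {rel_diff:+.2f}   activity-dependent capacity ~ {M_act}")
```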
Figure 4: Plot of the mean and SD of network end state activity as a function of the number of memory patterns stored in the network during activity-dependent learning.
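A rough Python sketch of the storage-and-probe protocol described in the next paragraph, reusing the model functions sketched after section 2; the pattern generation, the cue distortion level, and the settling time are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, f, H, L = 1000, 0.05, 0.9, 0.8     # parameters quoted in the text
a = 1.0                               # external stimulus strength (assumption)
epochs, per_epoch = 25, 5

J = np.zeros((N, N))
patterns = []
for epoch in range(epochs):
    for _ in range(per_epoch):                        # store 5 more memories
        eta = (rng.random(N) < f).astype(float)
        patterns.append(eta)
        J = activity_dependent_step(J, eta, a * eta, f, L)
    end_states = []
    for eta in patterns:                              # probe with distorted cues
        S = eta * (rng.random(N) > 0.1)               # drop ~10% of active bits
        for _ in range(10):                           # let the network settle
            S = update_state(J, S, 0.0, f, H)
        end_states.append(S)
    end_states = np.array(end_states)
    print(epoch, end_states.mean(), end_states.std(axis=0).mean())
```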
The strong dependence of SR development on the value of the learning threshold supports the possibility that NMDA maturation failure may play an important role in schizophrenia. That is, modeling NMDA maturation failure as a decreased learning threshold, we expect that the network will become overloaded at an early stage, well before it reaches its explicit capacity limits. As a result, synaptic runaway will develop, pathological attractors may form, and the network's memory performance should collapse. To test the computational feasibility of our hypothesis, we performed a simulation experiment in which we gradually stored memory patterns in a binary synapses network in an activity-dependent manner. As before, N = 1000, f = 0.05, and H = 0.9, but the learning threshold is lowered to L = 0.8. Figure 4 traces the network's retrieval scenario at subsequent epochs. In each such epoch, five additional memories are stored in the network in an activity-dependent manner. To gauge the network's memory landscape in each epoch, distorted versions of all memory patterns stored up to that epoch are presented as input cues to the network, and the end states the network converges to are recorded. The mean activity level of the end states and the standard deviation (SD) of the end states' distribution (averaged over all neuron sites) are presented in Figure 4. Three phases in the network's evolution may be distinguished. Until the memory load reaches about 80 stored patterns, there is near-perfect memory
retrieval (mean end state activity is equal to the memories' coding level, and end state SD is almost constant). As more memories are learned, memory performance collapses, as is evident by noting that the mean activity of the retrieved states rises much above the network's coding level. During this intermediate collapse stage (until the presentation of about the one hundredth memory pattern), the loss of memory retrieval results in disorganized network activity, and the SD rises. This stage is followed by the final, third stage, where a pathological attractor evolves and gradually takes over the whole retrieval scene, as manifested by the fact that the SD decreases to zero. Note that although the activity level of the pathological attractor is high, probably involving a few neural assemblies that coalesce, it remains stable at this level practically forever, and a global attractor involving complete network excitation is not formed.

5 Discussion

We have shown that as long as the learning threshold is equal to or higher than the neuron's firing threshold, synaptic runaway leads to less than 20% memory capacity degradation with binary weights (Willshaw network) and to less than 40% capacity degradation with Hebbian weights (Tsodyks network) over a broad range of parameters. The extent of synaptic runaway in binary firing networks² is fundamentally different from that previously found in networks with analog firing functions. Hasselmo and his coworkers reported that synaptic runaway is a serious problem even in the normal state of analog-firing networks (L = H) and suggested that the suppression of intramodular transmission is needed to prevent synaptic runaway. We find that in binary firing networks, the situation is different; in their baseline state, synaptic runaway is quite moderate. However, synaptic runaway increases sharply when the learning threshold is decreased, suggesting that NMDA receptor dysfunction may have an important causal role in the pathogenesis of schizophrenia. To this end, we have studied activity-dependent memory storage in the presence of a decreased learning threshold. The synaptic runaway that develops in this case leads to the gradual formation of a pathological attractor; the latter reflects the concomitant activation of a few memory patterns in the network, which dominate the memory retrieval scene. The possible clinical interpretation of such an attractor is an open question; it may stand for a cognitively meaningful activation of several memories, thus being a neural correlate of schizophrenic positive symptoms such as delusions or hallucinations, or it may constitute a cognitively meaningless activation underlying schizophrenic negative symptoms (see Hoffman & Dobscha, 1989; Hoffman
² Note the distinction between binary and analog firing of the neurons, and binary and Hebbian synaptic weights.
& McGlashan, 1993; see particularly Ruppin et al., 1996, for a detailed discussion of this issue). It should be noted that, in contrast to the behavior observed in Ruppin et al. (1996), when SR occurs, the network's retrieval becomes completely dominated by a single, mixed attractor. However, our simulations show that if spontaneous activation of memory patterns occurs (e.g., due to increased background noise), the network may undergo a long interim period (of a few hundred epochs) during which a small set of the memories is strengthened and retrieved. This may parallel the known clinical observation in schizophrenia that positive symptoms (patterns close to a memory state) gradually subside and negative symptoms (global mixed states) eventually dominate the patients' clinical manifestations (Gray, Feldon, Rawlins, Hemsley, & Smith, 1991; Carpenter & Buchanan, 1994). As shown, the formation of pathological attractors is coupled with memory loss. In schizophrenics, however, pathological attractor formation may be limited to a few modules (see Ruppin et al., 1996), and hence memory functions may remain almost undamaged in the initial stages of the disease. An important distinction should be made between the formation of pathological attractors and their activation. Although we hypothesize that NMDA receptor dysfunction is a primary factor in the formation of pathological attractors, other factors should activate them after they have been formed to generate psychotic symptoms. As Olney and Farber (1995) proposed, NMDA receptor hypofunction may lead to overexcitation of cortical networks (via disinhibition), which in turn could lead to activation of pathological attractors previously formed by the delayed maturation of these receptors. Thus, in contrast to synaptic degeneration that results in passive memory decline in associative models of Alzheimer's disease (Horn, Ruppin, Usher, & Herrmann, 1993; Ruppin & Reggia, 1995; Horn, Levy, & Ruppin, 1996), concurrent overexcitation in schizophrenia may repeatedly activate the pathological attractors formed and even enhance their selective, biased amplification (Ruppin et al., 1996). The process of pathological attractor formation described above would eventually have occurred in the model network even with a normal learning threshold level, given sufficiently high memory storage levels. However, our claim is that such high storage levels are simply not reached in normal subjects. In fact, due to the limited capacity of associative memory models and their known sensitivity to catastrophic breakdown, one must assume the existence of safeguard mechanisms that regulate memory storage and prevent overload in normal conditions. Such local activity-dependent synaptic maintenance mechanisms, which in normal conditions can sense the formation of pathological attractors and counteract them, have been presented by Horn, Levy, and Ruppin (1997). In a similar vein, synaptic runaway will not develop in other neural models employing implicit mechanisms for weight normalization, such as networks embodying competitive learning, as long as these mechanisms are intact. The NRDM hypothesis put forward in this article has several
interesting connections with the classical dopaminergic hypothesis, which assumes that dopaminergic overactivity plays a major role in the pathogenesis of schizophrenia. First, dopamine is thought to enhance synaptic plasticity (Karler, Calder, & Turkanis, 1991; Calabresi, Maj, Pisani, Mercuri, & Bernardi, 1992) and therefore is likely to exacerbate the effects of NMDA retarded maturation. This may explain why dopamine-blocking agents take weeks to exert their therapeutic effects; after the reinforcement of pathological attractors is blocked with these agents, it is likely to take weeks (the typical estimate of the synaptic turnover period in mammals; Purves & Voyvodic, 1987; Wolff, Laskawi, Spatz, & Missler, 1995) for the underlying pathological synaptic connections to attenuate. Second, the combined role of NMDA receptor dysfunction and dopaminergic overactivity accords with the influential view of schizophrenia as a neurodevelopmental disorder (e.g., Weinberger, 1987), where a primary silent developmental deficit (malfunction of mechanisms regulating NMDA maturation) manifests itself clinically only later, when an additional pathological process takes place (excess dopaminergic activity). Third, as suggested previously by Cohen and Servan-Schreiber (1992), increased dopaminergic activity may exert part of its pathogenic effects in schizophrenia by increasing the steepness of sigmoidal neuronal firing functions. This may further affect the magnitude of SR, depending on the relation between the firing and learning thresholds, and merits further study in the future. In summary, adding to Hasselmo's hypothesis concerning the putative role of cholinergic disturbances and synaptic runaway in the progression of Alzheimer's disease, we suggest that synaptic runaway may play an important role in the pathogenesis of schizophrenia. In the light of the fundamental function of memory in information processing in the brain, it is perhaps not surprising that synaptic runaway may be involved in the pathogenesis of numerous brain disorders.

Appendix

The calculation of memory capacity in the binary synapse model follows.

A.1 Explicit Learning. The local field of a neuron is given by equations 2.1 and 2.2. Calculating the first two moments of the local field of a background (should-be-quiescent) neuron after the presentation of a stored memory pattern as an input to the network yields

$$\langle R_i \rangle = \left\langle \frac{\sum_{j \neq i}^{N} J_{ij}\, \eta_j^{\mu}}{Nf} \right\rangle = \frac{N-1}{N}(1 - \gamma) \approx (1 - \gamma), \tag{A.1}$$
where γ, the fraction of zero connections, is given by

$$\gamma = (1 - f^2)^M. \tag{A.2}$$
The second moment of the local field of a background neuron is

$$\langle R_i^2 \rangle = \frac{1}{(Nf)^2} \left\langle \sum_{j \neq i}^{N} \sum_{k \neq i}^{N} J_{ij}\, \eta_j^{\mu}\, J_{ik}\, \eta_k^{\mu} \right\rangle. \tag{A.3}$$

This summation can be divided into two sums, one over all j ≠ k and the other over j = k:

$$\langle R_i^2 \rangle = \frac{1}{(Nf)^2} \sum_{j \neq k}^{N} \langle J_{ij} J_{ik} \rangle \langle \eta_j^{\mu} \eta_k^{\mu} \rangle + \frac{1}{(Nf)^2} \sum_{j}^{N} \langle J_{ij}^2 \rangle \langle (\eta_j^{\mu})^2 \rangle. \tag{A.4}$$

To evaluate the correlation ⟨J_ij J_ik⟩ in the case j ≠ k, we consider the probability that the product J_ij J_ik is zero:

$$1 - \langle J_{ij} J_{ik} \rangle = \Pr(J_{ij} = 0) + \Pr(J_{ik} = 0) - \Pr(J_{ij} = 0, J_{ik} = 0). \tag{A.5}$$

The probability that J_ij equals zero is γ (see equation A.2), and the same is true for J_ik. The probability that both vanish is, per stored pattern, 1 − f + f(1 − f)² = 1 − 2f² + f³. Therefore

$$\langle J_{ij} J_{ik} \rangle = 1 - 2\gamma + \lambda, \tag{A.6}$$

where λ ≡ (1 − 2f² + f³)^M. Hence,

$$\langle R_i^2 \rangle = \frac{(N-1)^2 - (N-1)}{N^2} (1 - 2\gamma + \lambda) + \frac{N-1}{N^2 f} (1 - \gamma) \approx (1 - 2\gamma + \lambda) + \frac{1 - \gamma}{Nf} \tag{A.7}$$

and

$$\sigma_{R_i}^2 = \langle R_i^2 \rangle - \langle R_i \rangle^2 \approx \frac{1 - \gamma}{Nf} + \lambda - \gamma^2. \tag{A.8}$$
A.2 Activity-Dependent Learning. The derivation is similar to that presented for explicit learning, except that now

$$\gamma = \prod_{\mu=1}^{M} \left( 1 - \left[ f^2 + (1-f)^2 P_{err}(\mu) P_l(\mu) + (1-f) f \left( P_{err}(\mu) + P_l(\mu) \right) \right] \right) \tag{A.9}$$

and

$$\delta = \prod_{\mu=1}^{M} \left( (1-f)\left[1 - P_l(\mu)\right] + f (1-f)^2 \left[1 - P_{err}(\mu)\right]^2 \right). \tag{A.10}$$
The calculation of γ and δ must be done iteratively, in conjunction with the calculation of P_err and P_l (given by equation 3.1), as a function of the number of memories stored in the network at each iteration. The contributions of foreground and background neurons to synaptic formation should now be considered separately. For example, in the case of equation A.9, there are four terms in the square brackets. The first denotes the contribution from the (correct) firing of two foreground neurons, the second the contribution of the (erroneous) firing of two background neurons, the third the contribution of the firing of a background neuron onto a foreground neuron, and the last the contribution of the firing of a foreground neuron onto a background neuron. The derivation of equation A.10 follows from similar considerations.

Acknowledgments

We thank Ralph Hoffman and David Horn for very helpful discussions and comments, and the referees for their insightful comments. E. R. was supported by an Alon Fellowship.

References

Akbarian, S., Sucher, N. J., Bradely, D., Tafazzoli, A., Trinh, D., Hetrick, W. P., Potkin, S. G., Sandman, A., Bunney, E., & Jones, E. G. (1996). Selective alterations in gene expression for NMDA receptor subunits in prefrontal cortex of schizophrenics. Journal of Neuroscience, 16(1), 19–30.
Amit, D. J., & Fusi, S. (1992). Constraints on learning in dynamic synapses. Network, 3, 443–464.
Amit, D. J., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Calabresi, P., Maj, R., Pisani, A., Mercuri, N. B., & Bernardi, G. (1992). Long-term synaptic depression in the striatum: Physiological and pharmacological characterization. Journal of Neuroscience, 12(11), 4224–4233.
Carpenter, W. T., & Buchanan, R. W. (1994). Schizophrenia. New England Journal of Medicine, 330(10), 681–690.
Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychological Review, 99(1), 45–77.
Dong, D., & Hopfield, J. J. (1992). Dynamic properties of neural networks with adapting synapses. Network, 3, 267–283.
Golomb, D., Rubin, N., & Sompolinsky, H. (1989). Willshaw model: Associative memory with sparse coding and low firing rates. Physical Review A, 41(4), 1843–1854.
Gray, J. A., Feldon, J., Rawlins, J. N. P., Hemsley, D. R., & Smith, A. D. (1991). The neuropsychology of schizophrenia. Behavioral and Brain Sciences, 14, 1–84.
Hasselmo, M. E. (1993). Acetylcholine and learning in a cortical associative memory. Neural Computation, 5, 32–44.
Hasselmo, M. E. (1994). Runaway synaptic modification in models of cortex: Implications for Alzheimer's disease. Neural Networks, 7(1), 13–40.
Hasselmo, M. E., Anderson, B. P., & Bower, J. M. (1992). Cholinergic modulation of cortical associative memory function. Journal of Neurophysiology, 67(5), 1230–1246.
Hoffman, R., & Dobscha, S. (1989). Cortical pruning and the development of schizophrenia: A computer model. Schizophrenia Bulletin, 15(3), 477.
Hoffman, R. E., & McGlashan, T. W. (1993). Parallel distributed processing and the emergence of schizophrenic symptoms. Schizophrenia Bulletin, 19(1), 119–140.
Horn, D., Levy, N., & Ruppin, E. (1996). Neuronal-based synaptic compensation: A computational study in Alzheimer's disease. Neural Computation, 8, 1227–1243.
Horn, D., Levy, N., & Ruppin, E. (1997). Memory maintenance via neuronal regulation. Neural Computation, 10(1), 1–18.
Horn, D., Ruppin, E., Usher, M., & Herrmann, M. (1993). Neural network modeling of memory deterioration in Alzheimer's disease. Neural Computation, 5, 736–749.
Johnston, M. V. (1995). Neurotransmitters and vulnerability of the developing brain. Brain and Development, 17(5), 301–306.
Karler, R., Calder, L. D., & Turkanis, S. A. (1991). DNQX blockade of amphetamine behavioral sensitization. Brain Research, 552, 295–300.
Olney, J. W., & Farber, N. B. (1995). Glutamate receptor dysfunction and schizophrenia. Arch. Gen. Psych., 52, 998–1007.
Purves, D., & Voyvodic, J. T. (1987). Imaging mammalian nerve cells and their connections over time in living animals. Trends Neurosci., 10, 398–404.
Ruppin, E., & Reggia, J. (1995). A neural model of memory impairment in diffuse cerebral atrophy. Br. J. Psychiatry, 166(1), 19–28.
Ruppin, E., Reggia, J., & Horn, D. (1996). A neural model of positive schizophrenic symptoms. Schizophrenia Bulletin, 22(1), 105–123.
Stein, P. S. G., Daw, N. W., & Fox, K. (1993). The role of NMDA receptors in information processing. Annu. Rev. Neurosci., 16, 207–222.
Tsodyks, M. V. (1989). Associative memory in neural networks with the Hebbian learning rule. Modern Physics Letters B, 3(7), 555–560.
Weinberger, D. R. (1987). Implications of normal brain development for the pathogenesis of schizophrenia. Arch. Gen. Psych., 44, 660–669.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Nonholographic associative memory. Nature, 222, 960–962.
Wolff, J. B., Laskawi, R., Spatz, W. B., & Missler, M. (1995). Structural dynamics of synapses and synaptic components. Behavioural Brain Research, 66, 13–20.

Received February 28, 1997; accepted July 23, 1997.
Communicated by Carl Van Vreeswijk
On Numerical Simulations of Integrate-and-Fire Neural Networks D. Hansel Centre de Physique Théorique UPR014 CNRS, Ecole Polytechnique, 91128 Palaiseau Cedex, France
G. Mato Centro Atómico Bariloche, Comisión Nacional de Energía Atómica, 8400 S. C. de Bariloche, Argentina
C. Meunier L. Neltner Centre de Physique Théorique UPR014 CNRS, Ecole Polytechnique, 91128 Palaiseau Cedex, France
It is shown that very small time steps are required to reproduce correctly the synchronization properties of large networks of integrate-and-fire neurons when the differential system describing their dynamics is integrated with the standard Euler or second-order Runge-Kutta algorithms. The reason for that behavior is analyzed, and a simple improvement of these algorithms is proposed.
1 Introduction

Our theoretical understanding of the properties of large neuronal systems relies heavily on simulations of networks consisting of up to several thousand interacting neurons. Such simulations are highly time-consuming. For a general network architecture, the CPU time is dominated by the evaluation of the interactions between neurons and scales like N², where N is the size of the network. Moreover, one frequently needs to investigate the system's behavior for many different sets of parameters or to perform a statistical analysis over many initial conditions. It is therefore crucial to integrate the dynamics with an appropriate algorithm that allows one to use as large a time step as possible. Integrate-and-fire (IF) neurons are frequently used for modeling networks of interacting neurons. In IF models, the active properties of the neural membrane responsible for spike generation are not explicitly taken into account. Only the passive membrane properties are incorporated in the equations, and one assumes that a spike is fired whenever the membrane poten-
Neural Computation 10, 467–483 (1998)
© 1998 Massachusetts Institute of Technology
tial of the neuron crosses some prescribed threshold. In the simplest IF model (Lapicque, 1907; Tuckwell, 1988), this firing is followed by the instantaneous resetting of the membrane potential to its resting value. In other models, a potassium conductance opens at firing time and leads to the repolarization of the membrane potential (see, for instance, Wehmeier, Dong, Koch, & Van Essen, 1989, for a large-scale neural network using such a model). These passive IF models are less stiff than the biophysically more realistic conductance-based models, which incorporate the voltage-dependent ionic currents responsible for the firing of action potentials (Hodgkin & Huxley, 1952), and it seems that larger time steps could be used to integrate their dynamics. However, the IF dynamics presents discontinuities of the membrane potential or its derivatives at firing times, which may cause severe numerical problems. This has led several authors to use "exact integration schemes" (Tsodyks, Mitkov, & Sompolinsky, 1993; Mascagni & Sherman, 1997), which are easily written for Lapicque neurons interacting through synaptic currents and run reasonably fast. However, such schemes cannot be extended to every IF model or to interactions described by synaptic conductances. A number of questions thus remain open regarding the numerical integration of IF models. Under which conditions, in particular for which time steps, can reliable results be obtained when a simple integration scheme, such as the Euler algorithm or the second-order Runge-Kutta algorithm (RK2), is used? Can other numerical schemes avoid the problems due to discontinuities without jeopardizing the computational efficiency? These issues are addressed in this article in the framework of large, globally coupled networks of identical IF neurons. In the next section, we describe the IF network model used in this study and explain how the level of synchrony of the network can be characterized and how we estimate the error on the numerical integration of the dynamics. In section 3 we show that very small time steps are needed to integrate the dynamics of an IF network correctly when the Euler algorithm is used. We also show that using schemes that are a priori of higher order, such as the RK2 algorithm, does not help, because the global integration error is dominated by errors arising from the discontinuous nature of the dynamics. We then demonstrate that the addition of a simple interpolation scheme at firing times allows us to solve these problems. Section 4 summarizes and discusses our results.
2 Methods

2.1 The Model. We consider a network of N = 128 identical excitatory neurons. For the sake of simplicity, we assume that the neurons are coupled all-to-all. Neurons are described by a single-compartment model. The membrane potential evolves in the passive zone below the firing threshold
θ according to the linear differential equation

$$C \frac{dV}{dt} = -g_l (V - V_l) + I_{syn}(t) + I_0, \tag{2.1}$$
where g_l and V_l are the conductance and the reversal potential of the voltage-independent leak current, and I_syn is the synaptic current due to the action of the other neurons of the network. The current I_0 is a constant external drive that makes the neurons fire periodically when uncoupled; it is set to I_0 = 2.30 µA/cm² throughout the article. The membrane capacitance C is set to 1 µF/cm². Requiring a passive membrane time constant τ = C/g_l of 10 ms leads to a leak conductance g_l = 0.1 mS/cm². The resting membrane potential V_rest = V_l is equal to −60 mV in our simulations. Whenever the membrane potential V reaches the threshold θ (−40 mV in our simulations), a spike is fired, and V is instantaneously reset to rest:

$$V(t_0^+) = V_{rest} \quad \text{if} \quad V(t_0^-) = \theta. \tag{2.2}$$
We do not incorporate any absolute refractory period in the dynamics. When the change in the membrane conductance due to synaptic interactions is not taken into account, the synaptic current I_syn(t) is given by

$$I_{syn}(t) = \frac{\bar{I}_{syn}}{N} \sum_{\text{neurons}} \sum_{\text{spikes}} f(t - t_{spike}), \tag{2.3}$$

where

$$f(t) = \frac{1}{\tau_1 - \tau_2} \left( e^{-t/\tau_1} - e^{-t/\tau_2} \right). \tag{2.4}$$
The summation is performed over all the spikes emitted prior to time t by all the other neurons of the network. The normalization adopted here ensures that the time integral of f(t) is always unity. For such an interaction, which is used throughout the article with synaptic time constants τ₁ = 3 ms and τ₂ = 1 ms, one can write a closed analytical expression for the evolution of the membrane potentials of the neurons during time intervals where no spike is fired in the network. Another, biophysically more realistic description of the synaptic current can also be used that incorporates changes in the membrane conductance,

$$I_{syn}(t) = -g_{syn} (V - V_{syn}) \sum_{\text{neurons}} \sum_{\text{spikes}} f(t - t_{spike}), \tag{2.5}$$
where f (t) has the same form as in equation 2.4 and Vsyn is the reversal potential of the synaptic current. A closed expression can still be derived for V, but it must now be evaluated by numerical integration.
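As an illustration of the synaptic drive of equations 2.3 through 2.5, the double-exponential kernel can be accumulated with two decaying traces instead of an explicit sum over past spikes. The following Python sketch is ours, with the class name and interface as illustrative assumptions.

```python
import numpy as np

TAU1, TAU2 = 3.0, 1.0      # synaptic time constants of section 2.1 (ms)
N = 128

class SynapticDrive:
    """Running sum over past spikes of the kernel f(t) of equation 2.4,
    maintained with two exponential traces.  For the all-to-all,
    current-based coupling of equation 2.3, a single global drive shared
    by all neurons suffices."""
    def __init__(self):
        self.x1 = 0.0
        self.x2 = 0.0
    def add_spikes(self, n_spikes):
        # each spike injects a kernel whose time integral is unity
        self.x1 += n_spikes / (TAU1 - TAU2)
        self.x2 += n_spikes / (TAU1 - TAU2)
    def decay(self, dt):
        self.x1 *= np.exp(-dt / TAU1)
        self.x2 *= np.exp(-dt / TAU2)
    def kernel_sum(self):
        return self.x1 - self.x2   # = sum of f(t - t_spike) over past spikes

def i_syn_current(drive, i_bar):
    """Current-based synaptic input, equation 2.3."""
    return i_bar / N * drive.kernel_sum()

def i_syn_conductance(drive, g_syn, V, V_syn):
    """Conductance-based variant, equation 2.5."""
    return -g_syn * (V - V_syn) * drive.kernel_sum()
```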
2.2 Initial Conditions. At t = 0, all synaptic currents are set to 0, and the potentials of the neurons are chosen according to

$$V_i(0) = V_l + \frac{I_0}{g_l} \left( 1 - \exp\left( -c\, \frac{i-1}{N}\, \frac{T}{\tau} \right) \right) \tag{2.6}$$

(i = 1, . . . , N), where

$$T = -\tau \ln\left( 1 - \frac{g_l}{I_0} (\theta - V_l) \right) \tag{2.7}$$

is the firing period of the neurons in the absence of interaction, and the coefficient 0 ≤ c ≤ 1 controls the degree of synchrony of the initial condition. Setting c = 0 yields a perfectly synchronized initial condition. At the opposite extreme, c = 1 corresponds to a uniform distribution of firing times for uncoupled neurons.

2.3 Measure of Synchrony. To measure the degree of synchrony in the network, we follow the method proposed and developed in Hansel and Sompolinsky (1992), Golomb and Rinzel (1993), and Ginzburg and Sompolinsky (1994), which is grounded in the analysis of the temporal fluctuations of the network activity. One evaluates at a given time t the average membrane potential,

$$A_N(t) = \frac{1}{N} \sum_{i=1}^{N} V_i(t). \tag{2.8}$$

Its time fluctuations can be characterized by the variance

$$\Delta_N = \left\langle A_N(t)^2 \right\rangle_t - \left\langle A_N(t) \right\rangle_t^2. \tag{2.9}$$

This variance is normalized to the population-averaged variance of single-cell activity,

$$\Delta = \frac{1}{N} \sum_{i=1}^{N} \left( \left\langle V_i(t)^2 \right\rangle_t - \left\langle V_i(t) \right\rangle_t^2 \right). \tag{2.10}$$

The resulting parameter,

$$\Sigma_N = \frac{\Delta_N}{\Delta}, \tag{2.11}$$

generally behaves for large N as

$$\Sigma_N = \chi + \frac{a}{N} + O\!\left( \frac{1}{N^2} \right), \tag{2.12}$$
where a is some constant, and χ, lying between 0 and 1, measures the degree of coherence in the system in the infinite-size limit. In particular, χ = 1 if the system is totally synchronized (i.e., V_i(t) = V(t) for all i), and χ = 0 if the state of the system is asynchronous. In the rest of this work, the computation of Σ_N is performed by running simulations over t = 10 s and sampling the state of the network every 1 ms from time t = 5 s to time t = 10 s. The network size we use is not large enough to preclude finite-size effects in the dynamical properties (see Hansel & Sompolinsky, 1996, for discussions of this issue in neural modeling). However, we are interested here in the accuracy of the estimate of Σ_N and not in its large-size limit, and a size N = 128 is sufficient for this purpose. In the following, Σ₁₂₈ will be denoted simply by Σ.

2.4 Exact Integration and Estimate of the Numerical Error in a Simulation. We first recall how the model defined by equations 2.1 through 2.4 can be integrated exactly. Solving equation 2.1 for the membrane potential, one finds

$$V_i(t) = V_i(0)\, e^{-t/\tau} + \frac{I_0}{g_l} \left( 1 - e^{-t/\tau} \right) + V_l \left( 1 - e^{-t/\tau} \right) + \frac{1}{C} \int_0^t I_{syn}(t')\, e^{(t'-t)/\tau}\, dt' \tag{2.13}$$
(i = 1, . . . , N), where the first three terms correspond to the single-cell dynamics and the last term stems from the synaptic interactions. Since I_syn(t′) is a combination of exponential functions, the integral in the last term can be computed analytically. Starting from a given state of the network and a given value of the synaptic current, one uses equation 2.13 to calculate the first time at which a neuron fires. The potential of that neuron is then reset to rest, the membrane potential of the other neurons is evaluated, and a new spike is included in equation 2.3. This procedure is then iterated again and again. In this method, integrating the differential equations amounts to solving transcendental equations. This can be done to any desired precision within the limits of computer accuracy, by using, for instance, a Newton-Raphson scheme; hence the name exact method. This method has been used by Tsodyks et al. (1993) to study the synchronization properties of IF networks. The results it gives will be used as a comparison point for the results we obtain with different algorithms. In the following, we are interested in characterizing the accuracy of the numerical integration of the network dynamics. We compute the relative error on the average firing rate (calculated between times t = 5 s and t = 10 s) of the whole population,

$$\epsilon_f = \left| \frac{\hat{f} - f}{f} \right|. \tag{2.14}$$
In the following, we are interested in characterizing the accuracy of the numerical integration of the network dynamics. We compute the relative error on the average firing rate (calculated between times t = 5 s and t = 10 s) of the whole population,

\epsilon_f = \left| \frac{\hat{f} - f}{f} \right|.   (2.14)

The precision on the description of the collective properties of the network will be quantified by the relative error on the coherence,

\epsilon_\Sigma = \left| \frac{\hat{\Sigma} - \Sigma}{\Sigma} \right|.   (2.15)
Here f and Σ represent the exact values of the firing rate and the coherence, while f̂ and Σ̂ are the numerical estimates of these quantities.

3 Results

3.1 Euler Algorithms.

3.1.1 Standard Euler Algorithm. The Euler scheme with fixed time step is the simplest integration scheme that can be used to study the dynamics of the network. Knowing the membrane potential and the synaptic conductance at time t, one calculates the membrane potential at time t + Δt according to

V(t + \Delta t) = V(t) + \Delta t \left( -g_l (V(t) - V_l) + I_{syn}(t) + I_0 \right).   (3.1)

When V(t + Δt) > θ, a spike is assumed to be fired at time t + Δt, and the membrane potential is reset. Synaptic conductances are then updated accordingly.
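A minimal vectorized sketch of this update (names are ours; the synaptic update itself is left to the caller) could read:

import numpy as np

def euler_step(V, Isyn, dt, gl, Vl, I0, theta, Vrest):
    """One fixed-step Euler update of eq. 3.1 for all neurons.
    Any neuron crossing threshold is assumed to fire at t + dt."""
    V = V + dt * (-gl * (V - Vl) + Isyn + I0)
    fired = V > theta
    V[fired] = Vrest      # reset; the spikes in 'fired' then update Isyn
    return V, fired

# Example with illustrative values:
# V, fired = euler_step(np.full(128, -65.0), np.zeros(128),
#                       0.01, 0.1, -65.0, 2.3, -50.0, -65.0)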
Figure 1 displays the coherence Σ of the network as a function of the synaptic strength Ī_syn for Δt = 0.25 ms, Δt = 0.1 ms, and Δt = 0.01 ms. We have also plotted on the same graph the results yielded by an exact integration. In all these simulations, the same partially coherent (c = 0.5; see section 2.2) initial condition was used. At large times, that is, in the asymptotic regime once the transient has died out, the exact value of the coherence is a decreasing function of the synaptic strength, in agreement with the fact that excitation is desynchronizing for IF neurons (Hansel, Mato, & Meunier, 1995; Van Vreeswijk, Abbott, & Ermentrout, 1995; Van Vreeswijk, 1996). At large coupling, the network is in the incoherent state, and accordingly the synchrony is of order O(1/N). The synchronization transition occurs for Ī_syn ≈ 0.78 µA/cm². The curve obtained for a time step Δt = 0.25 ms deviates strongly from the exact result over the whole range of coupling strengths. Moreover, as shown in Figure 2, the numerical estimate of the coherence depends on the initial condition. For Δt = 0.1 ms some improvement is seen, but, again, the results depend on the initial condition. Moreover, the synchrony parameter does not display the correct qualitative behavior at small coupling, and the location of the transition is incorrectly predicted.
Figure 1: Coherence Σ as a function of the coupling strength Ī_syn (µA/cm²) for the Euler algorithm and different time steps. Dashed line: Δt = 0.25 ms; dotted line: Δt = 0.1 ms; dashed-dotted line: Δt = 0.01 ms; solid line: exact integration. Network parameters: N = 128, τ₁ = 3 ms, τ₂ = 1 ms, and I₀ = 2.3 µA/cm².
This latter problem is alleviated with a time step of Δt = 0.01 ms, but the coherence in the region of large synchrony does not yet behave smoothly and still differs from the exact value by 15% to 20%. An integration time step of Δt = 0.001 ms is actually required to evaluate correctly the coherence of the network in this regime (result not shown). The firing rate of the network, evaluated with the exact method, varies from f ≈ 50 Hz at weak coupling to f ≈ 90 Hz near the transition. Therefore, the minimum time step required to describe correctly the synchronization properties corresponds to O(10⁴) time steps per firing period.

The order of the integration algorithm can be derived from the error ε_f on the firing rate at different integration time steps. The results are displayed in Figure 3 (solid line). Here the synaptic strength has been fixed to Ī_syn = 0.60 µA/cm², in the midst of the synchronized phase. For sufficiently small time steps, ε_f goes to zero linearly in the time step. A fit of the results yields

\epsilon_f^E = A_f^E \, \Delta t,   (3.2)

with A_f^E ≈ 0.310 ms⁻¹.
Figure 2: Coherence Σ as a function of the coupling strength Ī_syn (µA/cm²) for the Euler algorithm (Δt = 0.25 ms) and different initial conditions. Dashed line: c = 0.1; solid line: c = 0.5; dotted line: c = 0.9. Other parameters are the same as in Figure 1.
The relative error on the coherence is plotted in Figure 4. It decreases to zero linearly with the time step, although the scaling is not as precise as for ε_f. This can be explained by the discontinuity inherent to IF neurons. Indeed, it may happen that at sampling time, some neurons are very close to the firing threshold. Depending on the time step, the numerically estimated value of the potential for these neurons will be either close to θ (the neuron is about to fire) or close to V_rest (the neuron has just fired). This introduces irregularities in the scaling behavior that would not occur for a model with a continuous evolution of the potentials. We have checked that last point by simulating a network of IF neurons whose dynamics incorporates an after-hyperpolarization current, which led to a much better scaling for the coherence Σ. These results show that although the Euler algorithm is a first-order integration scheme and converges to the correct results in the limit Δt → 0, very small time steps, as compared to the firing period, are required to achieve a reasonable precision on the coherence Σ (see Table 1).
Figure 3: Relative error on the firing rate ε_f as a function of the time step dt (ms) for the different algorithms. Solid line: standard Euler; dotted line: Euler with interpolation; dashed line: standard RK2; dashed-dotted line: RK2 with interpolation. Ī_syn = 0.60 µA/cm². Other parameters are the same as in Figure 1.

Table 1: Maximum Time Step Allowed for Four Methods.

Method                      Time Step (ms) for 10% on Σ    Time Step (ms) for 1% on Σ
Standard Euler              3.1 × 10⁻²                     2.0 × 10⁻³
Euler with interpolation    6.2 × 10⁻²                     2.0 × 10⁻³
Standard RK2                6.2 × 10⁻²                     3.9 × 10⁻³
RK2 with interpolation      0.25                           6.3 × 10⁻²

Note: Same parameters as in Figure 3.
3.1.2 Improved Euler Algorithm. The standard Euler algorithm can be modified to increase the accuracy of the determination of firing times by adding a linear interpolation scheme at firing times: when at time t + Δt the potential V⁰(t + Δt) (obtained using the standard Euler algorithm) is above threshold, the interpolated value of the firing time is used to estimate the potential at time t + Δt, yielding

V(t + \Delta t) = \left( V^0(t + \Delta t) - \theta \right) \left( 1 + g_l \Delta t \, \frac{V(t) - V_{rest}}{V^0(t + \Delta t) - V(t)} \right) + V_{rest},   (3.3)

and to update the synaptic current (see the appendix). Figure 5 displays the results of simulations performed with this algorithm. The interpolation greatly improves the performance of the integration, which now yields the correct qualitative behavior of the coherence. However, small time steps are still required to achieve good quantitative results.
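In code, the correction applied at a firing event might look as follows; this is a sketch under our naming conventions, combining equation 3.3 with equation A.1 of the appendix:

def interpolated_reset(V_t, V0_next, dt, gl, theta, Vrest):
    """V0_next: standard-Euler estimate of V(t + dt) that exceeded theta.
    Returns the interpolated firing-time offset within the step (eq. A.1)
    and the corrected end-of-step potential (eq. 3.3)."""
    t_spike = dt * (theta - V_t) / (V0_next - V_t)                # eq. A.1
    V_next = (V0_next - theta) * (
        1.0 + gl * dt * (V_t - Vrest) / (V0_next - V_t)) + Vrest  # eq. 3.3
    return t_spike, V_next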
Figure 4: Relative error on the coherence ε_Σ as a function of the time step dt (ms) for the different algorithms. Conventions are the same as in Figure 3.
Figure 5: Coherence Σ as a function of the coupling strength Ī_syn (µA/cm²) for the modified Euler algorithm and different time steps. Conventions are the same as in Figure 1.
This modified Euler scheme is also expected to be first order in the time step. This is confirmed by Figures 3 and 4. The error on the firing rate behaves linearly,

\epsilon_f^{EI} = A_f^{EI} \, \Delta t,   (3.4)

with A_f^{EI} ≈ 0.219 ms⁻¹. Note that the prefactor is about 30% smaller than for the standard Euler algorithm.
3.1.3 Estimate of the Error Introduced by the Discontinuity of the Dynamics. The improvement in accuracy achieved with a linear interpolation originates in the elimination from the global error of a term of order O(Δt) due to the discontinuity of the dynamics. When a neuron fires a spike between times t and t + Δt, a local error on the firing time of order O(Δt) is made in the standard algorithm, which assumes that firing occurs systematically at time t + Δt. This has two consequences. First, it leads to an error of order O(Δt) on the value of the potential at time t + Δt. The frequency of such errors depends on the firing rate but not on the time step. Their contribution to the global error is then of order O(Δt). But the error on the firing time also introduces an error of order O(Δt) on the interaction term, which itself produces at each time step an error on the potential of the neurons of order O(Δt²) and leads to a global error of order O(Δt). So the global error of order O(Δt) comprises two terms: the former is directly related to the resetting and persists in the case of a single neuron, while the latter is mediated by the interaction and exists only in networks. Using a linear interpolation scheme enables one to increase the order in Δt of both error terms, as the firing times are then estimated with a precision of O(Δt²). The results of the previous section indicate that the error in the standard Euler algorithm stemming from the discontinuity of the dynamics is, for small Δt, of the same order of magnitude as the error due to the integration between firing times. More precisely, a lower bound for the importance of the error induced by the discontinuity of the dynamics is given by the quantity

\epsilon_f^{reset} = \epsilon_f^E - \epsilon_f^{EI} \approx 0.091 \, \Delta t.   (3.5)
3.2 Second-Order Runge-Kutta Algorithms. One might think that reliable results can be obtained for a larger time step, and without relying on an interpolation, by increasing the order of the discretization in time. If one uses, for instance, the standard RK2 algorithm, the potential is updated according to (see Mascagni & Sherman, 1997)

V(t + \Delta t) = V(t) + \frac{\Delta t}{2} \left( -g_l \left( V(t) + \Delta t \, f_1(t) - V_l \right) + I_{syn}(t + \Delta t) + I_0 + f_1(t) \right),   (3.6)

where

f_1(t) = -g_l (V(t) - V_l) + I_{syn}(t) + I_0,   (3.7)

and I_syn(t + Δt) is computed as if no spike were emitted between t and t + Δt.
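A sketch of this update, in the same style as the Euler sketch above (names are ours), is:

def rk2_step(V, Isyn_t, Isyn_next, dt, gl, Vl, I0):
    """Standard RK2 (Heun) update of eqs. 3.6-3.7; Isyn_next is the
    synaptic current at t + dt, computed as if no spike occurred."""
    f1 = -gl * (V - Vl) + Isyn_t + I0                  # eq. 3.7
    f2 = -gl * (V + dt * f1 - Vl) + Isyn_next + I0
    return V + 0.5 * dt * (f1 + f2)                    # eq. 3.6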
However, this is not a second-order scheme in the present instance, since the error on firing times is still of order O(Δt) (see Figures 3 and 4). The error on the firing rate, ε_f^RK2, behaves at small time step as

\epsilon_f^{RK2} = A_f^{RK2} \, \Delta t,   (3.8)

where A_f^RK2 ≈ 0.0947 ms⁻¹, as expected for a first-order algorithm. Moreover, the fact that

A_f^{RK2} \approx A_f^E - A_f^{EI}   (3.9)
indicates that the global error in the standard RK2 scheme is dominated by the error due to the discontinuity of the dynamics, in agreement with the analysis of section 3.1.3. Subtracting the errors is meaningful in the present case, as we checked that the global error systematically decreases when moving from the standard Euler scheme to the modified Euler scheme, and then to the RK2 algorithm.

The determination of the spiking times by linear interpolation is consistent with the order 2 of the RK2 algorithm. The firing times are estimated with a precision of O(Δt²), which results in an error of order O(Δt³) in the interaction, in agreement with the local error of the method. Therefore, one expects that supplementing the standard RK2 algorithm with a linear interpolation scheme (hereafter the RK2I scheme) will yield a consistent second-order algorithm for integrating IF networks. This is confirmed by Figures 3 and 4. The quantity ε_f^RK2I now behaves as (Δt)²,

\epsilon_f^{RK2I} = A_f^{RK2I} \, (\Delta t)^2,   (3.10)

with A_f^RK2I ≈ 0.0470 ms⁻².
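Combining the two previous sketches gives the skeleton of one RK2I step. All names are again illustrative; the interpolated firing times returned here are what make the scheme consistently second order:

def rk2i_step(V, Isyn_t, Isyn_next, dt, gl, Vl, I0, theta, Vrest):
    """One RK2-with-interpolation step; reuses rk2_step and
    interpolated_reset from the sketches above."""
    V_next = rk2_step(V, Isyn_t, Isyn_next, dt, gl, Vl, I0)
    spikes = []
    for i in range(len(V)):
        if V_next[i] > theta:
            t_sp, V_next[i] = interpolated_reset(V[i], V_next[i], dt,
                                                 gl, theta, Vrest)
            spikes.append((i, t_sp))   # O(dt^2)-accurate firing times
    return V_next, spikes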
Moreover, this modified RK2 scheme allows one to use much larger time steps than the previous algorithms (see Table 1). Figure 6 displays the coherence as a function of the coupling strength for the modified RK2 algorithm for Δt = 0.25 ms, Δt = 0.1 ms, and Δt = 0.01 ms. The numerical integration with Δt = 0.25 ms provides an excellent estimate over the whole range of coupling, and the curves obtained for Δt ≤ 0.1 ms can barely be distinguished from the results of the exact integration of the dynamics.
Figure 6: Coherence Σ as a function of the coupling strength Ī_syn (µA/cm²) for the modified RK2 and different time steps. Parameters are the same as in Figure 1. The results obtained for Δt = 0.01 ms cannot be distinguished from the results of the exact integration of the dynamics at the scale used in the figure.

4 Summary and Discussion

We have shown, on the example of an all-to-all network of excitatory IF neurons, that standard integration algorithms (Euler or Runge-Kutta) give reliable quantitative estimates of the synchrony properties only for very small integration time steps.
A simple modification of these algorithms, in which a linear interpolation procedure is used to determine the firing times, greatly improves the integration performance. In particular, we found that time steps larger by one to two orders of magnitude can be used when one adopts the modified RK2 scheme instead of the standard Euler or RK2 algorithms. At large time steps, standard algorithms can even lead to strong artifacts in the dynamics, in the form of spurious multistability: the dynamics then seems to display several asymptotic states, the state eventually reached depending on the initial condition. This originates in the fact that differences in the firing times of different neurons smaller than the integration time step are discarded.

The results presented in this article have been obtained for networks of N = 128 neurons, but increasing the size of the network does not change our conclusions. Let us consider the case of the standard Euler algorithm and assume that Δt is sufficiently small for the average firing rate and the coherence to be accurately evaluated for N = 128. This means that the firing-times distribution, which is then discretized in time with a bin of size Δt, is evaluated with sufficient accuracy.
Table 2: CPU Time per Time Step for the Four Integration Schemes Used.

Method                      CPU Time per Time Step (s)
Standard Euler              4.6 × 10⁻⁵
Euler with interpolation    4.8 × 10⁻⁵
Standard RK2                8.1 × 10⁻⁵
RK2 with interpolation      8.6 × 10⁻⁵
If one keeps the same value of Δt while increasing the size of the network, one expects that the firing-time distribution of the neurons will remain unchanged, the number of neurons in each bin scaling with the network size. This spurious clustering will lead to the same calculated values of the average firing rate and of the coherence regardless of the system size. For instance, we have checked that for N = 128 to 5000 (same parameters as in Figure 3), a time step of Δt = 3.1 × 10⁻² ms led to the same computed value of the coherence, determined with a 10% accuracy. It is clear that the scaling of equation 2.12 cannot be obtained when such spurious clustering occurs, as Σ_N saturates instead of converging to a large-N limit. The error on Σ_N is then dominated by the finite-size effects at N = 128 and not by the inaccuracy of the numerical integration. However, this problem is fully cured by relying on an interpolation scheme for the determination of firing times, and we checked that the correct scaling of Σ_N with N is then recovered.

The CPU times required on an HP9000 (Model 819/K200) per time step of integration are given in Table 2 for the four integration methods we compared. One iteration of the RK2 scheme takes approximately 50% more time than an iteration of the Euler algorithm, as the derivative of the potential is evaluated twice per time step. Supplementing this algorithm with a linear interpolation affects this computation time very slightly and leads to an algorithm that is much more efficient than the three other integration schemes, because the time step can be increased by an order of magnitude for a desired accuracy of the results. This modified algorithm is also much more efficient than the exact method. With a time step Δt = 0.1 ms, 8.6 seconds of CPU time are required to integrate 10 seconds of the dynamics (10⁵ time steps). In contrast, the exact integration of the same 10 seconds requires about 500 seconds, roughly 60 times longer. Actually, even the Euler algorithm with a time step of 10⁻³ ms would be more efficient than the exact method. Note, however, that when the network is composed of all-to-all coupled identical neurons, the exact method can be considerably accelerated, as the neurons always fire in the same order and one can predict which neuron is going to fire next. Making use of that fact brings the exact integration on par with the modified RK2 scheme.
It is often believed that integrating the dynamics of IF networks is much faster than integrating the dynamics of similar networks of conductance-based neurons. Our results show that this may not be true when a standard integration algorithm is used. On the contrary, this claim is substantiated when using the RK2I algorithm.

In the IF model studied in this article, we have neglected the change in membrane conductance due to synaptic interaction. This was motivated by the fact that it is then possible to compare the results obtained to those given by an exact integration scheme. However, our conclusions regarding the relative efficiency of the different algorithms remain valid when the changes in membrane conductance are taken into account and also when a refractory period is introduced in the model.

A natural generalization of the RK2I algorithm would be a higher-order Runge-Kutta method supplemented with an appropriate interpolation scheme, also of higher order. However, if the change of the interaction term when a neuron fires a new spike is not carefully taken into account when computing the new value of the potential, an error of order O(Δt²), inconsistent with any RK algorithm of order higher than 2, is introduced. One way to avoid this problem is to adopt an adaptive time step at firing times, but the computation time is then dominated for large time steps by the determination of the firing times, which depends little on Δt and scales like N². As a consequence, higher-order algorithms are less efficient than the RK2I scheme, as the gain in precision is counterbalanced by the increase in CPU time. For instance, the RK2I algorithm with a time step of 0.1 ms is always more efficient than the RK3I algorithm by a factor of at least 2.5. This shows that among Runge-Kutta algorithms, the RK2I scheme is optimal for integrating the deterministic dynamics of IF networks.

Finally, we discuss the respective effects of the two types of error of order O(Δt) exhibited by standard algorithms (see section 3.1.3). The first error originates in the discontinuity of the single-cell dynamics at firing times and does not arise in the smooth dynamics of conductance-based models. It occurs in IF models even when the potential is continuous, as shown by a frequently employed type of IF neuron, where the smooth repolarization of the membrane potential is due to an AHP current turned on by the spikes of the neuron:

I_{AHP}(t) = -g_{AHP} (V - V_{AHP}) \sum_{\text{spikes of the neuron}} g(t - t_{spike}).   (4.1)

Here g(t) is an alpha function with a time constant τ_AHP, and V_AHP is the reversal potential. The membrane potential and its first derivative are then continuous, but its second derivative is discontinuous at firing times. The error at each time step is of order O(Δt²), but it accumulates over time and still leads to a global error of order O(Δt). Here, again, the problem can be corrected by an interpolation scheme.
The second error of order O(Δt) occurs only in networks and is mediated by the interaction between neurons. It is caused by errors in the determination of spiking times, is not related to the precise form of the interaction, and is present whenever firing times are determined by the crossing of a threshold. This form of discontinuity in the dynamics occurs in networks of conductance-based neurons as well as in networks of IF neurons. The classical Runge-Kutta schemes (Euler, RK2, etc.) are all first order in that case, but one can devise a genuinely second-order Runge-Kutta scheme by determining firing times via a linear interpolation. However, the maximum time step is now fixed by the steepness of the problem, that is, by the rise time of spikes in the case of conductance-based models. This maximal time step is then of the order of 0.05 ms, which strongly limits the gains obtained from the use of the RK2I algorithm, because the standard methods then allow one to estimate the coherence to within a few percent. The error mediated by the interaction is expected to have no major consequences on the dynamics of conductance-based models in that range of time steps.

All the results presented in this article concern deterministic models. Similar issues need to be addressed when noise is present in the network dynamics. A crucial question is to determine whether the noise occludes the errors induced by the discontinuous nature of the dynamics. This will be the object of a subsequent work fully dedicated to stochastic algorithms for IF neurons.

Appendix

Let us suppose that V(t) < θ and V⁰(t + Δt) ≥ θ. Using equation 3.1, one evaluates by linear interpolation the time t_spike (t < t_spike < t + Δt) at which the membrane potential reaches threshold:

t_{spike} = t + \Delta t \, \frac{\theta - V(t)}{V^0(t + \Delta t) - V(t)}.   (A.1)

The membrane potential is reset to V_rest at that time. Between t_spike and t + Δt, the time derivative of the membrane potential is given at lowest order by

\frac{dV}{dt}(t_{spike}) = \frac{dV}{dt}(t) + g_l (V(t) - V_{rest}) = \frac{V^0(t + \Delta t) - V(t)}{\Delta t} + g_l (V(t) - V_{rest}).   (A.2)

V(t + Δt) is equal, at the order considered, to V(t_spike) + (t + Δt − t_spike) dV/dt(t_spike). Using this equation and the estimated value of t_spike, one gets equation 3.3. The synaptic current is updated at time t + Δt using the firing time of equation A.1. We do not take into account the change in the synaptic current in the time interval [t, t + Δt] due to the firing at time t_spike, since doing it consistently requires determining which neuron fires first and adopting an adaptive time step.
This would significantly increase the computation time. The global error associated with this approximation is of order O(Δt²) and thus consistent with the improved versions of the Euler and RK2 algorithms.

References

Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comp. Neurosci., 3, 7–34.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635.
Mascagni, M. V., & Sherman, A. (1997). Numerical methods for neural modeling. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed.). Cambridge, MA: MIT Press.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in integrate-and-fire networks. Phys. Rev. Lett., 71, 1280–1283.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
Van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1995). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Wehmeier, U., Dong, D., Koch, C., & Van Essen, D. (1989). Modeling the mammalian visual system. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 335–359). Cambridge, MA: MIT Press.

Received January 2, 1996; accepted June 4, 1997.
Communicated by Joshua Alspector
A Floating-Gate MOS Implementation of Resistive Fuse T. Sawaji Department of Electrical, Electronics, and Computer Engineering, Waseda University, Tokyo 169, Japan
T. Sakai Electronic Navigation Research Institute, Ministry of Transport, Tokyo 181, Japan
H. Nagai T. Matsumoto Department of Electrical, Electronics, and Computer Engineering, Waseda University, Tokyo 169, Japan
Resistive fuses are key elements in weak string filters, which smooth out noise while detecting step edges inherent in the original data. A resistive fuse is implemented by two pairs of floating-gate metal oxide semiconductor (MOS) transistors in a chip fabricated with a standard double-poly complementary MOS process.

1 Introduction

The "resistive fuse" proposed by Harris et al. (1989, 1990) is a (nonlinear) circuit element whose I–V characteristic is given in Figure 1. For |v| ≤ v*, this element acts as a linear resistor, i = G₁v (Ohm's law). If |v| > v*, no current flows. If one connects many resistive fuses in parallel, as in Figure 2, then one can perform rather interesting image processing. In Figure 2, the resistive fuses are represented by the parallel nonlinear resistors, and {u_k} are the input image represented as current sources. If one inputs a noisy image as {u_k}, then the resulting node-voltage distribution {v_k} smooths out noise while preserving edges inherent to the original image. This filter is called the weak string filter (Blake & Zisserman, 1987). If one replaces the resistive fuses by linear resistors, these edges are blurred. Figure 3 shows simulated results of the weak string filter and the linear filter. An analytical result is given in the appendix.

Neural Computation 10, 485–498 (1998)
© 1998 Massachusetts Institute of Technology

2 Implementation with Floating-Gate MOS Transistors

2.1 Smooth Characteristic. The I–V characteristic given in Figure 1 has practical and theoretical difficulties. The difficulties come from the fact that this characteristic is discontinuous.
Figure 1: I–V characteristic of the resistive fuse.
Figure 2: Parallel network with resistive fuse.
In order to circumvent these difficulties, one can introduce an auxiliary parameter T and consider the I–V characteristic (see Figure 4 and the appendix)

i = G_T(G_1, \alpha; v) := \frac{G_1 v}{1 + e^{-(\alpha - G_1 v^2)/T}}.   (2.1)
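As a small numerical illustration of this characteristic (Python, with our names; the exponent is clipped only to avoid floating-point overflow):

import numpy as np

def G_T(v, G1, alpha, T):
    """Smoothed resistive fuse, eq. 2.1: Ohmic for G1*v**2 << alpha,
    cut off for G1*v**2 >> alpha; T > 0 controls the smoothness."""
    x = np.clip(-(alpha - G1 * v**2) / T, -60.0, 60.0)
    return G1 * v / (1.0 + np.exp(x))

# e.g., with the parameter values quoted later in Figure 8:
# i = G_T(np.linspace(-0.02, 0.02, 401), G1=12.0, alpha=7.0e-4, T=7.0e-7)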
Figure 3: Simulated weak string filter (panels: input data, linear filter, weak string filter).
Equation 2.1 is smooth for any finite T > 0, so it may serve as a resistive fuse. Our purpose here is to implement a nonlinear element that behaves like this equation using floating-gate MOS transistors.

2.2 Floating-Gate MOS Transistors. The use of floating-gate MOS transistors was proposed by Shibata and Ohmi (1992). Consider the NMOS (N-channel MOS) structure shown in Figure 5, where G_f stands for a floating gate. This structure can be easily implemented with double-poly processes. The oxide thickness between the first poly (floating gate) and the second poly is generally several times greater than the gate oxide thickness. This is because the substrate has an extremely fine crystal structure, while the first poly has a much less fine structure. Let C_k be the capacitance between gate G_k, called the input gate, and the floating gate G_f, k = 1, ..., n, and let C_ox be the gate oxide capacitance.
Figure 4: I–V characteristic G_T(G_1, α; v).
Figure 5: Floating-gate MOS structure.
Assuming that the initial charge at the floating gate is zero and neglecting capacitance nonlinearities, one has from charge conservation that

C_{ox} V_{G_f} = \sum_{k=1}^{n} C_k (V_k - V_{G_f}),   (2.2)

so that

V_{G_f} = \frac{\sum_{k=1}^{n} C_k V_k}{C_{ox} + \sum_{k=1}^{n} C_k}.   (2.3)

If V_T is the threshold voltage as seen from the floating gate, the transistor turns "on" if

V_{G_f} = \frac{\sum_{k=1}^{n} C_k V_k}{C_{ox} + \sum_{k=1}^{n} C_k} > V_T.   (2.4)
After the transistor turns on, the current between drain and source follows the triode characteristic, so one can control the current by controlling V_k, k = 1, ..., n. Thus, one can control the current by controlling \sum_{k=1}^{n} C_k V_k. Shibata, Ohmi, and others have applied this structure to many interesting circuits (Shibata & Ohmi, 1992; Kotani, Shibata, & Ohmi, 1992; Ishii, Shibata, & Ohmi, 1992). Suppose n = 2 and let V_1 = 0. Then

V_{G_f} = \frac{C_2 V_2}{C_{ox} + C_1 + C_2}.   (2.5)

Thus, if

V_2 > \frac{C_{ox} + C_1 + C_2}{C_2} V_T,   (2.6)

then

V_{G_f} > V_T.   (2.7)
Thus, if equation 2.6 holds, the transistor is "on" even when V_1 = 0; that is, the MOS transistor as seen from gate 1 is in depletion mode. A similar argument is valid for P-channel MOS (PMOS).
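The following sketch (illustrative values in arbitrary units) evaluates equations 2.3 and 2.6:

def floating_gate_voltage(C, V, Cox):
    """Floating-gate voltage of eq. 2.3 for input-gate capacitances C
    and voltages V, assuming zero initial floating-gate charge."""
    return sum(c * v for c, v in zip(C, V)) / (Cox + sum(C))

# Depletion-mode check of eq. 2.6 (n = 2, V1 = 0), illustrative values:
Cox, C1, C2, VT = 1.0, 0.5, 1.0, 0.7
V2 = 2.0
assert V2 > (Cox + C1 + C2) / C2 * VT                         # eq. 2.6
assert floating_gate_voltage([C1, C2], [0.0, V2], Cox) > VT   # hence eq. 2.7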
Figure 6: Resistive fuse with two pairs of floating-gate MOS transistors.
2.3 The Circuit. Before presenting our circuit, let us briefly mention other implementations. Harris's (1990) original implementation demands more than 30 transistors. Yu and Lee proposed 7- and 11-transistor fuses (Yu, Decker, Lee, Sodini, & Wyatt, 1992). Finally, Yu et al. proposed and implemented a 4-transistor fuse. As for this particular implementation, however, one cannot control the I–V characteristic after fabrication. In addition, the circuit requires depletion-mode transistors (e.g., ion implantation of particular dopants). This means that one cannot control the values of \sqrt{\alpha/G_1} and \sqrt{\alpha G_1} in Figure 4.

The circuit we propose is given in Figure 6, where two pairs of floating-gate MOS transistors are connected in a particular way. The configuration is essentially the same as the one in Yu et al. (1992); however, we can control the I–V characteristic after fabrication and do not need to use a nonstandard process to produce depletion-mode transistors. The only requirement here is a double-poly process, which is easily available. Therefore, by controlling V_bias^n and V_bias^p, one can put the transistors of Figure 6 in depletion mode. Furthermore, by controlling V_bias^n and V_bias^p, one can control the values of \sqrt{\alpha/G_1} and \sqrt{\alpha G_1}. For fixed values of V_bias^n and V_bias^p, the operating principles are straightforward and are given by

V_T^n = V_{T(0)}^n - \frac{C_2^n V_{bias}^n}{C_{ox}^n + C_1^n + C_2^n}, \qquad V_T^p = V_{T(0)}^p - \frac{C_2^p V_{bias}^p}{C_{ox}^p + C_1^p + C_2^p},

I = \begin{cases} k\left(-\tfrac{3}{4}V + V_T^n\right)V, & V < V_T^n \\ k\left(-\tfrac{1}{2}V + V_T^n\right)^2, & V_T^n \le V < 2V_T^n \\ 0, & 2V_T^n \le V, \end{cases}   (2.8)

where V_{T(0)}^n and V_{T(0)}^p are the threshold voltages seen from the floating gate.
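A direct transcription of this piecewise characteristic, for one polarity of the fuse, is sketched below (the parameter values quoted in the comment are those of Figure 8):

def fuse_current(V, k, VTn):
    """Piecewise I-V characteristic of eq. 2.8; continuous at V = VTn
    and at V = 2*VTn, where the current is cut off."""
    if V < VTn:
        return k * (VTn - 0.75 * V) * V
    if V < 2.0 * VTn:
        return k * (VTn - 0.5 * V) ** 2
    return 0.0

# Figure 8 compares this with eq. 2.1 using k = 3.5e-5 and VTn = 0.3.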
Figure 7: Subcircuit of a fuse.
In order to explain the operation in an intuitive manner, consider the left half of the circuit, as given in Figure 7. Suppose that V_in^+ increases, whereas V_in^− decreases. Then, since M_p^+ is in depletion mode for an appropriate V_bias^p value, the drain-source current increases as V_in^+ increases while V_in^− decreases. Beyond a certain value, the increase of the voltage at the C_1 terminal forces M_p^+ out of saturation and hence the drain-source current decreases; finally, the current becomes very small. Transistor M_n^+ operates in a complementary manner.
Figure 8: Comparison of equation 2.1 with equation 2.8, where α = 7.0 × 10⁻⁴, G₁ = 12.0, T = 7.0 × 10⁻⁷, V_T^n = 0.3, and k = 3.5 × 10⁻⁵.
The full fuse circuit (see Figure 6) is a series connection of the two half circuits with opposite polarity, which gives rise to the fuse characteristic. It should be noted, however, that the exact analytical form of the resistive fuse given by equation 2.1 is not realized, simply because of the MOS transistor characteristics. The proposed circuit has the I–V characteristic given by equation 2.8, which is very similar to equation 2.1.

3 Chip Implementation

A completely standard 2 µm double-poly process with two layers of metal was used. The first layer of poly was used as floating gates. Table 1 summarizes floating-gate and input-gate sizes. Table 2 shows estimated capacitance values between the input gates G₁, G₂ and the floating gate.
Table 1: Gate Sizes, W/L (µm)/(µm).

        Floating Gate   Input Gate G₁   Input Gate G₂
NMOS    2.90/2.80       41.7/20.7       41.7/41.4
PMOS    2.90/3.00       41.7/20.7       38.4/38.4

Table 2: Estimated Capacitances.

        C₁ (fF)   C₂ (fF)
NMOS    500       1000
PMOS    500       850
Figure 9 shows a photograph of the chip die. Figure 10 shows experimental results, where V_bias^p = −5.09 V for all cases, whereas V_bias^n took four different values: V_bias^n = 4.00 V, 4.46 V, 5.10 V, and 5.80 V.
Observe that the desired resistive fuse characteristic is realized and that the I–V characteristic can be controlled by the V_bias^n value, as predicted.

4 Discussion

There are slight asymmetries in the I–V characteristics shown in Figure 10. In order to explain this, let us look at half of the resistive fuse circuit, as given in Figure 7. The threshold voltage of the NMOS is given by

V_T^n = V_{T(0)}^n + \gamma \left( \sqrt{V_{sb} + 2\phi_F} - \sqrt{2\phi_F} \right) = V_{T(0)}^n + \gamma \left( \sqrt{V_m - V_{SS} + 2\phi_F} - \sqrt{2\phi_F} \right),   (4.1)

where V_{T(0)}^n is the threshold voltage when there is no body effect (when V_sb = 0), γ stands for the body-effect coefficient, and φ_F is the Fermi potential. Obviously, V_m changes according to the V_in^+ and V_in^− values. Note that this circuit (see Figure 7) shows the fuse characteristic when V_in^+ − V_in^− > 0, whereas if V_in^+ − V_in^− < 0, this circuit behaves like a forward-biased diode, so that the right half of the resistive fuse circuit of Figure 6 is responsible for the fuse characteristics. Another issue that needs to be addressed is Q_f(0), the initial charge stored in the floating gate. If Q_f(0) ≠ 0, there can be a distortion. A reset circuit is proposed in Sakai, Nagai, and Matsumoto (1996) to circumvent this problem.
Figure 9: Chip die photo.
A drawback of the configuration is that each resistive fuse needs four floating-gate biases, which gives rise to a scalability problem. One means of solving this is to use tunnel current, as in electrically erasable and programmable read-only memory, to set the floating-gate voltage at a desired value (Shibata & Ohmi, 1992). Figure 11 (Sawaji, 1993) shows a configuration where two terminals are provided for the tunneling purpose.

5 Conclusion

We have demonstrated, by showing chip implementation results, that Harris's resistive fuse can be implemented by two pairs of floating-gate MOS transistors. The resistive fuse is known to be the key element of the weak string filter, with which one can detect step edges while smoothing out noise. A similar implementation of the weak rod filter is an interesting problem (Matsumoto & Kondo, 1994, 1996).

Appendix

This appendix gives several analytical aspects of the weak string filter, in which the resistive fuses play a fundamental role.
Figure 10: Measured I–V characteristics for V_bias^n = 4.00 V, 4.46 V, 5.10 V, and 5.80 V, together with a linear resistor (27.8 kΩ) for comparison.
Let a noisy input d = (d_1, ..., d_n) ∈ R^n be given. One way of smoothing out noise while restoring edges is to consider, in addition to the intensity vector v = (v_1, ..., v_n) ∈ R^n, a new vector l = (l_2, ..., l_n) ∈ {0, 1}^{n−1}, where each l_k indicates whether an edge is present. The weak string filter (Blake & Zisserman, 1987) achieves this by minimizing

E(v, l; d) := \lambda_f \| v - d \|^2 + \lambda_s \sum_{k=2}^{n} (v_k - v_{k-1})^2 (1 - l_k) + \lambda_l \sum_{k=2}^{n} l_k.   (A.1)
The minimization

\min_{v, l} E(v, l; d)   (A.2)

is nontrivial, since there is a product term (v_k − v_{k−1})²(1 − l_k) and since l_k is binary instead of real. One way to overcome this difficulty is to eliminate l, the line variable.
Figure 11: The four biases can be eliminated by controlling the floating-gate voltage with tunnel current.
A simple way of doing this is to minimize out l. Since

E^*(v; d) := \min_{l \in \{0,1\}^{n-1}} E(v, l; d) = \lambda_f \| v - d \|^2 + \sum_{k=2}^{n} \min_{l_k \in \{0,1\}} \left[ \lambda_s (v_k - v_{k-1})^2 (1 - l_k) + \lambda_l l_k \right],   (A.3)

that is, the minimization with respect to l can be done separately for each l_k, one can show that

E^*(v; d) = \lambda_f \| v - d \|^2 + \sum_{k=2}^{n} F(v_k - v_{k-1}),   (A.4)

where

F(v) = \begin{cases} \lambda_s v^2, & |v| \le \sqrt{\lambda_l / \lambda_s} \\ \lambda_l, & |v| > \sqrt{\lambda_l / \lambda_s}. \end{cases}
Observe that if the optimal v satisfies |v_k − v_{k−1}| ≤ √(λ_l/λ_s) for all k, k = 2, ..., n, then v is linear with respect to d. If, however, |v_k − v_{k−1}| > √(λ_l/λ_s) for at least one k, the derivative of F vanishes there and the current is "cut," which is the origin of the name fuse. A problem with equation A.4 is that E*(v; d) is no longer differentiable in v at |v_k − v_{k−1}| = √(λ_l/λ_s), although it is still continuous, so one cannot differentiate it to derive necessary conditions for a minimum. In addition, one cannot use a gradient-type algorithm to compute a minimum.
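A direct numerical transcription of equation A.4 (with our names; v and d are one-dimensional arrays) makes the structure explicit:

import numpy as np

def F(dv, lam_s, lam_l):
    """Truncated quadratic of eq. A.4: quadratic below the edge
    threshold sqrt(lam_l/lam_s), constant above it."""
    return np.where(lam_s * dv**2 <= lam_l, lam_s * dv**2, lam_l)

def E_star(v, d, lam_f, lam_s, lam_l):
    """Weak-string energy with the line variables minimized out."""
    return lam_f * np.sum((v - d)**2) + np.sum(F(np.diff(v), lam_s, lam_l))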
To circumvent this difficulty, Lumsdaine, Wyatt, and Elfadel (1989) used a Markov random field (MRF) argument and "integrated out" l.

Fact. Consider (v, l) as a pair of random variables and assume that their joint probability distribution given d is Gibbsian:

P_T(v, l \mid d) = \frac{1}{Z} e^{-E(v, l; d)/T},   (A.5)
where E(v, l; d) is defined by equation A.1, T > 0, and Z is the normalization constant. The parallel network given in Figure 2, with the nonlinear resistor (resistive fuse) characterized by

G_T(G_1, \alpha; v) = \frac{G_1 v}{1 + e^{-(\alpha - G_1 v^2)/T}}   (A.6)

(see Figure 4), realizes the weak string filter in the sense that for every T > 0, the node voltage distribution v maximizes the marginal probability distribution given d,

P_T(v \mid d),   (A.7)
where d = λ_f u, λ_s = G_1, and λ_l = α. Note that for each T, the function G_T is differentiable infinitely many times. Note also that as T ↓ 0, −T ln P_T(· | d) tends to E*(· ; d) plus a constant, while G_T(G_1, α; ·) does not converge to

G(G_1, \alpha; v) := \begin{cases} G_1 v, & |v| \le \sqrt{\alpha/G_1} \\ 0, & |v| > \sqrt{\alpha/G_1}. \end{cases}
Acknowledgments

We express our appreciation to H. Yonetsu of Toyohashi University of Technology, T. Shibata of Tohoku University, and H. Kobayashi of Yokogawa Electric for their valuable discussions. Comments from the reviewers were also helpful.

References

Blake, A., & Zisserman, A. (1987). Visual reconstruction. Cambridge, MA: MIT Press.
Harris, J. G., Koch, C., Luo, J., & Wyatt, J., Jr. (1989). Resistive fuses: Analog hardware for detecting discontinuities in early vision. In C. Mead & M. Ismail (Eds.), Analog VLSI implementation of neural systems. Norwell, MA: Kluwer.
Harris, J. G., Koch, C., & Luo, J. (1990). A two-dimensional analog VLSI circuit for detecting discontinuities in early vision. Science, 248, 1209–1211.
Ishii, H., Shibata, T., Kosaka, H., & Ohmi, T. (1992). Hardware-learning of neuron MOS neural networks. International Electron Devices Meeting, San Francisco, CA, December 13–16.
Kotani, K., Shibata, T., & Ohmi, T. (1992). Neuron-MOS binary-logic circuits featuring dramatic reduction in transistor count and interconnections. International Electron Devices Meeting, San Francisco, CA, December 13–16.
Lumsdaine, A., Wyatt, J. L., Jr., & Elfadel, I. M. (1989). Parallel distributed networks for image smoothing and segmentation in analog VLSI. In Proc. 28th Conference on Decision and Control.
Matsumoto, T., & Kondo, K. (1994). Realization of "weak rod" by double layer parallel networks. Neural Computation, 6, 944–956.
Matsumoto, T., & Kondo, K. (1996). A massively parallel resistive network for "weak rod": A double layer architecture with nearest neighbor connections. Neural Networks, 9, 523–541.
Sakai, T., Nagai, H., & Matsumoto, T. (1996). A reset circuitry for the differential amplifier with floating gate transistors. ITEJ Technical Report, IPU 96–44/IDY 96–132, pp. 43–48, October.
Sawaji, T. (1993). Resistive fuse with four νMOS transistors and the weak rod filter. Unpublished master's thesis, Waseda University.
Shibata, T., & Ohmi, T. (1992). A functional MOS transistor featuring gate-level weighted sum and threshold operations. IEEE Trans. Electron Devices, 39, 1444–1455.
Yu, P. C., Decker, S. J., Lee, H., Sodini, C. G., & Wyatt, J. L. (1992). CMOS resistive fuses for image smoothing and segmentation. IEEE J. Solid State Circuits, 27, 545–553.
Received February 19, 1997; accepted June 13, 1997.
ARTICLE
Communicated by Ralph Linsker
Toward a Biophysically Plausible Bidirectional Hebbian Rule Norberto M. Grzywacz Pierre-Yves Burgi Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.
Although the commonly used quadratic Hebbian–anti-Hebbian rules lead to successful models of plasticity and learning, they are inconsistent with neurophysiology. Other rules, more physiologically plausible, fail to specify the biological mechanism of bidirectionality and the biological mechanism that prevents synapses from changing from excitatory to inhibitory, and vice versa. We developed a synaptic bidirectional Hebbian rule that does not suffer from these problems. This rule was compared with physiological homosynaptic conditions in the hippocampus, with the results indicating the consistency of this rule with long-term potentiation (LTP) and long-term depression (LTD) phenomenologies. The phenomenologies considered included the reversible dynamics of LTP and LTD and the effects of N-methyl-D-aspartate blockers and phosphatase inhibitors. 1 Introduction Hebb (1949) postulated a synaptic mechanism for learning where increases in synaptic strength are dependent on concurrent activity in pre- and postsynaptic cells. Stent (1973) further proposed that negative correlation in preand postsynaptic activity reduces synaptic strength. One of the most successful mathematical formulations of this Hebb-Stent bidirectional rule is the covariance rule (Sejnowski, 1977). This and related rules are the cornerstone of many proposed unsupervised learning schemes (von der Malsburg, 1973; Oja, 1982; Linsker, 1986; Miller, 1994; Burgi & Grzywacz, 1997). In this rule, changes in synaptic weights are proportional to the covariance between the firing rates of the pre- and postsynaptic cells. There is now neurophysiological evidence supporting a form of synaptic plasticity that is governed by something like the covariance rule (Stanton & Sejnowski, 1989; Xie, Berger, & Barrionuevo, 1992; Dudek & Bear, 1993; Fr´egnac, Burke, Smith, & Friedlander, 1994). However, application of this rule leads to two major physiological inconsistencies. First, the synaptic weight of any given synapse can become positive or negative. This situation is biologically unlikely because synapses are either excitatory or inhibitory.1 1
Neural Computation 10, 499–520 (1998)
© 1998 Massachusetts Institute of Technology
new rule, we apply a set of physiological and pharmacological tests to our model. Here, the tests are for one synapse, corresponding to the case of homosynaptic modifications.2 The results of these tests are compared qualitatively with similar experimental conditions in the hippocampus. The new rule and this comparison appeared in abstract form elsewhere (Grzywacz & Burgi, 1994).

1 A possible exception is the GABA_A synapses, which are excitatory early in development but become inhibitory later on (Cherubini, Rovira, Gaiarsa, Corradetti, & Ari, 1990; Walton, Schaffner, & Barker, 1993). However, even these synapses do not change polarity more than once, the changes are always in the same direction (from excitatory to inhibitory), and they always occur regardless of the network. Therefore, these changes do not have the flexibility required by the covariance rule.

2 Model

In this model (see Figure 1A), a feedback messenger activates presynaptic enzymes controlling synaptic strength. These enzymes work by controlling the active and inactive states of a gating molecule. For computational simplicity, it is assumed that the synaptic weight (w) is proportional to the active state's concentration. (For ease of understanding, in Figure 1A, this active state is labeled w.) A possible interpretation of the active state is that it facilitates synaptic-vesicle docking (more on this in section 3). Let us suppose that at a given synapse, the maximal possible synaptic weight (proportional to the maximal concentration of the gating molecules) is w_0; that is, the pool of gating molecules is finite. In this case, w_0 = w + w*, where w* is proportional to the inactive state's concentration.

Let us denote the presynaptic activities and synaptic strengths (transmitter release per unit of presynaptic activity) of a set of synapses onto a single postsynaptic cell by I_j and w_j, respectively, where 1 ≤ j ≤ p and p is the number of synapses. The overall response of the postsynaptic cell is

R = \sum_{j=1}^{p} w_j I_j   (2.1)
(an assumption shared by most other models; for a review, see Churchland & Sejnowski, 1992). We assume the concentration of a postsynaptic agent (C) at a given postsynaptic site to be proportional to the product of R and the concentration of neurotransmitter (w_j I_j) in the synaptic cleft of that synapse. Because, for simplicity, it is also assumed that the concentration of feedback messenger ([m_j]) produced by synaptic activity is proportional to the local concentration of C, one gets

[m_j] = \alpha R \, w_j I_j,   (2.2)
where α is a constant. To obey one of the main constraints of the model (that no synaptic change should occur in the absence of activity), we postulate a molecular switch (see Figure 1A; see also Lisman, 1989). It would operate presynaptically and would have to be turned on by the messenger to allow for synaptic changes.
502
Norberto M. Grzywacz and Pierre-Yves Burgi
Figure 1: Schematic representation of the new model. (A) The presynaptic input (I) causes the release of the transmitter (glutamate in the hippocampus), whose quantity (T) depends on a particle that determines synaptic strength (w). Specifically, T is proportional to the multiplication of I and w. A fraction of the T molecules of the transmitter attaches to special receptors (NMDA receptors in the hippocampus), which, if the postsynaptic site is depolarized, allow the entrance of an agent C (Ca2+ in the hippocampus) into it. This agent activates an enzyme (P1 ), which produces a messenger (m) that can propagate to the presynaptic terminal and activate an enzymatic switch (Es ). When this switch is on, it allows the flow between the active (quantified by w) and inactive (quantified by w∗ ) states of the particle determining synaptic strength. This flow is such that low concentrations of m cause the inactivation of the particle (with a stoichiometry of 1) through enzyme Ei , whereas high concentrations of m cause the activation of the particle (with a stoichiometry of n) through enzyme Ea . (B) The switch Es promotes the transitions w,w∗ * w∗∗ , where w∗∗ is a transient metastate, while the transitions w∗∗ * w,w∗ depend on the messenger through the inactivating (Ei ) and activating (Ea ) enzymes. Rates of the thermal decays are indicated by k4 to k5 .
Bidirectional Hebbian Rule
503
switch (see Figure 1A; see also Lisman, 1989). It would operate presynaptically and would have to be turned on by the messenger to allow for synaptic changes. A possible implementation of this switch is k1
E∗s,j + mj + Es,j *mj + 2Es,j ,
(2.3)
with the thermic reactions k2
Es,j E∗s,j , k3
(2.4)
where Es,j and E∗s,j are the active and inactive states of the switch, respectively, and k1 , k2 , and k3 are the rate constants of the reactions. What equation 2.3 expresses is an autocatalytic switch. The more active switch molecules there are (higher concentrations of Es,j ), the easier is to activate switch molecules. This is similar to Lisman’s self-phosphorylating enzymes (Lisman, 1989), which become active when phosphorylated. Lisman’s notion was that when activated, these enzymes would tend to become all phosphorylated due to their positive-feedback reaction. Consequently, the behavior of such enzymes would tend to be all or none, ideal for a switch. To Lisman’s switch notion, we add only the thermal reactions in equation 2.4 to allow the switch to be turned off. (For it to be rapidly turned off, one must assume k3 ¿ k2 .) From the chemical equations 2.3 and 2.4, one can express the variations of the concentrations, [Es,j ], of the switches as d[Es,j ] = −k2 [Es,j ] + ([Es,0 ] − [Es,j ])(k3 + k1 [mj ][Es,j ]), dt
(2.5)
where [Es,0 ] = [Es,j ] + [E∗s,j ] is the maximal concentration of active switches. The feedback messenger activates two presynaptic enzymes, which control synaptic strength by shifting the balance of the gating molecule toward either its inactive or active states. One enzyme, Ei , reduces synaptic strength in proportion to its current concentration and is activated by the feedback messenger with a stoichiometry of 1. The other enzyme, Ea , requires n > 1 molecules of the messenger to increase synaptic strength up to a maximal value. The postsynaptic agent C can thus control LTP and LTD through the different stoichiometries. For high C, the large-stoichiometry enzyme dominates, causing LTP, whereas for low C, the other enzyme dominates, causing LTD. The switch allows back-and-forth transitions between the active and inactive states of the gating molecule, depending on the amount of messenger. To achieve this goal, we propose that the switch promotes the transitions from these states to a transient unstable metastate (see Figure 1B). This metastate decays enzymatically back to the active and inactive states with rate constants that depend on the messenger (see Figure 1B). By writing the equations corresponding to the enzymatic reactions schematically shown
504
Norberto M. Grzywacz and Pierre-Yves Burgi
in Figure 1B and by assuming that the decay from the metastate is much faster than the arrival to it, one can express the variation of wj as follows (see the appendix for details): dwj −k9 [Es,j ](k4 +k7 β[mj ])wj +k8 [Es,j ](k5 +k6 γ [mj ]n )(w0 −wj ) = , (2.6) dt k4 +k5 +k6 γ [mj ]n +k7 β[mj ] where k4 and k5 are rate constants of thermic reactions; k6 , k7 , k8 , and k9 are rates of enzymatic reactions (see Figure 1B); and β and γ are constants linking mj to the enzymes (see the appendix). Because equations 2.1 and 2.2 have no dynamics and because equation 2.5 essentially represents an on-off switch, the important dynamics are in equation 2.6. The LTP-LTD phenomenology is introduced explicitly in the Ei and Ea reactions modeled by this equation. It says that when the switch is on (Es,j significantly larger than zero), weight changes can occur and depend on the present weight (wj ) and the feedback messenger ([mj ]). If the weight is near zero (wj ≈ 0) or near saturation (wj ≈ w0 ), then the synaptic weight tends to increase or decrease, respectively. For large amounts of feedback messenger (k7 β[mj ] ¿ k6 γ [mj ]n ), the tendency will be for synaptic weight to increase. For intermediate amounts (k7 β[mj ] À k6 γ [mj ]n ), weights will tend to fall. Finally, for low amounts, the switch will turn off, preventing weight changes. The denominator of equation 2.6 imposes bounds on how fast synaptic changes can occur. 3 Methods Model simulations were performed on a SPARC II workstation, using C++. To solve the differential equations, we used the fifth-order Runge-Kutta method with adaptive step size control (Press, Teukolsky, Vetterling, and & Flannery, 1992). The equations’ initial conditions were wj = w0 /2 (the middle of wj ’s range) and Es,j = Es,0 k3 /(k2 + k3 ), (the steady-state value of Es,j in the absence of inputs). The simulations involved solving numerically equations 2.1, 2.2, 2.5, and 2.6. A direct count revealed that these equations have 16 parameters: p, α, β, γ , k1 , k2 , k3 , k4 , k5 , k6 , k7 , k8 , k9 , [Es,0 ], w0 , and n. However, not all the parameters were independent. By substituting equation 2.2 for [mi ] in equation 2.5, one sees that α appears multiplied by k1 . Similarly, by substituting equation 2.2 for [mi ] in equation 2.6, one sees that α and α n always appear multiplied by k6 β and k7 γ , respectively. Moreover, β and γ do not appear in any other form. Therefore, without loss of generality, we set α = β = γ = 1, effectively eliminating these three parameters. For related reasons, the parameters [Es,0 ] and w0 could also be set to 1, which is akin to forcing [Es,j ] and wj to lie between 0 and 1. Because [Es,i ] always appeared multiplied by
Bidirectional Hebbian Rule
505
k8 and k9 in equation 2.6, the absolute amplitude of this variable is irrelevant. As for wj , its absolute value is irrelevant, because plots of synaptic modification are of relative, rather than absolute, modulations. Finally, k4 , k5 , k6 , and k7 are not mutually independent, since inspection of equation 2.6 (with [Es,0 ] = 1) reveals that k4 , k5 , k7 [mi ], and k6 [mi ]n all have the same physical dimension, which is arbitrary, since it cancels in a ratio. Hence, without loss of generality, we set k6 = 1. The only parameters that were independent in the model and thus were not automatically set to 1 were p, k1 , k2 , k3 , k4 , k5 , k7 , k8 , k9 , and n. In the simulations of the model, we set p = 1 not because it was dependent on other parameters, but because we wished to simulate the homosynaptic condition. As for the other parameters, we chose their values after a cursory exploration of the parameter space to yield simulations that resembled the hippocampus data (to restrict the parameter space, we attempted successfully to model the data with k4 = k5 and k8 = k9 ). Variations of each these parameters by factors of three up and down yielded data that were quantitatively but not qualitatively different, demonstrating the robustness of the model. The values of the parameters throughout the article were k1 = 10, k2 = 0.5, k3 = 0.001, k4 = k5 = 1, k7 = 100, k8 = k9 = 0.33, and n = 2. These parameter values are reasonable and do not violate the assumptions used to derive the model’s equation 2.6. The assumption that the decay from the metastate is much faster than the arrival to it is automatically satisfied in this equation. Multiplying k4 , k5 , k6 , and k7 by any common factor leaves the equation unchanged. Therefore, one can choose them to ensure an arbitrarily fast decay (see Figure 1B) without modifying the simulations. However, although the absolute values of k4 , k5 , k6 , and k7 are irrelevant, their relative values matter. We chose the decay through Ei (governed by k7 ) to be only 100 times faster than the other decays, which is biochemically reasonable (Mathews & van Holden, 1990). In contrast, the absolute values of k8 and k9 matter, since they set the maximal rate constants of synaptic changes. Their associated time constants correspond to the order of minutes, which is slow3 but not biochemically unreasonable (Mathews & van Holden, 1990). Finally, besides rate constants, equation 2.6 also has the parameter n. The choice n = 2 means that two feedback-messenger molecules are needed to activate one Ea enzyme, another biochemically reasonable assumption.
3 Slowness implies that Es promotes a high energy-of-activation transition. This would be the case if, for instance, our particles corresponded to (the "bulky") vesicles and the enzymes to proteins involved in vesicle turnover and membrane docking. If such an interpretation were correct, it would point to an interesting relationship between our model and the synaptic-depression-gain-control ideas of Abbott, Varela, Sen, & Nelson (1997). Synaptic depression may involve depletion of vesicles immediately available for transmitter release (Takeuchi, 1958; Thies, 1965).
The parameters of the switch equation (2.5) are also reasonable. Similar to k8 and k9, the time constant for turning off the switch, 1/k2, is also of the order of minutes. In turn, k3 corresponds to a very slow process. It was included only to prevent the active state of the switch from disappearing during quiescence and thus to allow the switch to turn on (see footnote 4). As for k1, this is not strictly speaking a rate constant, since k1's dimension is that of the inverse of the product of time and concentration squared (see equation 2.5). Because k1[mj][Es,j] adds to k3, it is useful to think of k1[mj][Es,j] as the real rate constant. This "rate constant," and thus k1, becomes relevant for the model only when k1[mj][Es,j] ≥ k3. Hence, k1's role is to set the threshold above which the feedback messenger starts turning on the switch. A good working definition of the switch's turning on is when [Es,j] crosses Es,0/2. In this case, if the threshold is that value of mj for which Es,j = 0.5 Es,0 at steady state, then from equation 2.5 and k3 ≪ k2, the threshold occurs when mj ≈ 2k2/(k1 Es,0).

The stimuli in the model simulations were chosen as follows. In LTP and LTD physiological experiments, stimuli consist of trains of short current pulses. These pulses are delivered to elicit action potentials, which are necessary to activate the synapses. However, we are not modeling action potentials, and our synapses can function with continuous stimuli. Hence, we use continuous stimuli with amplitudes Ij. The unit of the amplitude is hertz, and its numerical value is equal to the numerical value of the frequency of the current pulses.

Among the experimental conditions simulated, two key ones include blockade of N-methyl-D-aspartate (NMDA) receptors and the effects of calyculin A, a phosphatase inhibitor. In hippocampus, blockade of NMDA receptors results in the reduction of the Ca2+ concentration in the postsynaptic site (Jahr & Stevens, 1987; Mayer, MacDermott, Westbrook, Smith, & Barker, 1987; Ascher & Nowak, 1988). We propose to identify Ca2+ with the model's postsynaptic agent C (see the definition after equation 2.1). Consequently, in the model, NMDA antagonists should result in a reduction in the concentration of the feedback messenger. The effect of NMDA blockade is thus modeled by reducing the parameter α in equation 2.2. (The percentage of NMDA blockade in Figure 5 is the percentage of α reduction.) To model the effects of calyculin A, one must identify the term in the switch equation (see equation 2.5) that would correspond to the phosphatase. As explained before equation 2.5 and, in more detail, in section 4, for the hippocampus we identify the first and second terms of the right-hand side of equation 2.5 with the phosphatase and an autophosphorylating enzyme, respectively. Therefore, to model the effect of calyculin A, k2 is set to 0 in equation 2.5.
4 If k3 = 0 and there are no inputs to the system for a long time (namely, [mj] = 0), then the k2 term will drive [Es,j] to zero. In this case, the switch cannot turn on; that is, its derivative remains at zero (see equation 2.5).
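For concreteness, here is a minimal numerical sketch (in Python, rather than the C++ of the original simulations) of integrating the weight equation 2.6 with an adaptive-step solver, using the parameter values quoted above. Two caveats: equation 2.5 is not written out in this section, so the switch kinetics below are an assumed form chosen only to reproduce the stated resting state Es,0 k3/(k2 + k3) and turn-on threshold mj ≈ 2k2/(k1 Es,0); and the messenger [mj] is treated as a given stimulus drive rather than computed from equations 2.1 and 2.2. Time is in minutes.

from scipy.integrate import solve_ivp

k1, k2, k3 = 10.0, 0.5, 0.001
k4, k5, k6, k7 = 1.0, 1.0, 1.0, 100.0
k8, k9, n = 0.33, 0.33, 2
Es0, w0 = 1.0, 1.0                       # alpha = beta = gamma = 1, as in the text

def rhs(t, y, m_pulse, t_on):
    Es, w = y
    m = m_pulse if t <= t_on else 0.0    # a single continuous stimulus pulse
    # Assumed switch kinetics (see caveats above): messenger-gated production,
    # first-order turn-off with time constant 1/k2, and a slow k3 leak.
    dEs = (k1 * m * Es + k3) * (Es0 - Es) - k2 * Es
    # Equation 2.6: switch-gated, bidirectional, bounded weight change.
    num = (-k9 * Es * (k4 + k7 * m) * w
           + k8 * Es * (k5 + k6 * m**n) * (w0 - w))
    den = k4 + k5 + k6 * m**n + k7 * m
    return [dEs, num / den]

def final_weight(m_pulse, t_on=1.0, t_end=30.0):
    y0 = [Es0 * k3 / (k2 + k3), w0 / 2]  # initial conditions from the text
    sol = solve_ivp(rhs, (0.0, t_end), y0, args=(m_pulse, t_on),
                    method="LSODA", rtol=1e-8, atol=1e-10)
    return sol.y[1, -1]

print("strong pulse:", round(final_weight(150.0), 3))  # potentiates (w > 0.5)
print("weak pulse:  ", round(final_weight(5.0), 3))    # depresses (w < 0.5)

With these illustrative choices, the weight also drifts partway back toward baseline while the switch is still on after the stimulus ends, mirroring the transient components described in section 4.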
4 Results

This section presents a numerical analysis of the model in the homosynaptic condition, that is, equations 2.1, 2.2, 2.5, and 2.6, with p = 1 (see footnote 2). In the simulations, the phenomena of homosynaptic LTD and LTP can be elicited by applying weak and strong electrical stimulation to the synaptic input, respectively. Three consecutive 1-minute-long 50-Hz trains cause increasing LTP, as shown in Figure 2A. The corresponding LTD effect is shown in Figure 2B using three consecutive 2-minute-long 2-Hz trains. After each train, the excitatory postsynaptic potential (EPSP) slope (taken to be proportional to synaptic gain) changes with respect to the baseline level. The simulated LTP and LTD have transient and long-lasting components. The transience is due to the time it takes for the switch to turn off after the stimulus stops (see Figure 2C). For LTP, the slope increase following each subsequent electrical stimulation diminishes as the synaptic weight approaches its upper saturating level. Similarly, for LTD, the slope decrease diminishes as synaptic weights approach zero. These effects are visible in both the transient and long-lasting components. Both the LTP and LTD are accompanied by transient "turning on" of the enzymatic switch (see Figure 2C). Although the switch rapidly turns on in both cases, it is faster for the LTP condition (which thus saturates more quickly) than for the LTD condition, because larger amounts of messenger are involved in the former condition.

Reversibility of LTP and LTD is illustrated in Figure 3, where a sequence of inputs alternating between strong (3-minute-long 60-Hz trains) and weak (15-minute-long 4-Hz trains) electrical stimulation causes a sequence of responses alternating between LTP and LTD, respectively.

Changes in the EPSP slope resulting from electrical stimulation are assessed through simulations over a range of frequencies (0–50 Hz). The changes for the long-lasting (asymptotic) components and the peak amplitude of the transient component are shown in Figures 4A and 4B, respectively. Five solid curves are shown in these figures. These curves correspond to different impulse durations (1, 2, 5, and 15 minutes) and to infinite duration (step function). (The infinite-duration condition yields the same curve for the long-lasting and transient components. We obtain it analytically after setting the derivatives in equations 2.5 and 2.6 to zero.) The strength of LTP and LTD varies with stimulus duration in such a manner that the crossing point between LTD and LTP always occurs at the same input frequency (with the parameters used, this frequency is 20 Hz). This crossing point is not absolute but depends on the quantity mj, which itself is dependent on synaptic strength and, thus, prior activity. In general, the higher the synaptic strength, the easier LTD becomes, and vice versa (see the discussion after equation 2.6).

To test whether our model is further compatible with basic LTP-LTD phenomenology in hippocampus, two more experimental conditions have been simulated. The first involves the blockade of LTP and LTD with NMDA
Figure 2: Simulation of LTP and LTD. (A) Three consecutive strong inputs (1-minute-long 50-Hz trains, indicated by the filled triangles) caused increasing LTP. This LTP had transient and long-lasting components. (B) In contrast, three consecutive weak inputs (2-minute-long 2-Hz trains, indicated by the arrows) caused increasing LTD. This LTD also had transient and long-lasting components. (C) Both the LTP and LTD were accompanied by transient "turning on" of the enzymatic switch (solid lines for LTD and dashed lines for LTP).
Figure 3: Reversibility of LTP and LTD. A sequence of inputs alternating between strong (3-minute-long 60-Hz trains) and weak (15-minute-long 4-Hz trains) stimulation caused a sequence of responses alternating between LTP and LTD.
antagonists. In hippocampus, the application of NMDA antagonists results in a reduction of the local concentration of postsynaptic Ca2+, and thus, in our model, in a reduction of the concentration of the feedback messenger. As the amount of messenger diminishes, the LTD term in equation 2.6 becomes more predominant. This effect can be verified in Figure 5A, where the crossing point for LTD and LTP shifts toward higher frequencies as the NMDA conductance is reduced. Therefore, for sufficient NMDA blockade, LTP disappears. Moreover, the amount of LTD falls and eventually disappears (see Figure 5B). This is because with little feedback messenger, the enzyme Ei and the switch (see Figure 1) do not work.

The second extra experimental condition deals with the mechanism for the maintenance of synaptic modifications. In our model, a switch is turned on by activity (to allow for synaptic changes) and turned off when there is no activity (so that synaptic changes get frozen). Although this switch could be implemented in many ways, Lisman (1989) advanced the most experimentally successful hypothesis for it so far. He essentially proposed that the turning on of the switch may be due to self-phosphorylating enzymes. Hence, a phosphatase would turn it off. In support, administration of calyculin A, a phosphatase inhibitor, eliminates LTD maintenance (Mulkey, Herron, & Malenka, 1993). (The mechanism for LTP maintenance seems to be less clear; Malinow, Madison, & Tsien, 1988.) Our model is consistent
Figure 4: Frequency dependence. The amounts of LTP and LTD were calculated as a function of input frequency, parameterized by stimulus duration (1, 2, 5, 15, and ∞ minutes). The calculations were performed for both the asymptotic behavior (A) and peak LTP and LTD (B). The amount of LTP and LTD increased with stimulus duration in such a manner that the crossing point between LTD and LTP always occurred at the same input frequency.
Figure 5: NMDA blockade. (A) When the NMDA conductances were reduced by the percentage indicated in the various curves, the curves shifted to the right. Hence, for sufficient NMDA blockade, LTP disappeared at reasonable input frequencies. Moreover, the amount of LTD fell, eventually disappearing. These changes were such that LTP disappeared before LTD as a function of NMDA blockade. To illustrate this point more clearly, (B) shows synaptic changes as a function of the percentage of NMDA blockade for the 80-Hz input frequency, indicated by a dashed line in (A).
Figure 6: Administration of calyculin A, a phosphatase inhibitor, at the times indicated by the horizontal solid line reversed the LTD generated by three low-frequency inputs (2-minute-long 2-Hz trains). Under this drug, the synaptic strength returned to baseline even in the absence of any stimulus.
with this phenomenology, because the second term of the right-hand side of equation 2.5 implements an autocatalytic reaction and the first term, a breakdown of this reaction. Consequently, one can think of the second term as an autophosphorylating enzyme and of the first term as a phosphatase. A simple way to model the switch being permanently on upon application of a phosphatase inhibitor (such as calyculin A) is to set k2 = 0 in equation 2.5. The resulting effect is shown in Figure 6, where LTD has been induced by three 2-minute-long low-frequency trains and was reset to near baseline by application of calyculin A.

5 Discussion

We describe synaptic changes as being mediated by a retrograde messenger, which controls a particle's activation state presynaptically. This control is through manipulation of an enzyme that activates and another that inactivates this particle. These enzymes yield LTP or LTD depending on the concentration of the feedback messenger. The messenger also controls an enzymatic switch that prevents synaptic changes from occurring in the absence of pre- or postsynaptic activity. In turn, this messenger depends on the concentration of a postsynaptic agent, C, which is proportional to the product of the postsynaptic response and the level of incoming transmitter.
5.1 Properties of the Model and Data Accountability. In our bidirectional Hebbian rule, synapses have bounded nonnegative weights and do not change these weights in the absence of pre- and postsynaptic activity. These two properties are not simultaneously shared by any of the rules extant in the literature. Furthermore, the new rule explicitly states the biophysical mechanisms of bidirectionality, bounding, and synaptic stability.

From the computational perspective, it is interesting that our rule shares the key property of the BCM rule (Bienenstock et al., 1982; Yang & Faber, 1991; Intrator & Cooper, 1992). That rule has a variable threshold deciding between LTP and LTD as a function of a nonlinear statistic of the past activity. This form of the variable threshold has some useful computational properties, as shown by Intrator and Cooper (1992). As discussed after equation 2.6, our rule has a variable threshold deciding between LTP and LTD, which is a function of the actual value of the weight. The way the threshold varies with past activity is qualitatively similar in both rules.

From the mechanistic perspective, if one identifies C with Ca2+ entering the postsynaptic site via NMDA channels (Jahr & Stevens, 1987; Mayer et al., 1987; Ascher & Nowak, 1988) and the enzymatic switch with a system of phosphatase and self-phosphorylating enzymes, then our rule accounts for much of the LTP and LTD hippocampal phenomenology. Consistent with physiological data (Xie et al., 1992; Dudek & Bear, 1993; Heynen, Abraham, & Bear, 1996), the simulated LTP and LTD have transient and long-lasting components. Also in agreement with physiology, LTP and LTD are reversible (Heynen et al., 1996; Dudek & Bear, 1993; Mulkey et al., 1993), as illustrated in Figure 3. The strength of LTP and LTD varies with stimulus duration in such a manner that the crossing point between LTD and LTP always occurs at the same input frequency (Artola & Singer, 1993). Complete blockade of NMDA receptors prevents the LTP and LTD from happening (Morris, Anderson, Lynch, & Baudry, 1986; Goldman, Chavez-Noriega, & Stevens, 1990; Dudek & Bear, 1992). In addition, a bias in the amount of LTP and LTD is induced by either partial blockade of NMDA receptors (Cummings, Mulkey, Nicoll, & Malenka, 1996) or postsynaptic depolarization (simulations not shown; Artola, Bröcher, & Singer, 1990; Xie et al., 1992; Lin, Way, & Gean, 1993). A phosphatase inhibitor, calyculin A, eliminates the maintenance of LTD (Mulkey et al., 1993). And prior synaptic activity influences the induction of subsequent LTP and LTD systematically (Dudek & Bear, 1992; Huang, Colino, Selig, & Malenka, 1992; Wexler & Stanton, 1993).

5.2 Limitations of the Model. We modeled our Hebbian rule as if LTP and LTD are expressed presynaptically. Evidence for a presynaptic site of gain control comes from studies on the probability of synaptic quantal events (del Castillo & Katz, 1954). In these studies, such a probability increases as synapses are potentiated (Bekkers & Stevens, 1990; Malinow, 1991) and decreases when they are depressed (Stevens & Wang, 1994; Oliet, Malenka, & Nicoll, 1996). These changes in frequency of quantal events
are interpreted as reflecting changes in the release of transmitter, indicating presynaptic changes. However, because there is also evidence against an increase in transmitter release (Manabe & Nicoll, 1994) and some evidence for an increase in quantal amplitude in LTP (Kullman & Nicoll, 1992; Oliet, Malenka, & Nicoll, 1996), the presynaptic interpretation remains somewhat controversial. We do not regard our presynaptic assumption as essential. The postsynaptic agent C could instead control activating and inactivating enzymes (such as our enzymes Ea and Ei) involved in the effectiveness of postsynaptic receptors. Similarly, the enzymatic switch complex could exist postsynaptically.

Another limitation of the model has to do with the explanation of potentiation transients (see Figure 2A). The mechanisms for these transients include synaptic facilitation, augmentation, and short-term potentiation (STP; Feng, 1941; Liley, 1956; Magleby & Zengel, 1976). In hippocampus, these mechanisms can be induced without LTP (Malenka & Nicoll, 1993). Nevertheless, as far as we know, there is no evidence against the postulated enzymatic switch contributing to part of the potentiation's transient as in our simulations. Furthermore, although STP can account for part of the potentiation transient, STP cannot account for LTD's transient. In contrast, the new bidirectional Hebbian rule can account for the experimentally observed LTD transients.

The LTP-LTD model also does not take into account nonlinear Ca2+ buffering (Holmes & Levy, 1990), voltage-dependent Ca2+ channels, or the details of glutamate and NMDA receptor functions (Madison, Malenka, & Nicoll, 1991; Clements, 1996). (As stated in section 1, the essence of this model is in the details of the Ea, Ei, and Es enzymes. The other mechanisms in the model are chosen for computational simplicity to be linear.) Therefore, it is unrealistic to assume that the postsynaptic Ca2+ is proportional to the concentration of transmitter. Nevertheless, we confirmed that our qualitative conclusions hold for reasonable monotonic nonlinear relationships between the levels of incoming transmitter and Ca2+.

5.3 Is There a Switch? Is an enzymatic switch a reasonable feature of the model? An argument against this switch comes from the widely held view that synaptic remodeling in cortex engages the same mechanisms as LTP-LTD in hippocampus (Artola & Singer, 1993; Hirsch & Gilbert, 1993; Kirkwood, Dudek, Gold, Aizenman, & Bear, 1993; Castro-Alamancos, Donoghue, & Connors, 1995). Because a deprived eye loses cortical representation (Wiesel & Hubel, 1963; Baker, Grigg, & von Noorden, 1974), this suggests that silent terminals can have weights changed. However, the deprived eye sends spontaneous signals that may be sufficient to turn the switch on. There is evidence that spontaneous maintained discharge from the eye has an important role in the normal development of visual cortical binocularity. Binocular injection of tetrodotoxin, but not binocular visual deprivation, prevents segregation of the geniculocortical afferents serving the two eyes (Stryker &
Harris, 1986; Antonini & Stryker, 1993). Hence, deprivation results are not inconsistent with the switch, and there is evidence favoring it. As pointed out in section 4 for LTD, the switch's assumption receives direct support from experiments involving phosphatase inhibitors (Mulkey et al., 1993; see footnote 5). In addition, as discussed in section 5.2, at least part of the transience of LTP and LTD may be a consequence of the switch. Another piece of evidence in favor of a switch (though less direct) comes from experiments on the reversibility of LTP in the hippocampus. LTP is induced at synapses where the pre- and postsynaptic activities are correlated, and the signal for its induction is postsynaptic Ca2+ influx. Hence, the interruption of this influx through blockade of the NMDA receptor should induce LTD when the presynaptic activity is high. However, such an experiment fails to produce significant LTD (Goldman et al., 1990). Such a failure is consistent with our model, for which lack of postsynaptic Ca2+ impedes the switch from turning on and consequently prevents any synaptic changes from occurring.

Appendix

This appendix derives equation 2.6 from the enzymatic reactions shown in Figure 1B and the assumption that the decay from the metastate is much faster than the arrival to it. Instead of using symbols for the concentrations of the active and inactive states of the gating molecule, we work directly with w and w∗, respectively, to simplify notation. These quantities are proportional to the above concentrations, but we absorb the constants of proportionality into the rate constants. Similarly, we use the variable w∗∗ to quantify the metastate. The kinetic diagrams for the reactions are:

Es + w --k9--> Es + w∗∗
Es + w∗ --k8--> Es + w∗∗
w∗∗ --k5--> w
w∗∗ --k4--> w∗
Ea + w∗∗ --k6--> Ea + w
Ei + w∗∗ --k7--> Ei + w∗.   (A.1)

From these diagrams, the differential equations for the metastate and active state are

dw∗∗/dt = −(k4 + k5 + k6[Ea] + k7[Ei])w∗∗ + k9[Es]w + k8[Es]w∗   (A.2)
5 Our model is not consistent with the failure of the phosphatase inhibitor, calyculin A, to eliminate LTP maintenance. One possibility is that two separate switches exist for LTP and LTD, with the former not involving a phosphatase.
and

dw/dt = −k9[Es]w + (k5 + k6[Ea])w∗∗,   (A.3)
respectively. The assumption that the decay from the metastate is much faster than the arrival to it is akin to w and w∗ changing on time scales slow relative to the metastate and to w∗∗ remaining in equilibrium with w and w∗, adjusting on the fast time scale of its decay. Consequently, it is reasonable to regard w∗∗ as being in a steady state on the time scales over which w and w∗ change, allowing us to make the approximation dw∗∗/dt = 0. (This approximation is similar to what is done in the derivation of the Michaelis-Menten equation; Walsh, 1977.) Using this approximation, solving for w∗∗ in equation A.2, and inserting the result into equation A.3, we get

dw/dt = ( −(k4 + k7[Ei])k9[Es]w + (k5 + k6[Ea])k8[Es]w∗ ) / ( k4 + k5 + k6[Ea] + k7[Ei] ).   (A.4)
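The quasi-steady-state algebra behind equation A.4 can also be checked symbolically; the short Python sketch below (variable names are illustrative) confirms only this substitution step, with the final change of variables to equation 2.6 following in the next paragraph.

import sympy as sp

k4, k5, k6, k7, k8, k9 = sp.symbols('k4:10', positive=True)
Ea, Ei, Es, w, ws, wmm = sp.symbols('E_a E_i E_s w w_star w_meta',
                                    nonnegative=True)

# Equation A.2 at quasi-steady state: set d(w**)/dt = 0 and solve for w**.
wmm_ss = sp.solve(sp.Eq(0, -(k4 + k5 + k6*Ea + k7*Ei)*wmm
                           + k9*Es*w + k8*Es*ws), wmm)[0]

# Substitute the steady-state metastate into equation A.3 for dw/dt.
dwdt = sp.together(-k9*w*Es + (k5 + k6*Ea)*wmm_ss)

# Right-hand side of equation A.4.
a4 = (-(k4 + k7*Ei)*k9*Es*w + (k5 + k6*Ea)*k8*Es*ws) \
     / (k4 + k5 + k6*Ea + k7*Ei)

assert sp.simplify(dwdt - a4) == 0
print("equation A.4 confirmed")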
Using w∗ = w0 − w and substituting βm and γm^n for [Ei] and [Ea], respectively, where β and γ are constants, we obtain equation 2.6. These last substitutions are valid under the hypothesis that the enzymes Ei and Ea are available in large amounts and that their activation kinetics are much faster than m's dynamics.

Acknowledgments

This work was supported by grants from the National Eye Institute (EY08921 and EY11170) and by the William A. Kettlewell Chair to N.M.G., by a grant from the Swiss National Fund for Scientific Research (8220-37180) to P.-Y.B., and by a core grant from the National Eye Institute to Smith-Kettlewell (EY06883).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Antonini, A., & Stryker, M. P. (1993). Development of individual geniculocortical arbors in cat striate cortex and effects of binocular impulse blockade. J. Neurosci., 13, 3549–3573.
Artola, A., Bröcher, S., & Singer, W. (1990). Different voltage-dependent thresholds for inducing long-term depression and long-term potentiation in slices of rat visual cortex. Nature, 347, 69–72.
Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. TINS, 16, 480–487.
Ascher, P., & Nowak, L. (1988). The role of divalent cations in the N-methyl-D-aspartate responses of mouse central neurones in culture. J. Physiol., 399, 247–266.
Baker, F. H., Grigg, P., & von Noorden, G. K. (1974). Effects of visual deprivation and strabismus on the responses of neurons in the visual cortex of the monkey, including studies on the striate and prestriate cortex in the normal animal. Brain Res., 66, 185–208.
Bekkers, J. M., & Stevens, C. F. (1990). Presynaptic mechanism for long-term potentiation in the hippocampus. Nature, 346, 724–729.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Burgi, P.-Y., & Grzywacz, N. M. (1997). Possible roles of spontaneous waves and dendritic growth for retinal receptive field development. Neural Comput., 9, 533–553.
Burgi, P.-Y., & Grzywacz, N. M. (in press). A biophysical model for the developmental time course of retinal orientation selectivity. Vis. Res.
Castro-Alamancos, M. A., Donoghue, J. P., & Connors, B. W. (1995). Different forms of synaptic plasticity in somatosensory and motor areas of the neocortex. J. Neurosci., 15, 5324–5333.
Cherubini, E., Rovira, C., Gaiarsa, J. L., Corradetti, R., & Ari, Y. B. (1990). GABA mediated excitation in immature rat CA3 hippocampal neurons. Int. J. Devl. Neurosci., 8, 481–490.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.
Clements, J. D. (1996). Transmitter timecourse in the synaptic cleft: Its role in central synaptic function. Trends Neurosci., 19, 163–171.
Cummings, J. A., Mulkey, R. M., Nicoll, R. A., & Malenka, R. C. (1996). Ca2+ signaling requirements for long-term depression in the hippocampus. Neuron, 16, 825–833.
del Castillo, J., & Katz, B. (1954). Quantal components of the end-plate potential. J. Physiol., 124, 560–573.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Dudek, S. M., & Bear, M. F. (1993). Bidirectional long-term modification of synaptic effectiveness in the adult and immature hippocampus. J. Neurosci., 13, 2910–2918.
Feng, T. P. (1941). Studies on the neuromuscular junction. XXVI. The changes of the end plate potential during and after prolonged stimulation. Chin. J. Physiol., 16, 341–372.
Frégnac, Y., Burke, J. P., Smith, D., & Friedlander, M. J. (1994). Temporal covariance of pre- and post-synaptic activity regulates functional connectivity in the visual cortex. J. Neurophysiol., 71, 1403–1421.
Goldman, R. S., Chavez-Noriega, L. E., & Stevens, C. F. (1990). Failure to reverse long-term potentiation by coupling sustained presynaptic activity and N-
methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 87, 7165–7169.
Grzywacz, N. M., & Burgi, P.-Y. (1994). A Hebbian rule consistent with physiology and producing sharp refinement of organization. Abstr. Third Ann. Comput. Neural Syst. Meeting (p. 38). Monterey, CA.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Heynen, A. J., Abraham, W. C., & Bear, M. F. (1996). Bidirectional modification of CA1 synapses in the adult hippocampus in vivo. Nature, 381, 163–166.
Hirsch, J. A., & Gilbert, C. D. (1993). Long-term changes in synaptic strength along specific intrinsic pathways in the cat visual cortex. J. Physiol., 461, 247–262.
Holmes, W. R., & Levy, W. B. (1990). Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol., 63, 1148–1168.
Huang, Y.-Y., Colino, A., Selig, D. K., & Malenka, R. C. (1992). The influence of prior synaptic activity on the induction of long-term potentiation. Science, 255, 730–733.
Intrator, N., & Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Net., 5, 3–17.
Jahr, C. E., & Stevens, C. F. (1987). Glutamate activates multiple single channel conductances in hippocampal neurones. Nature, 325, 522–525.
Kirkwood, A., Dudek, S. M., Gold, J. T., Aizenman, C. D., & Bear, M. F. (1993). Common form of synaptic plasticity in the hippocampus and neocortex in vitro. Science, 260, 1518–1521.
Kullman, D. M., & Nicoll, R. A. (1992). Long-term potentiation is associated with increases in quantal content and quantal amplitude. Nature, 357, 240–244.
Liley, A. W. (1956). The quantal components of the mammalian end-plate potentials. J. Physiol. Lond., 133, 571–587.
Lin, J. H., Way, L.-J., & Gean, P.-W. (1993). Pairing of pre- and post-synaptic activities in hippocampal CA1 neurons induces long-term modifications of NMDA receptor-mediated synaptic potential. Brain Res., 603, 117–120.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, 83, 8390–8394.
Lisman, J. (1989). A mechanism for the Hebb and the anti-Hebb processes underlying learning and memory. Proc. Natl. Acad. Sci. USA, 86, 9574–9578.
Madison, D. V., Malenka, R. C., & Nicoll, R. A. (1991). Mechanisms underlying long-term potentiation of synaptic transmission. Ann. Rev. Neurosci., 14, 379–397.
Magleby, K. L., & Zengel, J. E. (1976). Augmentation: A process that acts to increase transmitter release at the frog neuromuscular junction. J. Physiol. Lond., 257, 449–470.
Malenka, R. C., & Nicoll, R. A. (1993). NMDA-receptor-dependent synaptic plasticity: Multiple forms and mechanisms. Trends Neurosci., 16, 521–527.
Malinow, R. (1991). Transmission between pairs of hippocampal slice neurons: Quantal levels, oscillations, and LTP. Science, 252, 722–724.
Malinow, R., Madison, D. V., & Tsien, R. W. (1988). Persistent protein kinase activity underlying long-term potentiation. Nature, 335, 820–824.
Manabe, T., & Nicoll, R. A. (1994). Long-term potentiation: Evidence against an increase in transmitter release probability in the CA1 region of the hippocampus. Science, 265, 1888–1893.
Mathews, C. K., & van Holden, K. E. (1990). Biochemistry. Redwood City, CA: Benjamin/Cummings.
Mayer, M. L., MacDermott, A. B., Westbrook, G. L., Smith, S. J., & Barker, J. L. (1987). Agonist- and voltage-gated calcium entry in cultured mouse spinal cord neurons under voltage clamp measured using arsenazo III. J. Neurosci., 7, 3230–3244.
Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441.
Morris, R. G., Anderson, E., Lynch, G. S., & Baudry, M. (1986). Selective impairment of learning and blockade of long-term potentiation by an N-methyl-D-aspartate receptor antagonist, AP5. Nature, 319, 774–776.
Mulkey, R. M., Herron, C. E., & Malenka, R. C. (1993). An essential role for protein phosphatases in hippocampal long-term depression. Science, 261, 1051–1055.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Oliet, S. H. R., Malenka, R. C., & Nicoll, R. A. (1996). Bidirectional control of quantal size by synaptic activity in the hippocampus. Science, 271, 1294–1297.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Stanton, P. K., & Sejnowski, T. J. (1989). Associative long-term depression in the hippocampus induced by Hebbian covariance. Nature, 339, 531–533.
Stent, G. S. (1973). A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. USA, 70, 997–1001.
Stevens, C. F., & Wang, Y. (1994). Changes in reliability of synaptic function as a mechanism for plasticity. Nature, 371, 704–707.
Stryker, M. P., & Harris, W. A. (1986). Binocular impulse blockade prevents the formation of ocular dominance columns in cat visual cortex. J. Neurosci., 6, 2117–2133.
Takeuchi, A. (1958). The long-lasting depression in neuromuscular transmission of frog. Jpn. J. Physiol., 8, 102–113.
Thies, R. E. (1965). Neuromuscular depression and the apparent depletion of transmitter in mammalian muscle. J. Neurophysiol., 28, 427–442.
von der Malsburg, C. (1973). Self organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Walsh, C. T. (1977). Enzymatic reaction mechanisms. San Francisco: W. H. Freeman.
Walton, M. K., Schaffner, A. E., & Barker, J. L. (1993). Sodium channels, GABAA receptors, and glutamate receptors develop sequentially on embryonic rat spinal cord cells. J. Neurosci., 13, 2068–2084.
Wexler, E. M., & Stanton, P. K. (1993). Priming of homosynaptic long-term depression in hippocampus by previous synaptic activity. NeuroReport, 4, 590–594.
Wiesel, T. N., & Hubel, D. H. (1963). Single-cell responses in striate cortex of kittens deprived of vision in one eye. J. Neurophysiol., 26, 1003–1017.
Xie, X., Berger, T. W., & Barrionuevo, G. (1992). Isolated NMDA receptor-mediated synaptic responses express both LTP and LTD. J. Neurophysiol., 67, 1009–1013.
Yang, X.-D., & Faber, D. S. (1991). Initial synaptic efficacy influences induction and expression of long-term changes in transmission. Proc. Natl. Acad. Sci. USA, 88, 4299–4303.

Received October 22, 1996; accepted June 21, 1997.
NOTE
Communicated by Richard Cone
Axon Guidance: Stretching Gradients to the Limit Geoffrey J. Goodhill Georgetown Institute for Cognitive and Computational Sciences, Georgetown University Medical Center, Washington, DC 20007, U.S.A.
Herwig Baier Department of Biology, University of California, San Diego, La Jolla, CA 92093-0366, U.S.A.
Neuronal growth cones, the sensory-motile structures at the tips of developing axons, navigate to their targets over distances that can be many times greater than their diameter. They may accomplish this impressive task by following spatial gradients of axon guidance molecules in their environment (Bonhoeffer & Gierer, 1984; Tessier-Lavigne & Placzek, 1991; Baier & Bonhoeffer, 1994). We calculate the optimal shape of a gradient and the distance over which it can be detected by a growth cone for two competing mechanistic models of axon guidance. The results are surprisingly simple: Regardless of the mechanism, the maximum distance is about 1 cm. Since gradients and growth cones have coevolved, we suggest that the shape of the gradient in situ will predict the mechanism of gradient detection. In addition, we show that the experimentally determined dissociation constants for receptor-ligand complexes implicated in axon guidance are about optimal with respect to maximizing guidance distance. The relevance of these results to the retinotectal system is discussed.

1 Introduction

The mechanisms that guide axons to appropriate targets in the developing brain are largely unknown. A popular notion, first suggested by Cajal, is that spatial gradients of axon guidance molecules are detected by the growth cone and provide directional information. Experimental evidence for the existence of such mechanisms is gradually mounting. However, so far there has been little consideration of the theoretical limits on axon guidance by gradients imposed by physical limits on the detection of a concentration difference across a small sensing device. Here, using a few pieces of experimental data and some simple approximations, we address these limits.

For a growth cone to be guided by a gradient, it must be able to sense a sufficiently large difference in ligand concentration over its length. The ligand may be attractive or repellent, and may be substrate bound, freely
diffusing, or a combination of both. Two possible mechanisms for gradient detection by a growth cone are (1) internal amplification of a small percentage change in external concentration across the width w of the growth cone (Bonhoeffer & Gierer, 1984; Gierer, 1987), and (2) a shifting internal baseline that reduces the effective concentration at one edge of the growth cone to zero (Walter, Allsop, & Bonhoeffer, 1990). Gradient detection by the former mechanism requires a sufficiently high percentage change p in concentration over distance w, while the latter requires a sufficiently high absolute concentration difference ΔC over w.

Three additional constraints limit gradient detection. First, the local external concentration must be less than a critical value Chigh, at which most receptors are saturated. Second, it must be greater than a critical value Clow, at which an insufficient number of receptors are bound to overcome noise. Clow and Chigh vary relative to the dissociation constant kd for the receptor-ligand complex. Third, the local concentration must also be greater than a physical limit Cnoise, which is kd independent. At this concentration, the number of ligand molecules in the vicinity of the growth cone is so small that over the time scales of relevance to the growth cone, thermally induced fluctuations wash out the gradient signal (Tranquillo & Lauffenburger, 1987).

2 Maximum Guidance Distance

What is the maximum range rmax for which guidance is possible for the two mechanisms above? The optimal gradient for case 1 has a constant fractional change across the width of the growth cone w for all positions: an exponential gradient. Consider C(r) = C0 e^(−ar), where C is concentration, r is distance, and C0 and a are constants. Requiring a percentage change of p (= ΔC/C) across distance w yields a = p/w. The maximum distance for which C ≥ Clow is achieved when C0 = Chigh. This gives

rmax = (w/p) loge(Chigh/Clow).   (2.1)
The optimal gradient for case 2 has a constant absolute concentration change across the width of the growth cone: a linear gradient. Consider C(r) = C0 − ar. Requiring a concentration change of ΔC over distance w yields a = ΔC/w. Again the optimal value of C0 is Chigh. For the analogous case of leukocyte chemotaxis, it is known that sensitivity to gradients is optimized when the external concentration is equal to the dissociation constant kd of the relevant receptor (Devreotes & Zigmond, 1988), which yields ΔC = p kd. This gives

rmax = (w/p) (Chigh − Clow)/kd.   (2.2)
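Plugging in the representative values quoted in the next paragraph (w = 20 µm, p = 2 percent, Clow = kd/100, Chigh = 10 kd) gives a quick numerical check of equations 2.1 and 2.2; the Python sketch below simply restates the text's arithmetic.

import math

w = 20e-6                 # growth-cone width, in meters
p = 0.02                  # minimum detectable fractional change
Chigh_over_kd = 10.0
Clow_over_kd = 0.01

# Exponential gradient (fractional-change mechanism), equation 2.1.
r_exp = (w / p) * math.log(Chigh_over_kd / Clow_over_kd)

# Linear gradient (absolute-difference mechanism), equation 2.2.
r_lin = (w / p) * (Chigh_over_kd - Clow_over_kd)

print(f"exponential: {100 * r_exp:.2f} cm, linear: {100 * r_lin:.2f} cm")
# -> roughly 0.7 cm and 1.0 cm, matching the estimates below.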
What are plausible parameter values? We assume a growth cone diameter w (including filopodia) of 20 µm. Direct evidence (Baier & Bonhoeffer, 1992), analogous data for leukocyte chemotaxis (Devreotes & Zigmond, 1988), and theoretical considerations (Tranquillo & Lauffenburger, 1987) suggest that p is about 2 percent. Data for leukocyte chemotaxis suggest that Clow ≈ kd/100 and Chigh ≈ 10kd (the asymmetry is due to down-regulation of receptors at high external concentrations) (Zigmond, 1981). Assuming Clow > Cnoise yields rmax ≈ 0.7 cm for the exponential case (see equation 2.1) and rmax ≈ 1 cm for the linear case (see equation 2.2). Note that these values scale linearly with the size of the growth cone and do not depend on kd. The calculation assumes that the growth cone can detect p = 2 percent for Chigh ≥ C ≥ Clow, whereas in fact it is likely that p needs to be much larger away from C = kd. Correcting for this would reduce rmax in both cases. Similarly, if growth cones employ a combination of the two mechanisms, rmax would again be reduced: 1 cm is thus an upper bound.

Three obvious scenarios for how axons could be guided over larger distances are as follows. First, there could exist a series of spaced gradients of different ligands, each binding to the same or different receptors and guiding the growth cone over only a portion of the full distance. Second, there could exist overlaid gradients of different ligands, each competing for occupancy of the same receptor. Appropriate differences in affinity would allow guidance in multiple regions. Third, there could exist multiple receptors on the growth cone for the same ligand, with different affinities. Each would guide the growth cone over the segment of the gradient lying within its appropriate concentration range. Note that these considerations apply to attractant as well as repellent guidance molecules, or to combinations of both.

3 Noise Limits to Receptor-Ligand Affinity

To maximize guidance distance, it is clearly necessary to choose Clow > Cnoise. An accurate calculation of Cnoise requires knowledge of parameters such as the length of time over which an axon integrates signals from its receptors before assessing a gradient value, which has not been measured. Here instead a conservative order-of-magnitude estimate for Cnoise is made. We assume, as an extremely rough estimate, that 100 molecules in the vicinity of the growth cone are sufficient for a 2 percent gradient to be detected. This means that the growth cone can distinguish 50 molecules on one side from 51 on the other. Imagine that the growth cone plus filopodia occupies a cube of side length 20 µm; this has a volume of approximately 10^−11 liters. One hundred ligand molecules in this volume correspond to a ligand concentration CL ≈ 0.01 nM (note that the proportion of the cube occupied by the body of the growth cone, and thus unavailable to the ligand molecules, is small). Equating this with the lower limit due to the dynamics of receptor binding, Clow = kd/100, yields kd ≈ 1 nM. We suggest that a kd of very
roughly 1 nM represents a lower limit for axon guidance receptor-ligand complexes (see footnote 1). A receptor-ligand affinity significantly higher than this (i.e., kd ≪ 1 nM) would not improve the accuracy of gradient reading. A significantly lower affinity would require comparatively large amounts of factor to be produced. An alternative reverse-engineering argument based on the same principle is that the kd of the receptor-ligand complex could predict the actual signal-to-noise requirements of gradient reading.

4 Applications to the Retinotectal System

Two recently identified repellent axon guidance molecules are believed to be involved in the formation of the retinotectal projection: ephrin-A5 (Drescher et al., 1995) and ephrin-A2 (Cheng, Nakamoto, Bergemann, & Flanagan, 1995; Nakamoto et al., 1996). Both are expressed as gradients in the chick optic tectum, and both bind to one family of receptors, some members of which are expressed on retinal growth cones (for review, see Friedman & O'Leary, 1996a). The ephrin-A2 gradient spans the entire tectum, while the ephrin-A5 gradient is shifted posteriorly in the tectum, being absent from the anterior tectum (where retinal axons enter) (Cheng et al., 1995; Drescher et al., 1995; Nakamoto et al., 1996). kd values have recently been measured in vitro for ephrin-A5 and ephrin-A2 for three growth cone receptors: EphA3, EphA5, and EphA4. These values are as follows, for ephrin-A5 and ephrin-A2, respectively: EphA3: 0.144 nM/0.86 nM; EphA5: 0.616 nM/8.62 nM; EphA4: 0.622 nM/12.7 nM (Monschau et al., 1997). In each case, the value for ephrin-A2 is roughly an order of magnitude higher than that for ephrin-A5.

The chick optic tectum extends over 6–9 mm during formation of the retinotectal map. The distance that the farthest projecting retinal growth cones have to travel across its (bent) surface is well over 1 cm. Our calculations predict that if retinal axons are guided within the tectum solely by gradient mechanisms, then some method for extending guidance must be operating (see footnote 2). We suggest that retinal growth cones could use the same receptor(s) for both ephrin-A2 and ephrin-A5, with the low-affinity ephrin-A2 gradient providing guidance in the anterior tectum, the high-affinity ephrin-A5 gradient providing guidance in the posterior tectum, and a combination of both gradients providing guidance in the middle. In addition,
1 This calculation applies to both substrate-bound and freely diffusing ligands and also analogously to the sensing of a gradient on a two-dimensional surface.
2 The situation is apparently more involved: only the nasal-most retinal axons traverse the entire tectum. The more temporal the axons' site of origin in the retina, the farther anteriorly they terminate in the tectum. This graded response to tectal cues, such as ephrin-A5, is possibly reflected by a gradient of receptor level, such as EphA3 (Drescher et al., 1995), in the retina. However, temporal axons are able to navigate to their appropriate tectal target if misrouted or surgically displaced, suggesting that they can utilize gradient information in tectal regions that they normally do not encounter.
the affinity values for ephrin-A5 and ephrin-A2 given above are all within an order of magnitude of our theoretical lower limit of 1 nM, which is reasonable agreement given the crudeness of our calculation. (However, these are in vitro measurements, which may differ from values in vivo.)

5 Regulation of Gradient Shape

An unresolved issue of both biological and theoretical interest is how gradient shape can be regulated in an embryonic field (Crick, 1970). Some axon guidance molecules, like netrin-1 (Kennedy et al., 1994; Serafini et al., 1994), are diffusible factors that are secreted by target cells (Tessier-Lavigne & Placzek, 1991; Kennedy et al., 1994). Simple diffusion yields gradients that are inefficient when growth cones have to traverse distances greater than 1 mm (Tessier-Lavigne & Placzek, 1991; Goodhill, 1997). Binding of the factor to the substrate (e.g., the extracellular matrix) could modify the shape of the gradient to maximize the distance and optimize the accuracy of guidance. The positional information conferred by the gradients of ephrin-A2 and ephrin-A5 in the tectum is initially set up by gradients of morphogens (Crick, 1970) and by transcription factors such as en-1 or en-2 (Itasaki & Nakamura, 1996; Logan et al., 1996; Friedman & O'Leary, 1996b). The local concentrations of these have to be translated into local concentrations of guidance molecules. The translation mode is unknown, but we expect, given the size constraints discussed here, that nature has made some effort to optimize it.

6 Conclusions

For the two possible mechanisms of gradient detection across the width of the growth cone (measuring a fractional change versus a difference from an adjustable baseline), the maximum guidance distance is surprisingly similar (0.7–1.0 cm). However, the shape of the optimal gradient is different in the two cases (exponential versus linear). Therefore, it should be possible to predict the actual gradient-reading mechanism by accurately measuring the shape of gradients of axon guidance protein in situ. Our result also has important implications for the scalability of axon guidance mechanisms to animals substantially larger than the rats and chickens that are most commonly studied.

Acknowledgments

We thank Uwe Drescher, Friedrich Bonhoeffer, and colleagues for sharing results prior to publication. G. J. G. thanks Dennis D. M. O'Leary for very helpful discussions. H. B. thanks Bill Harris and Christine Holt for helpful discussions and support. H. B. was funded by the Alexander von Humboldt Foundation.
References

Baier, H., & Bonhoeffer, F. (1992). Axon guidance by gradients of a target-derived component. Science, 255, 472–475.
Baier, H., & Bonhoeffer, F. (1994). Attractive axon guidance molecules. Science, 265, 1541–1542.
Bonhoeffer, F., & Gierer, A. (1984). How do retinal axons find their target on the tectum? Trends Neurosci., 7, 378–381.
Cheng, H. J., Nakamoto, M., Bergemann, A. D., & Flanagan, J. G. (1995). Complementary gradients in expression and binding of Elf-1 and Mek4 in development of the topographic retinotectal projection map. Cell, 82, 371–381.
Crick, F. H. C. (1970). Diffusion in embryogenesis. Nature, 255, 420–422.
Devreotes, P. N., & Zigmond, S. H. (1988). Chemotaxis in eukaryotic cells: A focus on leukocytes and Dictyostelium. Ann. Rev. Cell. Biol., 4, 649–686.
Drescher, U., Kremoser, C., Handwerker, C., Löschinger, J., Noda, M., & Bonhoeffer, F. (1995). In-vitro guidance of retinal ganglion-cell axons by RAGS, a 25 kDa tectal protein related to ligands for Eph receptor tyrosine kinases. Cell, 82, 359–370.
Friedman, G. C., & O'Leary, D. D. M. (1996a). Eph receptor tyrosine kinases and their ligands in neural development. Curr. Opin. Neurobiol., 6, 127–133.
Friedman, G. C., & O'Leary, D. D. M. (1996b). Retroviral misexpression of engrailed genes in the chick optic tectum perturbs the topographic targeting of retinal axons. J. Neurosci., 16, 5498–5509.
Gierer, A. (1987). Directional cues for growing axons forming the retinotectal projection. Development, 101, 479–489.
Goodhill, G. J. (1997). Diffusion in axon guidance. European Journal of Neuroscience, 9, 1414–1421.
Itasaki, N., & Nakamura, H. (1996). A role for gradient en expression in positional specification on the optic tectum. Neuron, 16, 55–62.
Kennedy, T. E., Serafini, T., de la Torre, J. R., & Tessier-Lavigne, M. (1994). Netrins are diffusible chemotropic factors for commissural axons in the embryonic spinal cord. Cell, 78, 425–435.
Logan, C., Wizenmann, A., Drescher, U., Monschau, B., Bonhoeffer, F., & Lumsden, A. (1996). Rostral optic tectum acquires caudal characteristics following ectopic engrailed expression. Current Biology, 6, 1006–1014.
Monschau, B., Kremoser, C., Ohta, K., Tanaka, H., Kaneko, T., Yamada, T., Handwerker, C., Hornberger, M. R., Löschinger, J., Pasquale, E. B., Siever, D. A., Verderame, M. F., Müller, B., Bonhoeffer, F., & Drescher, U. (1997). Shared and distinct functions of RAGS and ELF-1 in guiding retinal axons. EMBO Journal, 16, 1258–1267.
Nakamoto, M., Cheng, H. J., Friedman, G. C., Mclaughlin, T., Hansen, M. J., Yoon, C. H., O'Leary, D. D. M., & Flanagan, J. G. (1996). Topographically specific effects of ELF-1 on retinal axon guidance in-vitro and retinal axon mapping in-vivo. Cell, 86, 755–766.
Serafini, T., Kennedy, T. E., Galko, M. J., Mirzayan, C., Jessell, T. M., & Tessier-Lavigne, M. (1994). The netrins define a family of axon outgrowth-promoting
proteins homologous to C. elegans UNC-6. Cell, 78, 409–424.
Tessier-Lavigne, M., & Placzek, M. (1991). Target attraction—are developing axons guided by chemotropism? Trends Neurosci., 14, 303–310.
Tranquillo, R. T., & Lauffenburger, D. A. (1987). Stochastic-model of leukocyte chemosensory movement. J. Math. Biol., 25, 229–262.
Walter, J., Allsop, T. E., & Bonhoeffer, F. (1990). A common denominator of growth cone guidance and collapse? Trends Neurosci., 11, 447–452.
Zigmond, S. H. (1981). Consequences of chemotactic peptide receptor modulation for leukocyte orientation. J. Cell. Biol., 88, 644–647.

Received February 14, 1997; accepted July 7, 1997.
NOTE
Communicated by Read Montague
Equivalence of a Sprouting-and-Retraction Model and Correlation-Based Plasticity Models of Neural Development Kenneth D. Miller Departments of Physiology and Otolaryngology and Neuroscience Graduate Program, W. M. Keck Center for Integrative Neuroscience, Sloan Center for Theoretical Neurobiology at University of California, San Francisco, CA 94143-0444, U.S.A.
A simple model of correlation-based synaptic plasticity via axonal sprouting and retraction (Elliott, Howarth, & Shadbolt, 1996a) is shown to be equivalent to the class of correlation-based models (Miller, Keller, & Stryker, 1989), although these were formulated in terms of weight modification of anatomically fixed synapses. Both models maximize the same measure of synaptic correlation, subject to certain constraints on connectivity. Thus, the analyses of the correlation-based models suffice to characterize the behavior of the sprouting-and-retraction model. More detailed models are needed for theoretical distinctions to be drawn between plasticity via sprouting and retraction, weight modification, or a combination. The model of Elliott et al. involves stochastic search through allowed weight patterns for those that improve correlations. That of Miller et al. instead follows dynamical equations that determine continuous changes of the weights that improve correlations. The identity of these two approaches is shown to depend on the use of subtractive constraint enforcement in the models of Miller et al. More generally, to model the idea that neural development acts to maximize some measure of correlation subject to a constraint on the summed synaptic weight, the constraint must be enforced subtractively in a dynamical model.

1 Introduction

Models of activity-dependent, correlation-based mechanisms of neural development (Miller, 1990a, 1996a) have, for simplicity, typically used weight modifications of anatomically fixed synapses (but see von der Malsburg, 1979; Fraser & Perkel, 1990; Montague, Gally, & Edelman, 1991; Colbert, Fall, & Levy, 1994; Elliott, Howarth, & Shadbolt, 1996a, b). However, anatomical changes in connectivity—for example, via synaptic sprouting and retraction guided or stabilized by correlation-based rules—also may play important roles in activity-dependent development and learning. For example, retraction of axons and dendrites plays a role in many forms of activity-dependent development. In ocular dominance plasticity,
thalamocortical afferents withdraw from regions of cortex that come to be dominated by afferents serving the opposite eye (LeVay, Stryker, & Shatz, 1978). Dendrites of postsynaptic neurons also have been shown to avoid inappropriate ocular dominance columns (Katz & Constantine-Paton, 1988; Katz, Gilbert, & Wiesel, 1989; Kossel, Löwel, & Bolz, 1995). At the neuromuscular junction, activity-dependent competition normally leads to anatomical retraction of the axons of all but one input to each muscle fiber (Purves & Lichtman, 1985). However, retraction of an axon at the neuromuscular junction is preceded by significant loss of its physiological synaptic strength (Colman, Nabekura, & Lichtman, 1997). Thus, these regressive anatomical changes might simply follow physiological change and not play a leading or guiding role in activity-dependent development.

Stronger evidence for a central role for anatomical changes is found in optic tectum, where there is a continuous retinotopic reorganization of the visual map throughout life. This implies that both retinotopic and, when they occur, ocular dominance maps in this system must be continually maintained amid constant anatomical rearrangement of inputs (Constantine-Paton, Cline, & Debski, 1990; Debski, Cline, & Constantine-Paton, 1990).

Striking anatomical changes also occur during visual cortical development. During the normal development of ocular dominance columns in cat visual cortex, there is a huge increase in the overall number of synapses in visual cortex (Cragg, 1975). At the same time, the terminal arbors of thalamic afferents to visual cortex increase in density and branching complexity, though not in overall extent (Antonini & Stryker, 1993a). However, in both optic tectum and visual cortex, control of synapse number during development appears to be activity independent (Hayes & Meyer, 1989; Bourgeois, Jastreboff, & Rakic, 1989), suggesting that activity-dependent processes help determine which connections survive but do not influence the overall number of survivors.

Recent results suggest that activity-dependent processes can directly influence axonal sprouting in visual cortex, at least under abnormal conditions causing denervation of a cortical region. In kitten visual cortex, closure of one eye (monocular deprivation) for short times leads to dramatic loss of arborizations of the thalamocortical afferents corresponding to the closed eye (Antonini & Stryker, 1993b). Reverse deprivation (opening the closed eye and closing the open one) following similar periods of deprivation leads to a strong anatomical shift in favor of the originally closed eye (Movshon & Van Sluyters, 1981). Thus, thalamocortical afferents from the originally closed eye appear to show significant axonal sprouting after reverse deprivation, although this sprouting might occur only into regions that have already been denervated by the newly closed eye (Mioche & Singer, 1989). Similarly, following retinal lesions in adult cats, intracortical axonal sprouting appears to occur into the visual cortical region that formerly responded to the lesioned area (Darian-Smith & Gilbert, 1994).

A rapidly accumulating set of evidence suggests a possible role for
neurotrophins in activity-dependent synaptic plasticity in visual cortex and hippocampus (e.g., Cabelli, Wohn, & Shatz, 1995; Fiorentini, Berardi, & Maffei, 1995; Kang & Schuman, 1995; Korte et al., 1995; McAllister, Lo, & Katz, 1995; all reviewed in Thoenen, 1995, and Bonhoeffer, 1996). This also suggests a role for anatomical synaptic rearrangements, because in many systems neurotrophins play a strong role in influencing sprouting and retraction of axonal and dendritic branches (e.g., Purves & Lichtman, 1985; McAllister et al., 1995). However, it remains unclear whether neurotrophins play a specific instructional role in activity-dependent plasticity, as opposed to, say, a nonspecific role in regulating overall levels of sprouting. Furthermore, neurotrophins are also involved in weight modification of anatomically fixed synapses (Figurov, Pozzo-Miller, Olafsson, Wang, & Lu, 1996; Kang & Schuman, 1995; Korte et al., 1995). Finally, many other experiential modifications can lead to anatomical changes in synaptic connectivity in a variety of other neural structures (Weiler, Hawrylak, & Greenough, 1995).

Mechanisms of activity-dependent synaptic sprouting and retraction have not been explicitly included in previous correlation-based models for at least two reasons. First, formulation of the dynamics of such mechanisms has seemed forbidding in the absence of improved experimental characterization. Second, the study of anatomically fixed synapses seems potentially adequate to understand the behavior of more general correlation-based models. For example, if the range of retinotopically allowed axonal exploration in a correlation-based sprouting-and-retraction model is equivalent to the range of initial axonal connections in a similar model using fixed synapses, then both models would explore the same space of possible connections, and both should converge to the same "most-correlated" set of connections within that space.

This article first shows that there is a precise such equivalence between one very simple sprouting-and-retraction model—that of Elliott et al. (1996a)—and previously formulated models using fixed synapses (Miller, 1990a). Then it shows that the use of resource limitations or competitive constraints (Miller & MacKay, 1994) that are linear in the synaptic weights, such as weight normalization, in the sprouting model naturally corresponds to subtractive implementation of such constraints in the fixed-synapse models. More generally, subtractive implementation of linear constraints emerges as a natural result of an energy-minimization viewpoint, whereas it appeared quite arbitrary from the viewpoint of dynamical models.

2 Equivalence of a Sprouting-and-Retraction Model and a Fixed-Synapse Model

Elliott et al. (1996a) consider a sprouting-and-retraction model of ocular dominance development (see Figure 1A). A two-dimensional grid of cortical cells receives synapses from two two-dimensional input layers, one layer
Figure 1: Equivalence of the models of Elliott et al. and Miller et al. For simplicity, the models are illustrated for the case of only a single input type (e.g., a single eye). (A) Model of Elliott et al. Each synapse is given an individual label. Here, six synapses are shown from the central presynaptic cell to the three postsynaptic cells; these synapses are labeled from 1 to 6 (only the labels 1 and 6 are shown). The activity at each synapse is $\sigma_i \in \{-1, 1\}$, where $i$ is the synapse label. All synapses from the same input cell have the same activity; hence, in this case $\sigma_i$ is identical for $i = 1, \ldots, 6$. Synaptic connections are made and broken, within retinotopic limits described by an arbor function $A$, to maximize $\sum_{j,\, i \in N_j} \sigma_i \sigma_j$, where $N_j$ is the cortical neighborhood of synapse $j$, defined in the text. (B) Model of Miller et al. The presynaptic cells are labeled by an index $\alpha, \beta, \ldots$, and the postsynaptic cells are labeled by an index $x, y, \ldots$. There is a single variable, $S(x, \alpha)$, describing the total synaptic strength of the connection between cells $\alpha$ and $x$; this is equivalent to the number of synapses connecting the two cells in the model of Elliott et al. Thus, given the configuration shown in (A), the equivalent strengths in (B) in the model of Miller et al. would be $S(x, \alpha) = 2$, $S(y, \alpha) = 3$, $S(z, \alpha) = 1$. (C) The energy maximized by both models. The graphic shows the factors underlying the interaction energy between two weights, $S(x, \alpha)$ and $S(y, \beta)$. This energy is the product of the four factors shown: the two weights, the correlation between the two presynaptic cells $C(\alpha, \beta)$, and the intracortical influence between the two postsynaptic locations $I(x, y)$. Both models maximize an energy (the negative of equation 2.2, which is minimized) given by the sum over all weight pairs of this interaction energy.
serving each eye. The input layers represent cells in the lateral geniculate nucleus (LGN). Let each synapse from LGN to cortex be labeled by an index $i, j, \ldots$. Assume that all LGN inputs have two activity states, on or off. Let $\sigma_i$ be a variable with value 1 if the input to synapse $i$ is on and value $-1$ if that input is off. Note that multiple synapses come from the same input cell; the $\sigma$'s for all synapses from a single input are always identical. Let $N_j$ be the cortical neighborhood of synapse $j$, defined as the set of synapses on the same cortical cell as synapse $j$ or on any of the four adjacent cortical cells. Elliott et al. (1996a) posit that synaptic rearrangements occur so as to minimize the energy

$$E = -\frac{1}{2} \sum_{j,\, i \in N_j} \sigma_i \sigma_j,$$

that is, so as to maximize the correlations between the activities of synapses on the same and neighboring cortical cells. Arborizations are taken to be localized, so that synapses may be made only within a certain arborization radius $r_A$ of the input cell. Rearrangements occur subject to this and other constraints on connectivity. Input activity patterns are selected from some ensemble. For each input pattern, a small number of randomly chosen allowed rearrangements are tried. Most of the model results were obtained at zero temperature, meaning that only $E$-reducing rearrangements are accepted; a few results were obtained at finite temperature, meaning that $E$-increasing rearrangements were accepted probabilistically. This discussion focuses on the zero-temperature case.
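To make the search procedure concrete, here is a minimal sketch of the zero- and finite-temperature relocation dynamics in Python. It is not the authors' code: it uses a one-dimensional toy geometry and a generic random binary input ensemble rather than the two-dimensional grids and activity discs of Elliott et al., and all parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx = n_in = 12                      # 1-D toy cortex and input layer
r_A, T, n_syn = 2, 0.1, 40             # arbor radius, temperature; T = 0 gives the main case

a_of = rng.integers(0, n_in, n_syn)    # input cell of each labeled synapse
x_of = np.clip(a_of + rng.integers(-r_A, r_A + 1, n_syn), 0, n_ctx - 1)  # retinotopic start

def energy(x_of, sigma):
    # E = -1/2 * sum over pairs of synapses on the same or neighboring cortical cells
    nbr = np.abs(x_of[:, None] - x_of[None, :]) <= 1
    return -0.5 * np.sum(nbr * np.outer(sigma, sigma))

for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], n_in)[a_of]        # one activity per input cell,
                                                       # shared by all its synapses
    i = rng.integers(n_syn)                            # try relocating a random synapse
    lo, hi = max(0, a_of[i] - r_A), min(n_ctx - 1, a_of[i] + r_A)
    x_old, x_new = x_of[i], rng.integers(lo, hi + 1)   # allowed (within-arbor) target
    dE = -energy(x_of, sigma)
    x_of[i] = x_new
    dE += energy(x_of, sigma)
    # Zero temperature: accept only E-reducing moves. Finite temperature: accept
    # E-increasing moves with probability exp(-dE / T).
    if dE > 0 and (T == 0 or rng.uniform() >= np.exp(-dE / T)):
        x_of[i] = x_old                                # reject and undo the relocation
```

With T = 0, the acceptance rule reduces to greedy acceptance of energy-reducing relocations, the case on which the discussion here focuses.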
To see the equivalence to previous fixed-synapse models, let Greek letters $\alpha, \beta, \ldots$ represent retinotopic position in the input layers, and Roman letters $x, y, \ldots$ position in the cortical layer. Let $S^I(x, \alpha)$ represent the number of synapses from the LGN neuron at position $\alpha$ in input layer $I$ to the cortical cell at $x$. Let $\sigma^I(\alpha)$ be the activity ($\pm 1$) of the neuron at position $\alpha$ in input layer $I$. Define an intracortical function $I(x, y)$ to be 1 if $x = y$ or if $x$ and $y$ are cortical neighbors, and 0 otherwise. Then the energy $E = -\frac{1}{2}\sum_{j,\, i \in N_j} \sigma_i \sigma_j$ can be rewritten as

$$E = -\frac{1}{2} \sum_{x,y,\alpha,\beta,I,J} I(x, y)\, \big[ S^I(x, \alpha)\, \sigma^I(\alpha) \big] \big[ S^J(y, \beta)\, \sigma^J(\beta) \big]. \tag{2.1}$$
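The rewriting asserted by equation 2.1 can be checked numerically. The sketch below (a one-dimensional, single-input-type toy in which the four-neighbor coupling becomes nearest neighbors on a line; all names are illustrative) evaluates the energy once over individually labeled synapses and once from the synapse-count matrix $S$; the two values agree.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in = n_ctx = 5
n_syn = 20

x_i = rng.integers(0, n_ctx, n_syn)        # cortical cell of synapse i
a_i = rng.integers(0, n_in, n_syn)         # input cell of synapse i
sig_in = rng.choice([-1.0, 1.0], n_in)     # activity of each input cell
sigma = sig_in[a_i]                        # synapses inherit their input's activity

# I(x, y) = 1 for the same or a neighboring cortical cell (1-D analogue)
I = (np.abs(np.subtract.outer(np.arange(n_ctx), np.arange(n_ctx))) <= 1).astype(float)

# Synapse-label form: E = -1/2 * sum_{j, i in N_j} sigma_i sigma_j
E_labels = -0.5 * sum(sigma[i] * sigma[j]
                      for j in range(n_syn) for i in range(n_syn)
                      if I[x_i[i], x_i[j]])

# Matrix form of eq. 2.1: S(x, a) counts the synapses from input a to cortex x
S = np.zeros((n_ctx, n_in))
np.add.at(S, (x_i, a_i), 1.0)
E_matrix = -0.5 * np.einsum('xa,xy,yb,a,b->', S, I, S, sig_in, sig_in)

assert np.isclose(E_labels, E_matrix)      # identical up to floating-point error
```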
Assuming that the number of synaptic rearrangements per input pattern is small, so that LGN activity patterns may be averaged over, this energy can be rewritten as (see Figure 1C)

$$E = -\frac{1}{2} \sum_{x,y,\alpha,\beta,I,J} S^I(x, \alpha)\, I(x, y)\, C^{IJ}(\alpha, \beta)\, S^J(y, \beta), \tag{2.2}$$
where $C^{IJ}(\alpha, \beta) = \langle \sigma^I(\alpha)\, \sigma^J(\beta) \rangle$, the angle brackets signify an average over input patterns, and the notation $E$ has been retained for $\langle E \rangle$. Define the arbor function by $A(x, \alpha) = 1$ for $|x - \alpha| \le r_A$ and $A(x, \alpha) = 0$ otherwise. Then the localization of arborizations is taken into account by minimizing $E$ subject to $S^I(x, \alpha) = 0$ whenever $A(x, \alpha) = 0$.

The correlation-based models studied in Miller et al. (1989) and Miller (1990a, 1990b, 1994) begin from equations that describe (1) simple Hebbian or other correlation-based plasticity of anatomically fixed synapses and (2) cortical activity as a function of the input activity pattern. These equations are combined and averaged over the ensemble of input activation patterns, to arrive at equations for the development of synaptic connectivity of the form

$$\frac{d}{dt}\, S^I(x, \alpha) = A(x, \alpha) \sum_{y,\beta,J} I(x, y)\, C^{IJ}(\alpha, \beta)\, S^J(y, \beta). \tag{2.3}$$
Here, $S^I(x, \alpha)$ represents the total synaptic strength—the average synaptic strength times the number of synapses—from $\alpha$ in layer $I$ to $x$. $A(x, \alpha)$, the arbor function, describes the retinotopically allowed anatomical strength of connections from an input at $\alpha$ to a cortical cell at $x$. $I(x, y)$, the intracortical interaction function, describes the influence of synapses at cortical position $y$ on the growth of simultaneously active synapses at position $x$, via intracortical connectivity and/or diffusion of trophic or modulatory factors. The correlation functions $C^{IJ}(\alpha, \beta)$ describe the correlation in activity between an input at position $\beta$ in LGN layer $J$ and one at $\alpha$ in LGN layer $I$. Assuming that $I(x, y) = I(y, x)$ and $C^{IJ}(\alpha, \beta) = C^{IJ}(\beta, \alpha)$, equation 2.3 can easily be shown to represent gradient descent minimization of the energy $E$ of equation 2.2 in the variables $T^I(x, \alpha) = S^I(x, \alpha)/\sqrt{A(x, \alpha)}$ (see footnote 1). Again, this energy is minimized subject to $S^I(x, \alpha) = 0$ whenever $A(x, \alpha) = 0$.

Thus, an identical energy (equation 2.2; Figure 1C) is minimized by both the sprouting-and-retraction model of Elliott et al. (1996a) and the correlation-based models of Miller et al. (1989), which assume plasticity of anatomically fixed synapses. However, the energies are minimized subject to differing constraints, discussed in the appendix. The constraints imposed in the two models have a similar character, involving upper and/or lower limits on the strengths of individual connections and on the summed presynaptic and/or postsynaptic strength associated with each cell. The choices of constraints do not represent intrinsic differences between sprouting-and-retraction and fixed-synapse models; either type of model might be run with either set of constraints. However, a given set of constraints may be dynamically enforced in different, inequivalent ways (Miller & MacKay, 1994). I now address the form of enforcement needed to render a dynamical model equivalent to the model of Elliott et al.

1 Symmetrize equation 2.3 by the transformation $T^I(x, \alpha) = S^I(x, \alpha)/\sqrt{A(x, \alpha)}$ for $A(x, \alpha) \neq 0$; $T^I(x, \alpha) = 0$ otherwise (Miller, 1990a; Miller & Stryker, 1990; MacKay & Miller, 1990). Note that $S^I(x, \alpha) = 0$ whenever $A(x, \alpha) = 0$, so no information is lost by this transformation. With the stated assumptions on $C$ and $I$, the resulting equation for $T^I(x, \alpha)$ is symmetric under $(x, \alpha) \leftrightarrow (y, \beta)$ and performs gradient descent in the energy

$$E = -\frac{1}{2} \sum_{x,y,\alpha,\beta,I,J} T^I(x, \alpha)\, \sqrt{A(x, \alpha)}\; I(x, y)\, C^{IJ}(\alpha, \beta)\, \sqrt{A(y, \beta)}\; T^J(y, \beta).$$

Since $\sqrt{A(x, \alpha)}\, T^I(x, \alpha) = S^I(x, \alpha)$, this $E$ is identical to that of equation 2.2.
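As a numerical sanity check on this gradient-descent claim, the following sketch (one input type, one-dimensional positions, and symmetric positive-definite kernels chosen purely for illustration) integrates equation 2.3 with a simple Euler step and verifies that the energy of equation 2.2 decreases monotonically.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
pos = np.arange(n)

A = (np.abs(np.subtract.outer(pos, pos)) <= 2).astype(float)   # arbor function A(x, a)
I = np.exp(-0.5 * np.subtract.outer(pos, pos) ** 2)            # symmetric I(x, y)
C = np.exp(-np.abs(np.subtract.outer(pos, pos)))               # symmetric C(a, b)

S = A * rng.uniform(0.0, 0.1, (n, n))          # S(x, a), zero outside the arbor

def energy(S):
    # eq. 2.2 with a single input type: -1/2 * sum S(x,a) I(x,y) C(a,b) S(y,b)
    return -0.5 * np.einsum('xa,xy,ab,yb->', S, I, C, S)

dt, E_prev = 1e-3, energy(S)
for _ in range(200):
    S = S + dt * A * (I @ S @ C)               # Euler step of eq. 2.3 (unconstrained)
    E_now = energy(S)
    assert E_now <= E_prev                     # monotone descent in the energy of eq. 2.2
    E_prev = E_now
```

The descent is, strictly, gradient descent in the rescaled variables $T = S/\sqrt{A}$ of footnote 1, but since $A$ here only masks and rescales fixed positions, the energy still decreases along the $S$ trajectory.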
3 Constraint Enforcement and Energy Minimization

Constraints on connectivity are used in developmental models to incorporate several biological facts: that biological development is competitive, so that differences between the inputs' activities, rather than the amounts of their activities, determine their fate in synaptic competition (Wiesel & Hubel, 1965; Guillery, 1972; Lo & Poo, 1991; Balice-Gordon & Lichtman, 1995); that resources are limited; and that weights are single signed and finite. Because we know virtually nothing of the mechanisms by which biological competition is achieved, there is little biological guidance as to the form such constraints should take or the manner in which they should be implemented (see the discussion in Miller, 1990a, 1996b; Miller & MacKay, 1994).

In the statistical model of Elliott et al., connection patterns are searched for those that decrease the energy. The search consists of a sequence of small, random perturbations of synaptic connectivity. Constraints on connectivity are enforced by limiting the perturbations to those that obey the constraints. For example, to enforce constraints preserving the total synaptic strength projected by each afferent, and retinotopically restricting the cortical locations accessible to each afferent, perturbations consist of movement of a randomly chosen synapse to a new cortical location chosen randomly from among those retinotopically accessible to the corresponding afferent.

In the dynamical model of Miller et al., a deterministic time derivative of the weights is computed at each time step. The equations must be constrained to ensure that this time derivative respects the constraints on connectivity. There are many ways to achieve this, which are not equivalent either mathematically or in developmental outcome (Miller & MacKay, 1994). Intuitively, the space of allowed connectivity patterns forms a constraint surface in the space of weights. The unconstrained time derivative may point off of this surface, in which case it must be corrected by a return to the constraint surface. But to which point on the constraint surface should the weight pattern be moved? This freedom corresponds to the multiple, inequivalent ways of dynamically constraining the equation.

Equivalence to the model of Elliott et al. is achieved if constraints achieve energy minimization within the constrained space of possible weight patterns. Thus, given an unconstrained energy and a corresponding unconstrained gradient descent dynamics, the constraint implementation must achieve gradient descent in the energy on the constraint surface. Mathematically, this is achieved by perpendicular projection of the unconstrained gradient descent dynamics onto the constraint surface (see footnote 2).

In previous work (Miller & MacKay, 1994), two methods were studied of enforcing constraints limiting the sum of weights over a cell: subtractive and multiplicative. At each time step, after adding the unconstrained derivative to the weight vector, subtractive enforcement involves subtracting a fixed amount from each weight to return the weight vector to the constraint surface. Multiplicative enforcement involves multiplying each weight by a fixed amount, and thus subtracting an amount proportional to the weight itself.

2 Beginning from a point $x$, the local change in energy per unit movement in direction $\hat{v}$, with $\hat{v} \cdot \hat{v} = 1$, is given by $\nabla E(x) \cdot \hat{v}$, where $\nabla E(x)$ is the gradient of the energy at $x$. Thus, the direction of maximum decrease of the energy along the constraint surface is the direction with maximum dot product with the negative of the gradient vector. This is the direction found by perpendicular projection of the negative of the gradient vector—the derivative vector under gradient descent dynamics—onto the constraint surface.
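Concretely, the two schemes can be sketched as follows for a conserved-sum constraint (illustrative code, not from either paper). The subtractive correction shifts all weights by the same amount, which is a perpendicular projection onto the hyperplane of fixed weight sum; the multiplicative correction rescales, subtracting an amount proportional to each weight.

```python
import numpy as np

def subtractive(w, s_total):
    # Perpendicular projection onto the hyperplane sum(w) = s_total:
    # subtract the same amount from every weight.
    return w - (w.sum() - s_total) / w.size

def multiplicative(w, s_total):
    # Rescale so that sum(w) = s_total: subtract an amount proportional to w.
    return w * (s_total / w.sum())

rng = np.random.default_rng(3)
w0 = rng.uniform(0.4, 0.6, 8)
s_total = w0.sum()
w = w0 + 0.1 * rng.standard_normal(8)      # after an unconstrained Hebbian-style update

ws, wm = subtractive(w, s_total), multiplicative(w, s_total)
assert np.isclose(ws.sum(), s_total) and np.isclose(wm.sum(), s_total)

# The subtractive correction moves along the hyperplane's normal (1, 1, ..., 1),
# so it lands on the closest point of the constraint surface:
assert np.isclose(np.linalg.norm(ws - w), abs(w.sum() - s_total) / np.sqrt(w.size))
```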
In the case of a constraint on the sum of the synaptic weights, such as $\sum_{\alpha,I} S^I(x, \alpha) = S_{\text{post}}$—or, more generally, of a constraint on a linear combination of the synaptic weights, which yields a hyperplane constraint surface—perpendicular projection of the gradient vector corresponds to subtractive constraint enforcement (Miller & MacKay, 1994; see footnote 3). For a hypersphere surface, multiplicative enforcement is perpendicular projection. For other surfaces, perpendicular projection cannot be characterized as either multiplicative or subtractive.

If there are multiple constraints, one on each postsynaptic cell (or one on each presynaptic cell), perpendicular projection onto the full, multicell constraint surface is accomplished by perpendicular enforcement of each single-cell constraint (see footnote 4). These arguments apply also to restrictions to a constraint region bordered by hyperplanes, such as $S^{\min}_{\text{post}} \le \sum_{\alpha,I} S^I(x, \alpha) \le S^{\max}_{\text{post}}$. Such a constraint can be imposed dynamically by allowing unconstrained development on the interior of the region and subtractive enforcement of the constraint at each border. This will result in gradient descent minimization of the energy over the allowed region.

The case of simultaneous constraints on both pre- and postsynaptic cells is similar. Perpendicular constraint enforcement of the full set of constraints minimizes energy. Assuming the single-cell constraint surfaces are hyperplanes, perpendicular enforcement corresponds to subtractive enforcement.
3 There are some technicalities involving the use of the arbor function, discussed in more detail in Miller and MacKay (1992, App. B). Briefly: one must work in the symmetric $T$ representation discussed in footnote 1, in which the dynamics are gradient descent. There, the subtractive constraint is a perpendicular projection onto the constraint surface, although it generally does not appear perpendicular in the $S$ representation. In the $S$ representation, the constraint conserving $\sum_{\alpha,I} S^I(x, \alpha)$ is enforced through subtraction of a multiple of $A(x - \alpha)$ from $\frac{d}{dt} S^I(x, \alpha)$. Transforming to the $T$ representation, the constraint conserves $\sum_{\alpha,I} \sqrt{A(x - \alpha)}\, T^I(x, \alpha)$ and is enforced through subtraction of a multiple of $\sqrt{A(x - \alpha)}$. Thus, the constraint is perpendicular in the $T$ representation: it conserves $n(x) \cdot T(x)$ and is enforced through subtraction of a multiple of $n(x)$, where $n(x)$ has elements $\sqrt{A(x - \alpha)}$.

4 The full constraint surface is the intersection of the constraint surfaces associated with each cell. The normals of the single-cell constraint surfaces are all normal to the full constraint surface and to one another. The sequence of perpendicular projections onto each single-cell constraint surface results in a projection onto the full constraint surface along a linear combination of the single-cell normals, and any such linear combination is perpendicular to the full constraint surface. See Miller and MacKay (1992, App. C) for mathematical formulation of the projection operators that implement constraints on a network of cells.
However, the details of formulating perpendicular constraints in this case are somewhat more complicated. These are described briefly in Miller (1997).

Constraints on the strength of individual weights are also linear constraints. Thus, perpendicular projection onto the hyperplane bounding the constrained volume of allowed synaptic weights is achieved by subtracting off the nonallowed weight change from the gradient descent vector. This corresponds to a "hard nonlinearity": dynamics are linear within the constrained volume, but no weight changes are allowed that take a weight beyond its allowed limits. Single-synapse constraints fail to commute with constraints over summed pre- and/or postsynaptic weights. To avoid this conflict, the methods for formulating perpendicular constraints described in Miller (1997) can be used. In practice, more ad hoc methods are generally used to resolve the conflict. While this seems unlikely to alter significantly the developmental outcome from constrained energy minimization, this has not been studied.

In summary, perpendicular constraint enforcement achieves gradient descent on the multicell constraint surface. If constraints are linear in the synaptic strengths, this is achieved by subtractive enforcement. Subtractive enforcement thus realizes the intuitive picture that the biological dynamics are searching, via a correlation-based mechanism, for a "most correlated" set of inputs subject to such linear constraints. Such a search might occur by sprouting and retraction, synaptic strength modification, or some combination. While previous work (Miller & MacKay, 1994) demonstrated the differences in outcome resulting from, and in overall energies minimized by, subtractive versus multiplicative enforcement, the relationship to energy minimization or correlation maximization over the constraint surface was not previously noted.

4 Application of the Equivalence to Understand the Results of Elliott et al.

The main results of the model of Elliott et al. are to show that ocular dominance segregation will occur under their model, that the periodicity of the resulting columns corresponds to the arbor diameter, and that the degree of segregation increases with the distance over which inputs within an eye are correlated (see footnote 5). All of these results follow directly from prior quantitative analyses of the model of Miller et al. (which used subtractive constraint enforcement) (Miller et al., 1989; Miller, 1990a).

The occurrence of ocular dominance column formation was shown in the previous analysis to depend on two factors.

5 Again, this describes the zero-temperature version of the model. By studying finite temperature, Elliott et al. found evidence for a phase transition that separates a high-temperature (high noise in the weight modification process), disordered regime from a low-temperature, ordered regime. Finite-temperature results, of course, also apply to either model; one can construct a stochastic dynamics corresponding to the deterministic dynamics at finite temperature (e.g., van Kampen, 1992, Chap. 9).
First, the input activities must be such that, after averaging over activity patterns, two inputs of the same eye are always better correlated in their activities than two inputs of the opposite eye at the same retinotopic separation, at least for small retinotopic separations; and two same-eye inputs must be no worse correlated than two opposite-eye inputs for larger retinotopic separations up to an arbor radius (see footnote 6). This condition is easily met by the activity patterns used by Elliott et al.: a circle of activated cells within a single eye, centered at a random position, with all other cells inactive. Second, there must be some locally excitatory interactions between cortical cells, so that the ocular dominance of neighboring cells is coupled. This is achieved by the neighborhood function of Elliott et al., which favors the development of correlated receptive fields on neighboring cortical cells.

The spatial period of the resulting ocular dominance columns was shown previously to depend on the intracortical interaction function and the constraints, if any, on the total strength of projection of each input cell. The intracortical interaction used by Elliott et al. is purely excitatory, and one of their constraints is that the total projection from each presynaptic cell is held constant. For this case, the period of ocular dominance columns was previously shown to correspond to an arbor diameter, as found empirically by Elliott et al.

The degree of segregation was shown in the previous analysis to depend on the correlations in input activities. Two factors are involved. First, when the same-eye correlations are larger than the opposite-eye correlations only over distances small relative to an arbor radius, segregation is weak; but as the distance over which the same-eye correlations are larger than the opposite-eye correlations becomes larger, segregation becomes stronger. Second, anticorrelations between the two eyes increase the degree of segregation, and positive correlations between the two eyes reduce it. Elliott et al. find that increasing the radius of the circle of activated cells used in their activity patterns leads to sharper segregation, and for larger arbor radius, larger activity circles are needed to get a similar degree of segregation. Furthermore, they consider two models, one of which has anticorrelations between the eyes, the other of which has either reduced anticorrelations or positive correlations between the eyes, and find that the latter model requires much wider-ranging within-eye correlations (larger activity discs for a given arbor width) to achieve a similar degree of segregation, relative to the former model (see footnote 7). These results are all as predicted by the previous analysis.

6 Mathematically: the assumption is made of symmetry between the eyes, so that $C^{LL} = C^{RR}$ and $C^{LR} = C^{RL}$. Then the condition is that $C^D(\alpha) \equiv C^{LL}(\alpha) - C^{LR}(\alpha)$ has the peak of its Fourier transform at frequency 0 or at a frequency that corresponds to a wavelength long relative to the arbor diameter.

7 In the "relocation" model of Elliott et al., all synaptic changes occur by choosing an active synapse and relocating it; the change is accepted if it decreases the energy (in the zero-temperature version). Because all changes are conditional on a synapse's being active, the correlations $C^{IJ}(\alpha, \beta) = \langle \sigma(\alpha)\, \sigma(\beta) \rangle$ must also be computed conditional on this. Since the opposite eye is always uniformly inactive, the two eyes are completely anticorrelated: $C^{LR}(\alpha, \beta) \equiv -1$ for all $\alpha, \beta$. In their "interchange" model, synaptic changes occur by interchanging two synapses, one active and one inactive; again, the change is accepted if it decreases the energy. Because the changes involve an inactive as well as an active synapse, the correlations of an inactive synapse with the inactive neurons in the opposite eye must also be considered in computing the correlations, so the anticorrelations between the eyes will be reduced and/or positive correlations between the eyes may be induced, depending on the sizes of the activity circle and the arbors.
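The correlations induced by such disc-shaped patterns are easy to estimate by simulation. The sketch below is a one-dimensional toy (a ring per eye, with the activity disc reduced to an interval, and all sizes illustrative); it averages $\sigma\sigma$ products over patterns in which a disc of cells in one randomly chosen eye is active and all other cells are inactive.

```python
import numpy as np

rng = np.random.default_rng(4)
n, radius, n_pat = 32, 3, 20000        # cells per eye, disc radius, number of patterns

acc = np.zeros((2 * n, 2 * n))         # indices 0..n-1: left eye; n..2n-1: right eye
for _ in range(n_pat):
    sigma = -np.ones(2 * n)            # everything off (-1) ...
    eye = rng.integers(0, 2)
    center = rng.integers(0, n)
    idx = (np.arange(center - radius, center + radius + 1) % n) + eye * n
    sigma[idx] = 1.0                   # ... except a disc in one randomly chosen eye
    acc += np.outer(sigma, sigma)

C = acc / n_pat
C_LL, C_LR = C[:n, :n], C[:n, n:]      # same-eye and opposite-eye blocks

d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
sep = np.minimum(d, n - d)             # separation on the ring
for s in range(4):
    print(s, C_LL[sep == s].mean(), C_LR[sep == s].mean())
```

In this toy estimate, same-eye correlations exceed opposite-eye correlations at small separations, which is the first condition discussed above.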
A quantitative account could be made by computing the precise within- and between-eye correlations determined by their activity patterns and sampling procedures.

Elliott et al. also model monocular deprivation. Because they conserve the total strength projected by each presynaptic arbor, they must model deprivation by hand-setting the strength of each eye's arbors. Thus, their model deals only with the arrangements of these inputs, given a fixed difference in input strength between the eyes. As found previously in the model of Miller et al., the period of the ocular dominance stripes is unchanged, but one eye's stripes become thinner and the other's thicker. In the studies of Miller et al., the final strength of each eye's projection, as well as the stripe layout, emerged dynamically.

5 Discussion

I have shown that a simple, stochastic, correlation-based sprouting-and-retraction model of synaptic development can be understood within the framework of correlation-based developmental models (Miller, 1990a, 1996a), although those models were formulated in terms of modification of anatomically fixed synapses. The various models all descend in an identical energy, corresponding to maximizing a certain measure of synaptic correlations. The analyses of the correlation-based framework accurately predict the outcomes of the sprouting-and-retraction model. The sprouting-and-retraction model emphasizes energy minimization or, equivalently, correlation maximization. This perspective in turn provides a principled reason for the use of subtractive enforcement of linear constraints on connectivity (or for multiplicative enforcement of quadratic constraints). Thus, the hypothesis in the introduction has been confirmed and sharpened. Models of sprouting and retraction and of modification of anatomically fixed synapses can both explore the same space of possible connections and converge to the same "most-correlated" set of connections within that space. This occurs provided both maximize the same measure of correlation over the constrained set of allowed weight patterns.

The correlation-based framework was developed to allow analysis of the outcome of synaptic competition under a wide group of mechanisms.
Development in this framework depends on biologically measurable functions that describe correlations in the thalamic inputs, retinotopic restrictions on arborizations, and intracortical interactions. The dependence on these functions has been characterized through both analysis and simulations (Miller et al., 1989; Miller, 1990a, 1994, 1996a). Different biological mechanisms are distinguished by the different predictions they make for the shape of these functions. For example, the model of Elliott et al. is restricted to positive intracortical interaction functions.

Previously, such correlation-based models were shown to embrace very simple models of a variety of underlying mechanisms, including activity-dependent competition for diffusible modification factors (e.g., neurotrophins) by anatomically fixed synapses (Miller et al., 1989; Miller, 1990a). Sprouting and retraction mechanisms were not previously included because it seemed likely that density-dependent and diffusive terms may arise in simplified models of their dynamics, for example, if there is a tendency for sprouting to occur from more-occupied areas into less-occupied areas; and no such terms were included in the framework studied (see footnote 8). Such terms do not arise in the model of Elliott et al. because of their assumption that all synaptic changes allowed by the constraints (see the appendix) and by the relocation or interchange rule (the effects of which can be incorporated in the correlation function; see footnote 7) are tried with equal probability, and are accepted or rejected according to their contribution to the energy. That is, synaptic densities and gradients do not affect sprouting-and-retraction probabilities other than through the energy and the constraints.

One potential problem with the fixed-synapse formalism is that it contains an initial bias in favor of weight patterns that involve uniform connectivity: if weights are initialized as small perturbations around some anatomical arbor function, then a weight pattern similar to the arbor function initially has a much larger size than weight patterns orthogonal to it. This can lead to a bias in favor of the development of such patterns, even if such patterns have less favorable energy than others (e.g., see the discussion of the effect of the DC mode's "head start" in MacKay & Miller, 1990). Although the zero-temperature model of Elliott et al. does not necessarily avoid this problem, a sprouting-and-retraction framework more generally may provide a natural means of avoiding such bias.

8 The probability for an input from $\alpha$ of type $I$ to sprout to $x$ may increase with its local innervation density, $\sum_{x'} f(|x - x'|)\, S^I(x', \alpha)$ (here, $f(x)$ is some weighting function); it may decrease with the total innervation density at $x$, $\sum_{J,\alpha'} S^J(x, \alpha')$, or with the gradient of this density, $\nabla_x \sum_{J,\alpha'} S^J(x, \alpha')$ (if areas that are less innervated, either absolutely or relative to neighboring areas, are more likely to receive sprouts; note, however, that if total synaptic strength is conserved on each postsynaptic cell, this density is constant, and its gradient is zero). The probability might also depend on the input's total projection strength, $\sum_{x'} S^I(x', \alpha)$, or on the gradient of this total, $\nabla_\alpha \sum_{x'} S^I(x', \alpha)$ (e.g., smaller arbors, either in absolute terms or relative to neighboring arbors, might have a better chance of making new sprouts).
The addition of noise, as in the high-temperature version of the Elliott et al. model, may also be of benefit.

It is often stated that Hebbian rules result in a set of weights corresponding to the principal component of the input data: the principal eigenvector of the input covariance matrix. This is based on the simple case of a linear covariance plasticity rule for a single, linearly activated postsynaptic cell. For this case, multiplicative constraint enforcement does lead to the principal component (e.g., Miller & MacKay, 1994), which in this case minimizes the energy of equation 2.2 over any hypersphere in weight space. However, even for this case, on the hyperplane corresponding to a linear weight constraint, the principal eigenvector is not an energy minimum, and in fact it is not in general a "special" point of any kind for the energy. That is, if Hebbian development with linear weight constraints acts to maximize correlations, then even in this simple case, the dynamics do not evolve to the principal component of the data. More complex plasticity or activity rules or network connectivity also ensure that the dynamics do not evolve to the principal component. It is therefore quite incorrect to equate Hebbian development with learning of the principal component.
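The point is easy to verify numerically. The following sketch (an illustrative construction, not taken from the papers) runs a linear Hebbian update for a single linear unit under both schemes: renormalizing the Euclidean norm (a hypersphere constraint) converges to the principal eigenvector, whereas subtractive enforcement of a fixed weight sum (a hyperplane constraint) generally does not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6
B = rng.standard_normal((n, n))
C = B @ B.T                               # a random input covariance matrix
pc = np.linalg.eigh(C)[1][:, -1]          # its principal eigenvector

def run(enforce, steps=3000, dt=1e-3):
    w = rng.uniform(0.1, 1.0, n)
    for _ in range(steps):
        w = enforce(w + dt * C @ w)       # linear Hebbian step, then constraint
    return w

w_mult = run(lambda w: w / np.linalg.norm(w))     # hypersphere: multiplicative
w_subt = run(lambda w: w - (w.sum() - 1.0) / n)   # hyperplane: subtractive

overlap = lambda u, v: abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(overlap(w_mult, pc))    # ~1.0: converges to the principal component
print(overlap(w_subt, pc))    # generally < 1: settles elsewhere on the hyperplane
```

Note that the subtractive run here uses no individual weight limits, purely to isolate the direction of growth; it is an illustration of the geometric point, not a model of development.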
In previous work (Miller et al., 1989; Miller & MacKay, 1994), it was noted that ocular dominance segregation can emerge under subtractive constraint enforcement even in the presence of partial correlation of the activities of the two eyes (as is presumably induced by vision). Under multiplicative enforcement, in contrast, ocular dominance segregation cannot emerge unless there is anticorrelation between the two eyes. Thus, subtractive enforcement seems to give a better match to the biology, at least in this respect. By connecting subtractive enforcement to correlation maximization over the allowed weight patterns, the work described in this article gives a natural biological grounding to a constraint enforcement method that also seems favored (at least relative to multiplicative enforcement) by the match of developmental outcome to biology.

Because subtractive enforcement can lead weights to saturate at their most extreme allowed values, it has been suggested (Miller & MacKay, 1994) that it may be a poor choice for models of adult plasticity, where continuous plasticity occurs in response to changing input activity distributions (Kaas, 1991). However, we demonstrate in Miller (1997) that this tendency to saturation can be eliminated by combining pre- and postsynaptic constraints. Elliott et al. (1996b) have found that their model, which does not show such saturation (discussed in the appendix), can account for some aspects of adult plasticity. Thus, these aspects can be accounted for by subtractively constrained dynamics. However, in their model, the number of synapses projected by different input classes in response to a given input activity regime is set by hand, and only the arrangement of these synapses is determined by the learning rule. Thus, it remains an open question under what conditions the energy-minimizing models with linear weight constraints discussed here can account for the reversible changes in projection strength that occur with changes in input activities in adult cortical plasticity. Presumably the key requirement is that the perturbations of activity sufficiently shift the energy landscape on the constraint surface so that the locations of the minima are shifted.

Do biological correlation-based mechanisms rely more on sprouting and stabilization-retraction of synapses of fixed strength or on modification of synaptic strengths? There is little evidence to guide us. The two processes are obviously not mutually exclusive; for example, the decision whether to stabilize or retract newly sprouted synapses may be guided by the strengthening or weakening of their physiological strengths. In kitten visual cortex, the rapidity of monocular deprivation effects, which emerge physiologically within hours (Mioche & Singer, 1989), weakly suggests that physiological changes in synaptic efficacy may precede (and thus perhaps guide) anatomical restructuring of connections. This is also suggested by recent results showing that anatomical changes in thalamocortical afferent arbors are incomplete after 4 days of monocular deprivation (Antonini & Stryker, 1996), whereas physiological effects of monocular deprivation appear complete after just 2 days (Hensch et al., 1995). Similarly, at the neuromuscular junction, anatomical elimination of a synapse is preceded by loss of its synaptic efficacy (Colman et al., 1997), including loss of postsynaptic receptors (Balice-Gordon & Lichtman, 1993, 1995). That weight modification and axonal sprouting and retraction may be coupled is suggested by studies in several systems showing that a decrease in the number of synapses onto a postsynaptic cell is associated with an increase in the strength of the remaining synapses (Herrera & Grinnell, 1980, 1981; Jackson & Parks, 1982; Pockett & Slack, 1982; Liu & Tsien, 1995).

Can modeling help us to distinguish between mechanisms relying on sprouting and retraction, and those relying on modification of synaptic strengths? To the extent to which the two are mathematically equivalent, of course, no theoretical distinctions can be drawn. It is conceivable that practical experimental distinctions could be drawn from simple models to which density-dependent and diffusive terms are added, as described above. However, to draw firm theoretical distinctions, more realistic knowledge of the rules governing synaptic sprouting, retraction, and modification is needed. It remains to be determined how such knowledge will modify the basic understandings of correlation-based development already achieved through study of the simple models discussed here.

Appendix: Constraints in the Two Models

The energy $E$ of equation 2.2 is minimized in the two models subject to the following constraints:

1. Localization of connectivity: $S^I(x, \alpha) = 0$ whenever $A(x, \alpha) = 0$, in both models.
2. Upper and lower limits on connection strength
   - Elliott et al.: $0 \le S^I(x, \alpha)$ for all $I, x, \alpha$. No explicit upper limit.
   - Miller et al.: $0 \le S^I(x, \alpha) \le s_{\max}\, A(x, \alpha)$ for all $I, x, \alpha$, where $s_{\max}$ represents the maximum value of a synaptic weight.

3. Presynaptic total connection strength
   - Elliott et al.: The total number of synapses per presynaptic neuron is constant and conserved: $\sum_x S^I(x, \alpha) = S_{\text{pre}}$ for all $I, \alpha$ and for some constant $S_{\text{pre}}$. To model monocular deprivation, this constant is decreased for inputs from one eye.
   - Miller et al.: The effects of imposing a limitation on the sum over presynaptic weights, such as $\sum_x S^I(x, \alpha) = S_{\text{pre}}$ or $0.5\, S_{\text{pre}} \le \sum_x S^I(x, \alpha) \le 1.5\, S_{\text{pre}}$, are considered, but such a limitation is usually not used.

4. Postsynaptic total connection strength
   - Elliott et al.: The total number of synapses per postsynaptic neuron is constrained to remain within some range: $S^{\min}_{\text{post}} \le \sum_{\alpha,I} S^I(x, \alpha) \le S^{\max}_{\text{post}}$ for all $x$, where $S^{\min}_{\text{post}}$ (generally 1) and $S^{\max}_{\text{post}}$ are constants. A model in which the number of synapses per cortical cell is constant and conserved—that is, in which $\sum_{\alpha,I} S^I(x, \alpha) = S_{\text{post}}$ for all $x$—is also considered.
   - Miller et al.: The total synaptic strength per postsynaptic neuron is constant and conserved: $\sum_{\alpha,I} S^I(x, \alpha) = S_{\text{post}}$ for all $x$ and for some constant $S_{\text{post}}$.
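For concreteness, the two constraint sets can be written as executable predicates over a connectivity matrix (a sketch with illustrative names and packaging; neither paper presents such code, and the tolerances are arbitrary):

```python
import numpy as np

def miller_constraints_ok(S, A, s_max, S_post, tol=1e-9):
    """Constraints of the fixed-synapse model on a weight matrix S[x, a]."""
    local = np.all(S[A == 0] == 0)                         # 1. localization
    limits = np.all((S >= -tol) & (S <= s_max * A + tol))  # 2. 0 <= S <= s_max * A
    post = np.allclose(S.sum(axis=1), S_post)              # 4. conserved postsynaptic sum
    return local and limits and post                       # (3. presynaptic: usually unused)

def elliott_constraints_ok(S, A, S_pre, post_min, post_max, tol=1e-9):
    """Constraints of the sprouting model on a synapse-count matrix S[x, a]."""
    local = np.all(S[A == 0] == 0)                         # 1. localization
    lower = np.all(S >= -tol)                              # 2. counts are nonnegative
    pre = np.allclose(S.sum(axis=0), S_pre)                # 3. conserved presynaptic sum
    post = S.sum(axis=1)                                   # 4. postsynaptic sum in range
    in_range = np.all((post >= post_min - tol) & (post <= post_max + tol))
    return local and lower and pre and in_range
```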
The constraints imposed in the two models have a similar character. One notable difference is the presence of an upper limit on the strength of individual connections in the model of Miller et al., and the absence of such a limit in the model of Elliott et al. In Miller and MacKay (1994), it was shown that an upper limit on the strength of individual connections is generally needed under subtractive constraints to attain distributed receptive and/or projective fields (see the discussion in Miller, 1997). Yet Elliott et al., in the absence of an explicit upper weight limit, find apparently stable weight configurations with distributed receptive and projective fields. There seem to be at least two explanations. First, as discussed in Miller (1997), the combination of pre- and postsynaptic constraints can eliminate the need for an upper weight limit. Second, the particular intracortical interaction used by Elliott et al. may play a role. The self-interaction $L_{ii}$ of a connection $T_i \equiv T^I(x, \alpha)$ (see footnote 1) with itself under the gradient descent dynamics is $L_{ii} = \sqrt{A(x, \alpha)}\, I(x, x)\, C^{II}(\alpha, \alpha)\, \sqrt{A(x, \alpha)}$. The interaction $L_{ij}$ between two different connections from the same input cell, $T_i \equiv T^I(x, \alpha)$ and $T_j \equiv T^I(y, \alpha)$, is $L_{ij} = \sqrt{A(x, \alpha)}\, I(x, y)\, C^{II}(\alpha, \alpha)\, \sqrt{A(y, \alpha)}$. Given constraints on only presynaptic cells, an upper weight limit is required if $L_{ii} > |L_{ij}|$ for all $j \neq i$, where $j$ and $i$ are labels for two weights from a single presynaptic cell (Miller & MacKay, 1994). This condition is not satisfied in the model of Elliott et al. because $I(x, y) = 1$ both for $y = x$ and for $y$ a neighbor of $x$, and $A$ is constant where it is nonzero. Thus, a given input may equally well distribute its weights among a cell and its four nearest neighbors, or concentrate its inputs onto the central cell, without altering the interactions between its connections. Thus, even if only presynaptic constraints are applied, the model of Elliott et al. may not require an upper weight limit to achieve distributed receptive and projective fields.

Acknowledgments

I thank Terry Elliott for stimulating me to think about these issues and for useful discussions, and Michael Crair, Virginia De Sa, Ed Erwin, David MacKay, Michael Silver, and Todd Troyer for helpful comments on the manuscript. This work is supported by NIH grant EY11001 and grants from the Searle Scholar's Program and the Lucille P. Markey Charitable Trust.

References

Antonini, A., & Stryker, M. P. (1993a). Development of individual geniculocortical arbors in cat striate cortex and effects of binocular impulse blockade. J. Neurosci., 13, 3549–3573. Antonini, A., & Stryker, M. P. (1993b). Rapid remodeling of axonal arbors in the visual cortex. Science, 260, 1819–1821. Antonini, A., & Stryker, M. P. (1996). Plasticity of geniculocortical afferents following brief or prolonged monocular occlusion in the cat. J. Comp. Neurol., 369, 64–82. Balice-Gordon, R. J., & Lichtman, J. W. (1993). In vivo observations of pre- and postsynaptic changes during the transition from multiple to single innervation at developing neuromuscular junctions. J. Neurosci., 13, 834–855. Balice-Gordon, R. J., & Lichtman, J. W. (1995). Long-term synapse loss induced by focal blockade of postsynaptic receptors. Nature, 372, 519–524. Bonhoeffer, T. (1996). Neurotrophins and activity-dependent development of the neocortex. Curr. Opin. Neurobiol., 6, 119–126. Bourgeois, J.-P., Jastreboff, P. J., & Rakic, P. (1989). Synaptogenesis in visual cortex of normal and preterm monkeys: Evidence for intrinsic regulation of synaptic overproduction. Proc. Natl. Acad. Sci. USA, 86, 4297–4301. Cabelli, R. J., Hohn, A., & Shatz, C. J. (1995). Inhibition of ocular dominance column formation by infusion of NT-4/5 or BDNF. Science, 267, 1662–1666. Colbert, C. M., Fall, C. P., & Levy, W. B. (1994). Using adaptive synaptogenesis to model the development of ocular dominance in kitten visual cortex. In F. H. Eeckman (Ed.), Computation in neurons and neural systems (pp. 139–144). Boston: Kluwer. Colman, H., Nabekura, J., & Lichtman, J. W. (1997). Alterations in synaptic
strength preceding axon withdrawal. Science, 275, 356–361. Constantine-Paton, M., Cline, H. T., & Debski, E. (1990). Patterned activity, synaptic convergence and the NMDA receptor in developing visual pathways. Ann. Rev. Neurosci., 13, 129–154. Cragg, B. G. (1975). The development of synapses in the visual system of the cat. J. Comp. Neurol., 160, 147–166. Darian-Smith, C., & Gilbert, C. D. (1994). Axonal sprouting accompanies functional reorganization in adult cat striate cortex. Nature, 368, 737–740. Debski, E. A., Cline, H. T., & Constantine-Paton, M. (1990). Activity-dependent tuning and the NMDA receptor. J. Neurobiol., 21, 18–32. Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1996a). Axonal processes and neural plasticity. I: Ocular dominance columns. Cerebral Cortex, 6, 781–788. Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1996b). Axonal processes and neural plasticity. II: Adult somatosensory maps. Cerebral Cortex, 6, 789–793. Figurov, A., Pozzo-Miller, L. D., Olafsson, P., Wang, T., & Lu, B. (1996). Regulation of synaptic responses to high-frequency stimulation and LTP by neurotrophins in the hippocampus. Nature, 381, 706–709. Fiorentini, A., Berardi, N., & Maffei, L. (1995). Nerve growth factor preserves behavioral visual acuity in monocularly deprived kittens. Vis. Neurosci., 12, 51–55. Fraser, S. E., & Perkel, D. H. (1990). Competitive and positional cues in the patterning of nerve connections. J. Neurobiol., 21, 51–72. Guillery, R. W. (1972). Binocular competition in the control of geniculate cell growth. J. Comp. Neurol., 144, 117–130. Hayes, W. P., & Meyer, R. L. (1989). Impulse blockade by intraocular tetrodotoxin during optic regeneration in goldfish: HRP-EM evidence that the formation of normal numbers of optic synapses and the elimination of exuberant optic fibers is activity independent. J. Neurosci., 9, 1414–1423. Hensch, T. K., Crair, M. C., Ruthazer, E. S., Fagiolini, M., Gillespie, D. C., & Stryker, M. P. (1995). Robust two-day ocular dominance plasticity revealed by single-unit recording and intrinsic signal imaging of kitten area 17. Soc. Neuro. Abs., 21, 2023. Herrera, A., & Grinnell, A. (1980). Transmitter release from frog motor nerve terminals depends on motor unit size. Nature, 287, 649–651. Herrera, A., & Grinnell, A. (1981). Contralateral denervation causes enhanced transmitter release from frog motor nerve terminals. Nature, 291, 495–497. Jackson, H., & Parks, T. (1982). Functional synapse elimination in the developing avian cochlear nucleus with simultaneous reduction in cochlear nerve axon branching. J. Neurosci., 2, 1736–1743. Kaas, J. H. (1991). Plasticity of sensory and motor maps in adult mammals. Ann. Rev. Neurosci., 14, 137–167. Kang, H., & Schuman, E. M. (1995). Long-lasting neurotrophin-induced enhancement of synaptic transmission in the adult hippocampus. Science, 267, 1658–1662. Katz, L. C., & Constantine-Paton, M. (1988). Relationships between segregated afferents and postsynaptic neurons in the optic tectum of three-eyed frogs. J. Neurosci., 8, 3160–3180.
Katz, L. C., Gilbert, C. D., & Wiesel, T. N. (1989). Local circuits and ocular dominance columns in monkey striate cortex. J. Neurosci., 9, 1389–1399. Korte, M., Carroll, P., Wolf, E., Brem, G., Thoenen, H., & Bonhoeffer, T. (1995). Hippocampal long-term potentiation is impaired in mice lacking brain-derived neurotrophic factor. Proc. Natl. Acad. Sci. USA, 92, 8856–8860. Kossel, A., Löwel, S., & Bolz, J. (1995). Relationships between dendritic fields and functional architecture in striate cortex of normal and visually deprived cats. J. Neurosci., 15, 3913–3926. LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat's visual cortex: A quantitative study. J. Comp. Neurol., 179, 223–244. Liu, G., & Tsien, R. W. (1995). Properties of synaptic transmission at single hippocampal synaptic boutons. Nature, 375, 404–408. Lo, Y.-J., & Poo, M.-M. (1991). Activity-dependent synaptic competition in vitro: Heterosynaptic suppression of developing synapses. Science, 254, 1019–1022. MacKay, D. J. C., & Miller, K. D. (1990). Analysis of Linsker's applications of Hebbian rules to linear networks. Network, 1, 257–298. McAllister, A. K., Lo, D. C., & Katz, L. C. (1995). Neurotrophins regulate dendritic growth in developing visual cortex. Neuron, 15, 791–803. Miller, K. D. (1990a). Correlation-based models of neural development. In M. A. Gluck & D. E. Rumelhart (Eds.), Neuroscience and connectionist theory (pp. 267–353). Hillsdale, NJ: Erlbaum. Miller, K. D. (1990b). Derivation of linear Hebbian equations from a nonlinear Hebbian model of synaptic plasticity. Neural Comput., 2, 321–333. Miller, K. D. (1994). A model for the development of simple cell receptive fields and the ordered arrangement of orientation columns through activity-dependent competition between ON- and OFF-center inputs. J. Neurosci., 14, 409–441. Miller, K. D. (1996a). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 55–78). New York: Springer-Verlag. Available as ftp://ftp.keck.ucsf.edu/pub/ken/miller95.ps. Miller, K. D. (1996b). Synaptic economics: Competition and cooperation in synaptic plasticity. Neuron, 17, 371–374. Miller, K. D. (1997). Simultaneous constraints on pre- and post-synaptic cells: Effects on synaptic saturation [On-line]. Available as ftp://ftp.keck.ucsf.edu/pub/ken/pre post.ps.gz. Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615. Miller, K. D., & MacKay, D. J. C. (1992). The role of constraints in Hebbian learning (Tech. Rep. Memo 19). Pasadena, CA: California Institute of Technology. Available as ftp://ftp.keck.ucsf.edu/pub/ken/constraints-tr.ps. Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comput., 6, 100–126. Miller, K. D., & Stryker, M. P. (1990). The development of ocular dominance columns: Mechanisms and models. In S. J. Hanson & C. R. Olson (Eds.), Connectionist modeling and brain function: The developing interface (pp. 255–350).
Cambridge, MA: MIT Press/Bradford. Mioche, L., & Singer, W. (1989). Chronic recordings from single sites of kitten striate cortex during experience-dependent modifications of receptive-field properties. J. Neurophysiology, 62, 185–197. Montague, P. R., Gally, J. A., & Edelman, G. M. (1991). Spatial signaling in the development and function of neural connections. Cereb. Cortex, 1, 199–220. Movshon, J. A., & Van Sluyters, R. C. (1981). Visual neural development. Ann. Rev. Psych., 32, 477–522. Pockett, S., & Slack, J. (1982). Pruning of axonal trees results in increased efficacy of surviving nerve terminals. Brain Res., 243, 350–353. Purves, D., & Lichtman, J. W. (1985). Principles of neural development. Sunderland, MA: Sinauer Associates. Thoenen, H. (1995). Neurotrophins and neuronal plasticity. Science, 270, 593–598. van Kampen, N. G. (1992). Stochastic processes in physics and chemistry (2nd Ed.). Amsterdam: North-Holland. von der Malsburg, C. (1979). Development of ocularity domains and growth behavior of axon terminals. Biol. Cyber., 32, 49–62. Weiler, I. J., Hawrylak, N., & Greenough, W. T. (1995). Morphogenesis in memory formation: synaptic and cellular mechanisms. Behav. Brain Res., 66, 1–6. Wiesel, T. N., & Hubel, D. H. (1965). Comparison of the effects of unilateral and bilateral eye closure on cortical unit responses in kittens. J. Neurophysiol., 28, 1029–1040. Received January 31, 1996; accepted June 12, 1997.
NOTE
Communicated by Read Montague
Axonal Processes and Neural Plasticity: A Reply T. Elliott C. I. Howarth N. R. Shadbolt Department of Psychology, University of Nottingham, Nottingham, NG7 2RD, U.K.
We examine the claim that a class of sprouting-and-retraction models is mathematically equivalent to a fixed-anatomy model. We accept, subject to important caveats, a narrow mathematical equivalence of the energy functions in both classes of model. We argue that this narrow equivalence of energy functions does not, however, entail equivalence of the models. Indeed, the claim of complete model equivalence hides significant dynamical differences between the approaches, which we discuss. We also disagree that our work demonstrates that subtractive constraint enforcement is natural in fixed-anatomy models.

Miller (see "Equivalence of a Sprouting-and-Retraction Model," elsewhere in this issue) has claimed (1) that a sprouting-and-retraction framework for modeling neural plasticity (Elliott, Howarth, & Shadbolt, 1996a, 1996b) is mathematically equivalent to a fixed-anatomy model (Miller, Keller, & Stryker, 1989); (2) that this equivalence shows that subtractive normalization is natural in fixed-anatomy models; and (3) that no theoretical distinction can be drawn between sprouting-and-retraction models and fixed-anatomy models until more elaborate models are available. We briefly examine these claims.

First, modulo important caveats, we agree that our framework minimizes, subject to similar constraints, the same energy function as Miller's model. The caveats are that (1) the biological interpretation of our constraints is different, (2) our constraints do not enforce competition, and (3) competition emerges because of the range of the variables $\sigma_i$ used in our framework, where letters such as $i$ and $j$ index axonal processes and $\sigma_i \in \{-1, 1\}$ represents the activity of axonal process $i$. The first caveat is important because there is little biological support for synaptic normalization. Our constraints represent biologically plausible bounds on the minimum and maximum numbers of axonal processes supported by neurons and are not introduced as a mathematical device to enforce competition in the absence of more detailed knowledge concerning the biological mechanisms underlying competition (von der Malsburg, 1973).

The second caveat requires a little explanation. Our "relocation" and "interchange" models, in which sprouting and retraction are always
coupled, automatically include constraints. However, another model based on our framework, the "sprouting-and-retraction" model, uncouples sprouting and retraction, and thus no constraints are enforced, either explicitly or implicitly. Perturbations in network connectivity represent the creation of a new connection or the destruction of an old one. This third model segregates competing afferents, and in modeling anomalous peripheral activity, no hand setting of parameters is necessary (Elliott, Howarth, & Shadbolt, 1996c); in contrast, Miller adjusts afferent normalization parameters.

As for the third caveat, setting $\sigma_i = -1$ rather than $\sigma_i = 0$ to represent afferent inactivity is enough to induce competition. This is easily demonstrated by eliminating the contribution to the energy function that raises the energy when pairs of axonal processes have activities such that $\sigma_i + \sigma_j = 0$: afferent segregation then breaks down in the sprouting-and-retraction model (Elliott et al., 1996c). Curiously, elimination of this contribution from the energy function in the relocation and interchange models does not lead to a breakdown of afferent segregation, but this is not because these two models employ constraints. Rather, it is straightforward to see that the relocation and interchange rules lead automatically to the existence of a statistical force of attraction between axonal processes whose activities are positively correlated and a statistical force of repulsion between axonal processes whose activities are negatively correlated (see Elliott, Howarth, & Shadbolt, 1997, for an analysis). These noncompetitive forces are by themselves enough to induce afferent segregation, even in the absence of the competitive contribution to the energy function. Indeed, it is precisely these forces, particularly the force of repulsion, that lead to the emergence of "cortical gaps" in the relocation model (Elliott et al., 1996a). As to the justification for taking $\sigma_i = -1$ rather than $\sigma_i = 0$ to represent afferent inactivity, see the appendix to this article.

Does this equivalence of the energy functions (modulo caveats) mean that the models are equivalent? Our approach represents a set of biological assumptions and computational techniques in addition to the energy function. This is best seen by comparing the behaviors of the relocation and interchange models. Both models exhibit different results when annealed and when quenched. This is because the dynamics of minimization are different in the two models. Thus, even though both minimize the same energy function, the models are not equivalent; the final outcomes are not identical.

The sprouting-and-retraction model also is not equivalent to the relocation and interchange models. For example, although the energy function is identical in all three of our models, no statistical forces exist in the sprouting-and-retraction model (Elliott et al., 1997). Also, in contrast to Miller's model (Miller, 1990; Miller & MacKay, 1994), the sprouting-and-retraction model does not exhibit an organizationally determined critical period and so is able to accommodate adult plasticity in, for example, the somatosensory
cortex (see footnote 1). This significant difference between our framework and Miller's model arises because the dynamics of minimization are different. Thus, the models are different, even though the energy functions are equivalent (modulo caveats).

It may be argued that the minimization dynamics of a model are to some extent arbitrary. This is true of models that have no clear biological interpretation or for which the minimization procedure is simply a computational algorithm for finding a minimum. But our minimization procedure is constructed as an explicit model of neurons sprouting into regions of high neurotrophic support and retracting from regions of low neurotrophic support, with temperature as some measure of the noise affecting an otherwise orderly process (Elliott et al., 1996a). The dynamics of minimization in our three models represent different assumptions regarding how neurons might maximize their neurotrophic support. We do not therefore accept that the dynamics of minimization are less important than what is being minimized. Indeed, the different dynamics of our three models lead to demonstrably different behaviors and final outcomes, even though the energy functions are identical.

Second, we do not accept that our work shows that subtractive normalization is natural in fixed-anatomy models. Synaptic normalization is typically imagined to result from the decay of synaptic weights, although there is little reason to believe that global normalization will result from local decay. The decay rate will almost certainly be a nonconstant function of the concentration(s) of the decaying substance(s). Biologically, it is difficult to motivate subtractive normalization, since it assumes that the decay rate is independent of the concentration. Computationally, we agree that subtractive normalization is natural, since it leads to steepest-descent minimization. Thus, there is a tension between these two requirements, so the claim that subtractive normalization is natural in fixed-anatomy models entails that such models are not in fact biologically plausible.

A related point is that it is difficult biologically to motivate simultaneous afferent and efferent normalization in fixed-anatomy models. Because normalization is typically imagined to represent a limitation on some resource, afferent (efferent) normalization represents a limitation on a presynaptic (postsynaptic) resource. However, a synaptic weight will be some function

1 To switch off plasticity in layer IV of the visual cortex, we invoke biochemical factors (e.g., Kasamatsu, 1983; Fox, Sato, & Daw, 1989; Gu, Liu, & Cynader, 1994), where these are controlled, presumably, by the overall level of electrical activity in the visual pathways rather than its specific pattern (e.g., Cynader, Berman, & Hein, 1976; Cynader, 1983). An organizationally determined critical period, such as that in Miller's model, could be refuted by rearing kittens under blockade of retinal activity and with direct, simultaneous stimulation of both optic nerves (e.g., Stryker & Strickland, 1984; Weliky & Katz, 1997) so that ocular dominance columns do not form, and then testing for a response to monocular deprivation to see whether the critical period has been extended. Our prediction is that it would not be.
of both pre- and postsynaptic resources. Hence, normalizing weights over both afferents and efferents is biologically problematic: afferent (efferent) normalization should be over the afferent (efferent) resource only. Thus, any attempt to suggest that simultaneously rather than consecutively implemented afferent and subtractive efferent normalization will permit adult plasticity in fixed-anatomy models runs afoul of the same biological-versus-computational dilemma raised above.

Finally, even were the models mathematically equivalent, there can be no justification for the claim that no theoretical distinction can be drawn between sprouting-and-retraction models and fixed-anatomy models. Biological models are characterized by their underlying biological assumptions, interpretations, and evidence, in addition to their mathematical structure. To emphasize the latter at the expense of the former, given that the former often legitimate procedures and techniques in one approach but not in a mathematically equivalent alternative, risks significantly limiting the power of modeling and theoretical science. Theoretical distinctions are mostly, although not always, meaningless only when experimental discrimination is impossible. Anatomical plasticity is plainly experimentally distinguishable from physiological plasticity. For a comprehensive examination of the central role of anatomical plasticity in the development of the nervous system, and an attack on the view that neural development is entirely regressive, as implicit in models such as Miller's, see, for example, Bailey and Kandel (1993), Purves (1994), and Quartz and Sejnowski (in press).

Appendix

Here we justify the use of $\sigma_i = -1$ rather than $\sigma_i = 0$ to represent afferent inactivity. To do this, first we derive a general expression for an energy function; then we manipulate it so as to extract the $\sigma_i = -1$ convention.

Let the activity of afferent process $i$ be denoted by $a_i \in \{0, 1\}$; that is, inactivity is represented by $a_i = 0$. Notice that the $\sigma_i$ variables are related to the $a_i$ variables through $\sigma_i = 2a_i - 1$. The total input to the cell on which process $i$ synapses is taken to be $\sum_j D_{ij} a_j$, where $D_{ij} = 1$ if, and only if, processes $i$ and $j$ synapse on the same cell, and is zero otherwise. The level of neurotrophic factor (NTF) released by the cell on which process $i$ synapses is taken as $R_i = f_R(\sum_j D_{ij} a_j)$, where $f_R$ is some model of the overall production and release process. The NTF released is taken to diffuse rapidly through the target field, with the amount available at each target cell following diffusion assumed to be $A_i = \sum_j \tilde{\Delta}_{ij} R_j$, where $\tilde{\Delta}_{ij} = \Delta_{ij}/s_j$, with $s_j = \sum_k D_{jk}$ being the total number of processes synapsing on the cell on which process $j$ synapses. The function $\Delta_{ij}$ characterizes the diffusion process and is assumed to be appropriately normalized; we have previously used only the nearest-neighbor function. The level of NTF available at each synapse on the cell on which process $i$ synapses is then assumed to be $\tilde{A}_i = A_i/s_i$.
Uptake of NTF is taken to be activity dependent and is assumed to promote anatomical change, by inducing either sprouting or retraction, depending on how much is taken up relative to the resting level required to maintain existing terminals. This is quantified by defining $\tilde{E}_i = -a_i(\tilde{A}_i - \tau^{-1})$, where $\tau$ is some measure of either an activity-dependent requirement or a baseline, resting requirement by each process. Writing $\tilde{E}_i = E_i/(\tau \sum_j D_{ij})$, we have

$$E_i = -\tau a_i \sum_j \tilde{\Delta}_{ij} f_R\!\left(\sum_k D_{jk} a_k\right) + a_i \sum_j D_{ij}. \tag{A.1}$$

This is a general expression for the energy of process i. Setting $\tau = 2$ and taking $f_R(x) = x$ as a simple model of the production and release of NTF, this reduces to

$$E_i = -2 a_i \sum_j \Delta_{ij} a_j + a_i \sum_j D_{ij}. \tag{A.2}$$

Replacing $\sum_j D_{ij}$ by the approximate, "smeared" form $\sum_j \Delta_{ij}$, we obtain

$$E_i = -a_i \sum_j \Delta_{ij} (2a_j - 1) = -a_i \sum_j \Delta_{ij} \sigma_j. \tag{A.3}$$

Since, in our models, we consider only plasticity associated with active processes, we may replace $a_i$ by $\sigma_i$ in this expression. Summing over i, we recover the full energy function using $\sigma_i$, and not $a_i$, as the activity variables.
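To make the smearing step concrete, the following toy check (not part of the original reply) verifies numerically that equations A.2 and A.3 agree whenever $\sum_j D_{ij}$ is replaced by $\sum_j \Delta_{ij}$. The six-process configuration and the random diffusion kernel are illustrative assumptions, standing in for the nearest-neighbor function used previously.

```python
import numpy as np

# Hypothetical configuration: 6 afferent processes, 2 per target cell.
cells = np.array([0, 0, 1, 1, 2, 2])           # cell contacted by each process
D = (cells[:, None] == cells[None, :]).astype(float)

# Illustrative random diffusion kernel Delta, rescaled so that each row
# sums to the corresponding row sum of D (the condition under which the
# smearing replacement is exact).
rng = np.random.default_rng(0)
Delta = rng.random((6, 6))
Delta *= (D.sum(axis=1) / Delta.sum(axis=1))[:, None]

a = np.array([1, 0, 1, 1, 0, 0], dtype=float)  # activities a_i in {0, 1}
sigma = 2.0 * a - 1.0                          # sigma_i = 2 a_i - 1

# Equation A.2: E_i = -2 a_i sum_j Delta_ij a_j + a_i sum_j D_ij.
E_a2 = -2.0 * a * (Delta @ a) + a * D.sum(axis=1)
# Equation A.3: E_i = -a_i sum_j Delta_ij sigma_j.
E_a3 = -a * (Delta @ sigma)

assert np.allclose(E_a2, E_a3)                 # the two forms coincide
```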
Acknowledgments

T. E. thanks the Royal Society for the support of a University Research Fellowship during the latter stages of this work.

References

Bailey, C. H., & Kandel, E. R. (1993). Structural changes accompanying memory storage. Annu. Rev. Physiol., 55, 397–426.
Cynader, M. S. (1983). Prolonged sensitivity to monocular deprivation in dark-reared cats: Effects of age and visual exposure. Dev. Brain Res., 8, 155–164.
Cynader, M. S., Berman, N., & Hein, A. (1976). Recovery of function in cat visual cortex following prolonged deprivation. Exp. Brain Res., 25, 139–156.
Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1996a). Axonal processes and neural plasticity. I: Ocular dominance columns. Cereb. Cortex, 6, 781–788.
Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1996b). Axonal processes and neural plasticity. II: Adult somatosensory maps. Cereb. Cortex, 6, 789–793.
Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1996c). Neural competition and statistical mechanics. Proc. R. Soc. Lond. Ser. B, 263, 601–606.
Elliott, T., Howarth, C. I., & Shadbolt, N. R. (1997). Axonal processes and neural plasticity. III: Competition for dendrites. Phil. Trans. R. Soc. Lond. Ser. B, 352, 1975–1983.
Fox, K., Sato, H., & Daw, N. (1989). The location and function of NMDA receptors in cat and kitten visual cortex. J. Neurosci., 9, 2443–2454.
Gu, Q., Liu, Y., & Cynader, M. S. (1994). Nerve growth factor-induced ocular dominance plasticity in adult cat visual cortex. Proc. Natl. Acad. Sci. U.S.A., 91, 8408–8412.
Kasamatsu, T. (1983). Neuronal plasticity maintained by the central norepinephrine system in the cat visual cortex. Prog. Psychobiol. Physiol. Psychol., 10, 1–112.
Miller, K. D. (1990). Correlation-based models of neural development. In M. A. Gluck & D. E. Rumelhart (Eds.), Neuroscience and connectionist theory (pp. 267–353). Hillsdale, NJ: Erlbaum.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp., 6, 100–126.
Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615.
Purves, D. (1994). Neural activity and the growth of the brain. Cambridge: Cambridge University Press.
Quartz, S. R., & Sejnowski, T. J. (in press). The neural basis of cognitive development: A constructivist manifesto. Behav. Brain Sci.
Stryker, M. P., & Strickland, S. L. (1984). Physiological segregation of ocular dominance columns depends on the pattern of afferent electrical activity. Invest. Ophthalmol. Vis. Sci. (Suppl.), 25, 278.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Weliky, M., & Katz, L. C. (1997). Disruption of orientation tuning in visual cortex by artificially correlated neuronal activity. Nature, 386, 680–685.

Received January 31, 1996; accepted July 21, 1997.
LETTER
Communicated by Ad Aertsen
Synaptic Delay Learning in Pulse-Coupled Neurons
Harald Hüning∗
Abteilung für Neuroinformatik, Universität Ulm, Oberer Eselsberg, D-89069 Ulm, Germany
Helmut Glünder∗∗
Institut für Medizinische Psychologie, Ludwig-Maximilians-Universität, Goethestraße 31, D-80336 München, Germany
Günther Palm
Abteilung für Neuroinformatik, Universität Ulm, Oberer Eselsberg, D-89069 Ulm, Germany
We present rules for the unsupervised learning of coincidence between excitatory postsynaptic potentials (EPSPs) by the adjustment of postsynaptic delays between the transmitter binding and the opening of ion channels. Starting from a gradient descent scheme, we develop a robust and more biological threshold rule by which EPSPs from different synapses can be gradually pulled into coincidence. The synaptic delay changes are determined from the summed potential—at the site where the coincidence is to be established—and from postulated synaptic learning functions that accompany the individual EPSPs. According to our scheme, templates for the detection of spatiotemporal patterns of synaptic activation can be learned, which is demonstrated by computer simulation. Finally, we discuss possible relations to biological mechanisms.

1 Introduction and New Learning Scheme

The timing or coherence of a neuron's input signals determines whether the neuron behaves as an integrator or a coincidence detector (Abeles, 1982). Regarding the number of impulses that are required to exceed a voltage threshold—for example, at the axon hillock or a dendritic site with voltage-dependent mechanisms—temporally incoherent signals are less effective than synchronized ones. However, if we take into account axonal and dendritic propagation times, significant coincidence cannot be expected for synchronous impulse emission (Glünder & Nischwitz, 1993).
∗ Present address: E/E P6, Imperial College, London SW7 2BT.
∗∗ To whom correspondence should be addressed.
Neural Computation 10, 555–565 (1998)
© 1998 Massachusetts Institute of Technology
Consequently, and in contrast to the prevailing paradigm that learning manifests itself in the change of synaptic strengths, we took first steps toward a formalism for the unsupervised learning of individual synaptic delays that aims to produce coincident excitatory postsynaptic potentials (EPSPs) at a defined site. If this site differs significantly from that of the synapses (of course, within the same postsynaptic neuron), we confront the well-known communication problem associated with any form of nonlocal synaptic coincidence learning, Hebbian learning included (see, e.g., Palm, 1982, and section 5 of this article). While other authors account for delay changes by synaptic selection from a multiplicity of axonal or dendritic pathways with different propagation times (e.g., Gerstner, Ritz, & van Hemmen, 1993; Hopfield, 1995; Miller, 1989; Tank & Hopfield, 1987), we propose postsynaptic processes. Promising candidates for adjustable delays between synaptic activation (transmitter binding) and the generation of a postsynaptic potential (opening of ion channels) are experimentally demonstrated molecular messenger cascades (Hille, 1994; Wickman & Clapham, 1995), which we consider structurally less costly than synaptic selection (cf. section 5). Aside from solving timing problems in neural circuits, delay learning can serve the evaluation of spatiotemporal patterns of synaptic activation (Carr, 1993; Eggermont, 1990; Wang, 1995). For such computational purposes, the idea of adjustable delays has been considered by several authors (Baldi & Atiya, 1994; Eckmiller & Napp-Zinn, 1993; Glünder & Nischwitz, 1993; Jansen, Bluhm, Napp-Zinn, & Eckmiller, 1991; Napp-Zinn, Jansen, & Eckmiller, 1996), and recently Hopfield (1995) has suggested a neural pulse position modulation with intensity-invariant demodulation by "coordinated time delays." For nonspiking networks, learning schemes have been formulated by Baldi and Atiya (1994), Bell and Sejnowski (1995), Bodenhausen and Waibel (1991), and Tank and Hopfield (1987), but to our knowledge, no mathematical framework for unsupervised delay learning in pulse-coupled neurons has yet been published.

This article relates our threshold rule (Glünder & Hüning, 1996) for unsupervised learning of synaptic delays to the gradient descent scheme. Figure 1 shows three synapses of a neuron that are activated at times $t_{\mathrm{act},i}$ and their EPSPs delayed by $\tau_i$. The idea is to determine the delay changes $\Delta\tau_i$ during every time interval T where the somatic or a local dendritic depolarization u(t), that is, the summed EPSPs, is above a learning threshold $\theta$ (see Figure 2). For their computation we must assume a secondary process that accompanies each EPSP and determines the amount and direction of the changes. With this postulated learning function $\lambda(t)$, the delay change is

$$\Delta\tau_i \sim \int_T [u(t) - \theta] \cdot \lambda(t - t_{\mathrm{act},i} - \tau_i)\,dt. \tag{1.1}$$
Thus we propose delay changes proportional to the temporal integral of the weighted learning function, where the weighting term is the suprathreshold depolarization u(t) − θ ≥ 0. A good choice for the learning function λ(t) is the EPSP function’s negative derivative (see section 3).
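To illustrate how equation 1.1 can be evaluated, here is a minimal numerical sketch. The α-function EPSP, the truncated learning function $\lambda(t) \sim -\frac{d}{dt}h(t)$, and all parameter values (threshold, step size, time grid), as well as the function names, are illustrative assumptions rather than values from this article.

```python
import numpy as np

def epsp(t, alpha=2.0):
    """Alpha-function EPSP, h(t) = alpha^2 t exp(-alpha t) for t >= 0."""
    return np.where(t >= 0.0, alpha**2 * t * np.exp(-alpha * t), 0.0)

def learning_fn(t, alpha=2.0, t_eps=5.0):
    """lambda(t) ~ -dh/dt, truncated to a finite duration t_eps."""
    lam = -(alpha**2) * np.exp(-alpha * t) * (1.0 - alpha * t)
    return np.where((t >= 0.0) & (t <= t_eps), lam, 0.0)

def delta_tau(t_act, tau, theta=1.0, eta=0.05, dt=0.01, t_end=20.0):
    """One application of equation 1.1 to all synapses simultaneously."""
    t = np.arange(0.0, t_end, dt)
    onsets = t_act + tau
    u = sum(epsp(t - o) for o in onsets)          # net depolarization u(t)
    v = np.maximum(u - theta, 0.0)                # suprathreshold weighting
    return np.array([eta * np.sum(v * learning_fn(t - o)) * dt
                     for o in onsets])

t_act = np.array([0.0, 0.7, 1.5])                 # synaptic activation times
tau = np.array([1.0, 1.0, 1.0])                   # initial postsynaptic delays
for _ in range(200):                              # repeated presentations
    tau = tau + delta_tau(t_act, tau)
print(t_act + tau)                                # EPSP onsets pulled together
```

Repeated application pulls the EPSP onsets toward one another, as section 3 describes for the gradual capture of EPSPs.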
Figure 1: Differently timed ($t_{\mathrm{act},i}$) activation of three synapses at a cell or cell patch evokes delayed ($\tau_i$) EPSPs that result in the net depolarization u(t). The continuously adjustable delays are to be learned for coincident EPSPs.
In the next section we present a gradient descent approach to synaptic delay learning, from which a first learning function is derived, and we introduce the computation of the delay changes at the time of maximum depolarization. In section 3, we further develop this scheme into our threshold rule and generalize the concept of the learning function. We proceed with the simulated formation of a detector that becomes tuned to a spatiotemporal pattern of synaptic activation, and we consider the issue of stability. In the conclusion, we briefly relate our theoretical investigations to known and expected biophysical and neurobiological mechanisms.

2 Relation to Unsupervised Gradient Descent Learning

We relate the unsupervised learning of synaptic delays to schemes of error minimization by using the mathematically convenient parabolic EPSP function (see Figure 2A),

$$h_p(t) = \begin{cases} 1 - t^2 & \text{for } -1 \le t \le 1, \\ 0 & \text{else.} \end{cases}$$

For reasons that will soon become evident, we define the activation onset (beginning of the transmitter binding) of synapse i as $t_{\mathrm{act},i} = t_{\mathrm{ref}} - t_i$. With a relative activation time $t_i > 0$, it then precedes the reference time $t_{\mathrm{ref}}$ at which the delay changes are computed. We assume that N excitatory and
linearly transmitting synapses contribute—each with a single EPSP—to a neuron's depolarization $u(t = t_{\mathrm{ref}})$. For this to happen, the synaptic delays $\tau_i$ must obey the relation $|t_i - \tau_i| < 1$. Then the summed potential is

$$u(t = t_{\mathrm{ref}}) = N - \sum_i (t_i - \tau_i)^2.$$
Clearly, $u(t = t_{\mathrm{ref}})$ becomes maximum for N coincident EPSPs, which can be achieved by minimizing—through gradient descent—the squared Euclidean distance between the relative activation times $t_i$ and the associated delays $\tau_i$. This leads to the well-known learning rule, here for synaptic delays, $\Delta\tau_i \sim t_i - \tau_i$. Unfortunately, the relative activation times $t_i$ are unknown to the neuron. However, $t_i - \tau_i$ can formally be expressed by the derivative of the EPSP function as $-\frac{1}{2}\frac{d}{dt}h_p(t + t_i - \tau_i)\big|_{t=0}$, using $t_{\mathrm{ref}} = 0$ for simplicity. An essential point of this article is that such a secondary and clearly hypothetical signal is indispensable. It accompanies each EPSP (see similar ideas in Gerstner et al., 1993), and we refer to it as a synaptic learning function $\lambda(t)$. For the parabolic EPSP $h_p(t)$, the learning function resulting from gradient descent is $\lambda_p(t) \sim -\frac{d}{dt}h_p(t) = 2t$ for $-1 \le t \le 1$ and zero where the EPSP is zero as well. At the reference time, all EPSPs' learning functions are sampled to give the delay changes (learning increments) of the corresponding synapses. Although specific signals may exist that define a reference time, we now propose to consider the time at which the depolarization u(t) is maximum. If the sum of N parabolic EPSPs exhibits a single maximum, then the hereby defined reference time becomes $t_{\max} = \frac{1}{N}\sum_i (\tau_i - t_i)$, and the maximum potential is $u(t_{\max}) = N - \|\vec{e}\|^2$, with the components of the error vector

$$e_i = (t_i - \tau_i) - \frac{1}{N}\sum_j (t_j - \tau_j).$$
Hence, if we keep to the learning function $\lambda_p(t)$, we arrive at the learning rule $\Delta\tau_i \sim e_i$ (Hüning, 1995). Here, the sampling of the learning function takes place at the maximum of the depolarization. Although this signal-defined reference time is less ad hoc, maximum detection is difficult to implement, highly sensitive to noise, and thus biologically quite implausible.

3 Temporally Distributed Delay Learning (Threshold Rule)

As a scheme for the unsupervised learning of synaptic delays that is more robust with respect to noisy potentials, we finally propose the depolarization-dependent threshold rule (see equation 1.1). With this scheme, delay changes are executed either continuously during, or at the end of, learning intervals $T_\mu$ for which the net depolarization remains above a learning threshold $\theta$ (see Figures 2A and 2B, bottom).
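As a numerical illustration of the rule $\Delta\tau_i \sim e_i$, the short sketch below iterates the update for three synapses with assumed activation times; since the mean of the $e_i$ vanishes, the misalignments $t_i - \tau_i$ converge to their common mean, that is, to coincidence at $t_{\max}$. The step size and initial values are arbitrary.

```python
import numpy as np

t_i = np.array([0.3, 0.1, -0.2])   # assumed relative activation times
tau = np.zeros(3)                  # current delays
eta = 0.5                          # arbitrary step size

for _ in range(20):
    d = t_i - tau                  # per-synapse misalignment
    e = d - d.mean()               # error-vector components e_i
    tau = tau + eta * e            # learning rule: delta tau_i ~ e_i

print(t_i - tau)                   # all equal: EPSPs coincide at t_max
```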
Figure 2: Evaluation of individual synaptic delay changes $\Delta\tau_i$ from the net depolarization u(t) (bottom) of a cell patch with four active synapses. For every synapse, we show an EPSP, its learning function $\lambda$ (dashed line, except bottom), the weighted learning function $v \cdot \lambda$ (with $v(t) = u(t) - \theta \ge 0$), and its integral (dotted line), which is proportional to the delay change. EPSP shape h(t): (A) parabolic, $h_p(t) = 1 - t^2$ for $-1 \le t \le 1$; (B) $\alpha$-function, $h_\alpha(t) = \alpha^2 t \cdot e^{-\alpha t}$ for $t \ge 0$.
Although learning defined by equation 1.1 appears functional even without the suprathreshold function $v(t) = u(t) - \theta \ge 0$, we include this weighting to avoid conflicting, and sometimes stably oscillatory, delay changes that can otherwise result from competing activation patterns in successive learning intervals (see section 4).
If every parabolic EPSP, and therefore the associated synaptic learning function $\lambda_p(t) \sim -\frac{d}{dt}h_p(t)$, entirely covers the learning interval, we obtain from equation 1.1 the learning rule $\Delta\tau_i \sim V_\mu \cdot e_i$, with the suprathreshold area $V_\mu = \int_{T_\mu} [u(t) - \theta]\,dt$. Figure 2A depicts a situation where three parabolic EPSPs cover the learning interval, while a fourth EPSP comes later and is not captured. In contrast to schemes relying on reference times, where EPSPs are not captured if they do not contribute to the sampled depolarization, our threshold rule shows a gradual coupling of EPSPs that only partly reach into the learning interval. In the latter case, the delay changes increase with every presentation of a spatiotemporal activation pattern until an EPSP's maximum enters the learning interval. This behavior becomes pronounced with more realistic EPSPs—that is, with unimodal functions that rise steeply and decay slowly. (For asymptotically decaying EPSPs, we reasonably assume learning functions of finite duration, $\lambda(t > t_\varepsilon) = 0$, with $h(t > t_\varepsilon) < \varepsilon$, where $\varepsilon$ may depend on the noise level.) With this kind of asymmetric EPSP function h(t) and $\lambda(t) \sim -\frac{d}{dt}h(t)$, early EPSPs will be captured long before their maxima enter the learning interval, whereas late EPSPs, which rise after the interval, either fail to be captured (the fourth synapse in Figure 2B) or create a separate learning interval (for a lower threshold than in Figure 2B).

Because various synaptic learning functions are feasible for a given unimodal EPSP function h(t), we have investigated general requirements. Evidently, learning functions must change sign from minus to plus in order to give the direction of the delay changes. Formally, we have found that all learning functions $\lambda(t) = -\frac{d}{dt}f\{h(t)\}$, with f any monotone increasing function, comply with the demand that learning must stop, that is, the integral (see equation 1.1) must vanish, if coincidence of the EPSPs is reached (Hüning, 1995). This holds for all threshold settings. The function f permits one to tailor the properties of the learning process. For instance, it may serve to smooth a learning function's otherwise discontinuous onset and to restrict its duration. Furthermore, we can conclude that coincidence learning still works with EPSPs of various amplitudes.

4 Simulation of Spatiotemporal Template Learning

We demonstrate unsupervised synaptic delay learning by the simulated formation of detectors for spatiotemporal patterns of synaptic activation. As an example, we consider the time courses of activation at 10 synapses of a formal neuron (see Figure 3A). Each of the two distinct patterns lasts longer than a single EPSP. Before the repeated presentations of the pattern pair, the 10 synaptic delays are randomly distributed in the interval $0.5\Delta \le \tau_i \le 3\Delta$, where $\Delta$ is the duration of the parabolic EPSP. Therefore, and because both patterns are well separated in time, the neuron will, at best, be tuned to one of them. A steady time course of the neuron's depolarization (see Figure 3B) is reached after 21 presentations of the pattern pair. (A condensed code sketch of this simulation follows the caption of Figure 3.)
Figure 3: Formation of templates for spatiotemporal patterns. (A) Sample of the stimulation patterns at 10 synapses of a neuron. (B) Steady time course of the net depolarization after delay learning. (C) Final delay configuration $\tau(i)$.
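The following condensed sketch reproduces the spirit of this simulation with parabolic EPSPs and the threshold rule, using a single activation pattern for brevity and illustrative constants (threshold, step size, delay ranges) in place of the article's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def h_p(t):                                   # parabolic EPSP on [-1, 1]
    return np.where(np.abs(t) <= 1.0, 1.0 - t**2, 0.0)

def lam_p(t):                                 # lambda_p(t) ~ -d/dt h_p(t) = 2t
    return np.where(np.abs(t) <= 1.0, 2.0 * t, 0.0)

t_act = rng.uniform(0.0, 3.0, size=10)        # one activation pattern
tau = rng.uniform(1.0, 6.0, size=10)          # random initial delays
theta, eta, dt = 3.0, 0.1, 0.01
t = np.arange(0.0, 12.0, dt)

for _ in range(100):                          # repeated presentations
    peaks = t_act + tau                       # EPSP centers
    u = sum(h_p(t - p) for p in peaks)        # net depolarization u(t)
    v = np.maximum(u - theta, 0.0)            # suprathreshold weighting
    tau += np.array([eta * np.sum(v * lam_p(t - p)) * dt for p in peaks])

print(np.round(t_act + tau, 2))               # learned (imperfect) template
```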
Owing to the greater similarity of the first pattern to the specific random initialization, the threshold rule has produced a detector for this pattern, as is evident from the final synaptic delay configuration (a slightly imperfect template) depicted in Figure 3C. Apart from demonstrating the functionality of delay learning with the threshold rule, our simulation illustrates the effect of temporally limited EPSPs and learning functions, as well as the competition between different patterns. As long as an EPSP contributes to the suprathreshold depolarization, the corresponding synaptic delay becomes adjusted, which in our example is not fulfilled for the second synapse. Furthermore, if both patterns produce suprathreshold depolarizations, we obtain opposing delay changes. However, oscillations are avoided by the weighting term of equation 1.1, which drives the learning process toward the pattern that evoked the largest initial suprathreshold voltage. Because a neuron's delay tuning is not changed by patterns that remain subthreshold, a sufficiently high learning threshold retains a tuning even without any further occurrences of the pattern that gave rise to it. Therefore, a threshold that adapts toward the peaks of the depolarization provides a stable delay tuning. Intermediate thresholds cause an adaptive averaging behavior. Accordingly, the delay tuning can follow a slowly changing and repeatedly presented pattern of synaptic activation (Napp-Zinn et al., 1996),
provided the threshold is crossed at every occurrence. For small, random fluctuations of a pattern, the delays are expected to be tuned to the temporal mean, because the integral in equation 1.1 behaves approximately linearly around the zero crossing for realistic EPSP functions.

5 Conclusion

In summarizing our results, in particular concerning their neurobiological implications, we have to speculate about possible biological mechanisms, which is not an easy task for theoreticians. However, our general impression of the recent progress in the investigation of synaptic mechanisms gives us confidence that suitable biological mechanisms for anything that is logically possible will eventually be found. So one should not worry too much about the concrete mechanisms proposed below. We have presented a systems view of unsupervised and robust coincidence learning in pulse-coupled neurons that essentially relies on three assumptions.

1. Only a sum of EPSPs is accessible at a defined measuring site at which the EPSP coincidence is to be established. Similar to long-term changes of synaptic strengths (Brown, Kairiss, & Keenan, 1990), synaptic delay changes are also assumed to depend on pre- and postsynaptic potentials. With respect to postsynaptic potentials, local dendritic learning is based on dendritic depolarization, whereas more global neural learning relies on the potential at a neuron's axon hillock. The process of delay learning will lead to coinciding EPSPs at these sites.

2. The time course of the voltage above a learning threshold at the measuring site is available to the individual synapses. Voltage thresholds are biologically plausible (Artola & Singer, 1993), and their adaptation according to the long-term mean of the depolarization was proposed earlier (Bienenstock, Cooper, & Munro, 1982). In the case of local dendritic learning schemes, the suprathreshold depolarization can easily be sensed by synapse-related molecular mechanisms. Rules that are nonlocal within the postsynaptic cell require the suprathreshold depolarization to be instantaneously signaled, for example, from a neuron's axon hillock back to all its synapses, which appears more involved. This well-known and indeed fundamental communication problem exists with any form of nonlocal synaptic coincidence learning, Hebbian learning of synaptic strengths included. Except for the work reported by Stuart and Sakmann (1994), to date we have to rely more on speculation than on direct experimental evidence for possible communication mechanisms. Interestingly, Hebbian learning today is generally assumed to be local (Brown et al., 1990), although Hebb (1958) described a global scheme: "When an axon of a neuron x is near enough to fire a neuron y and does so, some change takes place such that x becomes more effective at exciting y. What is this change and how does it work? This is a question to which we have no final answer." In the case of passive or active dendritic
propagation of action potentials (e.g., back from the soma to the synapses), these potentials will act in the same way as the dendritic depolarizations do in our scheme; they will define the learning intervals and the weighting of the learning function.

3. A uniform learning function is attributed to every synapse and is triggered at the opening of its ion channels (EPSP onset). Delay learning, as proposed in this article, requires that the synapse contribute to the postsynaptic depolarization and that a postulated differentiating (biochemical) process parallel its individual contribution. This kind of process could be realized by the interaction of an intracellular messenger, such as an activated G protein, and channel proteins (Destexhe, Mainen, & Sejnowski, 1995). Under these circumstances, we have shown how to compute the delay change of an active synapse from the values of its learning function in conjunction with its suprathreshold depolarization. Biologically speaking, we assume the suprathreshold depolarization to have a nonlinear influence on the differentiating (biochemical) process. During the periods of suprathreshold depolarization, this process could, for example, modify the temporal behavior of the intracellular messengers that determine the delay between the transmitter binding to a transmembrane receptor and the intracellular opening of ion channels. This modification could be similar to changes of presynaptic messenger cascades, as initiated by retrograde diffusion of nitric oxide (Montague, 1993), that are hypothesized to cause long-term changes of synaptic strengths. In this respect, we do not rule out alternative mechanisms for delay changes, such as modifications of the temporal behavior of presynaptic molecular processes. Currently there is increasing interest in membrane-delimited mechanisms of rather direct and thus comparatively fast (within a second) interaction between activated G proteins and ion channels (Hille, 1994; Wickman & Clapham, 1995), but to our knowledge, the properties and modifiability of these interactions on a millisecond time scale have not yet been investigated.

Although the concrete biophysical or biochemical realization of delay learning is still unclear, we have demonstrated that this simple learning mechanism is well within the possibilities of our current neurobiological knowledge and would provide a useful addition to the commonly accepted plasticity of synaptic efficacy.

Acknowledgments

We thank A. Bibbig and T. Wennekers for helpful comments and R. Hudson for improving the text.

References

Abeles, M. (1982). Role of cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends in Neurosciences, 16, 480–487.
Baldi, P., & Atiya, A. F. (1994). How delays affect neural dynamics and learning. IEEE Transactions on Neural Networks, 5, 612–621.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 23–48.
Bodenhausen, U., & Waibel, A. (1991). The tempo 2 algorithm: Adjusting time-delays by supervised learning. In R. P. Lippman, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 155–161). San Mateo, CA: Morgan Kaufmann.
Brown, T. H., Kairiss, E. W., & Keenan, C. L. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annual Review of Neuroscience, 13, 475–511.
Carr, C. E. (1993). Processing of temporal information in the brain. Annual Review of Neuroscience, 16, 223–243.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1995). Synaptic currents, neuromodulation, and kinetic models. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 956–959). Cambridge, MA: MIT Press.
Eckmiller, R., & Napp-Zinn, H. (1993). Information processing in biology-inspired pulse coded neural networks. In Y. Okabe (Ed.), Proc. Int. Joint Conference on Neural Networks '93 (pp. 643–648). Nagoya, Japan.
Eggermont, J. J. (1990). The correlative brain: Theory and experiment in neural interaction. Berlin: Springer-Verlag.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biological Cybernetics, 69, 503–515.
Glünder, H., & Hüning, H. (1996). Detection of spatio-temporal spike patterns by unsupervised synaptic delay learning. In N. Elsner & H.-U. Schnitzler (Eds.), Brain and evolution (Vol. 2). Stuttgart: Thieme.
Glünder, H., & Nischwitz, A. (1993). On spike synchronization. In A. Aertsen (Ed.), Brain theory: Spatio-temporal aspects of brain function (pp. 251–258). Amsterdam: Elsevier.
Hebb, D. O. (1958). Textbook of psychology. Philadelphia: Saunders.
Hille, B. (1994). Modulation of ion-channel function by G-protein-coupled receptors. Trends in Neurosciences, 17, 531–536.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hüning, H. (1995). Lernen durch Laufzeitvariation in Netzwerken impulsgekoppelter Neurone. Diploma thesis, Rheinisch-Westfälische Technische Hochschule, Aachen.
Jansen, M., Bluhm, M., Napp-Zinn, H., & Eckmiller, R. (1991). Asynchronous pulse-processing neural net hardware for dynamic functions based on frequency and phase information. In U. Ramacher, U. Rückert, & J. Nossek
(Eds.), Proc. 2. Int. Conf. Microelectronics and Neural Networks (pp. 359–365). Munich: Kyrill & Method.
Miller, R. (1989). Cortico-hippocampal interplay: Self-organizing phase-locked loops for indexing memory. Psychobiology, 17, 115–128.
Montague, P. R. (1993). The NO hypothesis. In B. Smith & G. Adelman (Eds.), Neuroscience year: Supplement 3 to the encyclopedia of neuroscience (pp. 100–102). Boston, MA: Birkhäuser.
Napp-Zinn, H., Jansen, M., & Eckmiller, R. (1996). Recognition and tracking of impulse patterns with delay adaptation in biology-inspired pulse processing neural net (BPN) hardware. Biological Cybernetics, 74, 449–453.
Palm, G. (1982). Neural assemblies: An alternative approach to artificial intelligence. Berlin: Springer-Verlag.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Tank, D. W., & Hopfield, J. J. (1987). Neural computation by concentrating information in time. Proceedings of the National Academy of Sciences of the USA, 84, 1896–1900.
Wang, D. (1995). Temporal pattern processing. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 967–971). Cambridge, MA: MIT Press.
Wickman, K. D., & Clapham, D. E. (1995). G-protein regulation of ion channels. Current Opinion in Neurobiology, 5, 278–285.

Received January 29, 1996; accepted September 9, 1997.
LETTER
Communicated by Ad Aertsen
Neural Processing in the Subsecond Time Range in the Temporal Cortex
Kiyohiko Nakamura
Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Yokohama 226, Japan
The hypothesis that cortical processing in the millisecond time range is performed by latency competition between the first spikes produced by neuronal populations is analyzed. First, theorems that describe how the mechanism of latency competition works in a model cortex are presented. The model is a sequence of cortical areas, each of which is an array of neuronal populations that laterally inhibit each other. Model neurons are integrate-and-fire neurons. Second, the model is applied to the ventral pathway of the temporal lobe, and neuronal activity of the superior temporal sulcus of the monkey is reproduced with the model pathway. It consists of seven areas: V1, V2/V3, V4, PIT, CIT, AIT, and STPa. Neural activity predicted with the model is compared with empirical data. There are four main results: (1) Neural responses of the area STPa of the model showed the same fast discrimination between stimuli that the corresponding responses of the monkey did: both were significant within 5 ms of the response onset. (2) The hypothesis requires that the response latency of cortical neurons should be shorter for stronger responses. This requirement was verified by both the model simulation and the empirical data. (3) The model reproduced fast discrimination even when spontaneous random firing of 9 Hz was introduced to all the cells. This suggests that the latency competition performed by neuronal populations is robust. (4) After the first few competitions, the mechanism of latency competition always detected the strongest of input activations with different latencies.

1 Introduction

Primates recognize and respond to complex visual stimuli within half a second. Signals from the retina take 25–50 ms to reach the primary visual cortex V1, and the motor system requires about 150 ms to produce a response (Kalaska & Crammond, 1992). This means that 200–300 ms is available for cortical processing. Not only is the processing time restricted to less than a third of a second, but the firing rate of single cortical neurons is usually less than 50 Hz. Each neuron thus produces as few as 10 spikes during the processing. The temporal constraints also limit the number of processing steps (Oram & Perrett, 1994).

Neural Computation 10, 567–595 (1998)
© 1998 Massachusetts Institute of Technology
The time between spike generation in one cell and its influence on the firing rate of a postsynaptic cell is about 5 ms. Latency of the earliest cell response in the superior temporal sulcus (STS) is 70–100 ms, and cells in the STS can discriminate forms of visual stimuli within 5 ms of response onset (Oram & Perrett, 1992). This suggests that the processing from the cortex V1 to the STS could take place within about 50 ms and that as few as 10 synaptic relays occur during the processing. The ventral pathway of form recognition passes through seven areas—V1, V2/V3, V4, PIT, CIT, AIT, and STPa—so only one or two synaptic relays can occur within each area.

This article analyzes a hypothetical mechanism that performs the cortical processing within that time range (Nakamura, 1992, 1993a). The mechanism was intended to account for the following characteristics. First, each cell produces as few as 10 spikes during the few hundred milliseconds available for this processing. Second, cortical processing in that time range usually becomes possible only after training. For example, children need to experience repeated stimulus-response cycles before they learn to discriminate cats from other animals within a few hundred milliseconds. Third, neural mechanisms of the brain are robust against noise, because brain cells are subject not only to sensory signals but also to many kinds of noise.

The mechanism works through two processes: latency competition and reinforcement of cortical connections. The competition is between neuronal populations that laterally inhibit each other (Nakamura & Ichikawa, 1989). When the populations are activated simultaneously, those receiving the strongest activations fire first and inhibit the others from firing. This mechanism detects the most strongly activated populations in the time range of firing latency, because the first spikes indicate that the populations producing them have been most strongly activated. Suppose that the competition functions as neural processing in every cortical area. Then cortical processing by way of multiple areas could be done within a few hundred milliseconds. The process of reinforcement is to increase the synaptic efficacy of cortical pathways that lead the first spikes to relevant cells of the motor cortex. Repeated learning selectively reinforces the pathways along which neuronal excitation producing relevant motor responses travels first (Nakamura, 1993a).

A number of network models using lateral inhibition have been proposed. In many, modules producing the strongest outputs win the competition by inhibiting the others, and stimulus selectivity emerges gradually through computational cycles of connectivity change (Rumelhart & Zipser, 1986; Grossberg, 1987; Fukushima, 1980). A main difference from the mechanism presented here is that these earlier models did not take account of the temporal aspects of the competition in each cycle. As a result, they failed to provide an estimate of the time taken by each competitive cycle. Since they have not dealt specifically with the time for each cycle, their response time is usually measured by the time taken for the emergence of the stimulus selectivity. Consequently, the earlier models have been considered to describe slow
mechanisms. The model presented here points out that the winners of every neural competition are determined by firing latency and that the cycles of connectivity change correspond to a learning process that forms reinforced pathways producing rapid and relevant responses. Coultrip, Granger, and Lynch (1992) presented a network model of latency competition in which single neurons compete and a single inhibitory neuron delivers lateral inhibition. This network architecture may not work in noisy environments, because damage to single neurons immediately changes the results of the competition. Biological mechanisms need to be robust and to function appropriately despite noise. The circuit of the model presented here (Nakamura & Ichikawa, 1989) is composed of neuronal populations and encodes signals in the ratios of firing neurons in the populations. Its behavior hardly varies under noisy conditions. Another network model using a latency mechanism, presented by Opara and Wörgötter (1996), has demonstrated that a delay mechanism in the early stage of the visual pathway speeds up synchronization for cell assemblies representing different objects.

Although these models did not take neural plasticity into consideration, a number of studies have shown cortical plasticity in the visual pathway of the temporal lobe. Miyashita and colleagues (Miyashita & Chang, 1988; Sakai & Miyashita, 1991; Miyashita, 1990) have shown that extensive training affects the selectivity of cells in the AIT cortex. Kobatake, Tanaka, Wang, and Tamori (1993) also have reported that training makes cells of the inferotemporal cortex responsive to the presented stimuli. These observations suggest that the stimulus selectivity of cells might result from the adjustment of cortical connectivity. The model here accounts for the stimulus selectivity produced by the connectivity adjustment.

Oram and Perrett (1992) have shown that cells of the monkey STS discriminate visual stimuli within 5 ms of response onset. We expect the hypothetical mechanism of latency competition and connectivity reinforcement to account for this discrimination. To see if it actually does, we developed a model of the ventral pathway of the temporal lobe and used a computer simulation of their experiment to see whether the model reproduced the neuronal activity of the monkey.

2 Model

2.1 Architecture of the Model Cortex. The model cortex consists of the sensory cortex, association areas, and motor cortex. These areas are connected in series, and each is represented by an array of neuronal populations (see Figure 1A). Every population has three types of neurons: excitatory stellate cells, pyramidal cells, and inhibitory interneurons (see Figure 1B). Stellate cells transmit signals to pyramidal cells, which project to inhibitory interneurons and also send fibers to the next areas. Inhibitory neurons extend lateral projections to nearby pyramidal cells. Each cell in Figure 1B represents not a single cell but a population of a type of cells.
Figure 1: Model cortex. (A) Arrays of neuronal populations and connections between them. Projections from the brain stem and limbic system pervade the whole cortex. (B) Neuronal circuitry of a single cortical area. (C) Populations of neurons of two types and connections between them.
Because each cell in Figure 1B is modeled by a cell population, as shown in Figure 1C, every connection in Figure 1B represents a bundle of fibers connecting cell populations. The neuronal populations need not be "columns," but they may spatially overlap. What is necessary for our model is the neuronal connectivity shown in Figure 1. If the populations overlap, inhibitory collaterals (Kawaguchi, 1995; Kang, Kaneko, Ohnishi, Endo, & Araki, 1994) pervade many populations.
2.2 Model Neuron. Each neuron is represented by a single compartment, and its change in membrane potential V is given by

$$C\,\frac{dV}{dt} = -\sum_k G_k V + \sum_k G_k E_k, \tag{2.1}$$
where C is the membrane capacitance and $G_k$ and $E_k$ are the channel conductance and electromotive force of ion k, respectively. The conductances $G_k$ change when the neuron receives spikes, and we simplify the process of spike generation: when V exceeds a threshold h, the neuron fires and delivers spikes of duration $\tau_s$. After that, the neuron enters a refractory state, which lasts $\tau_r$ and during which the neuron does not respond to any spikes that may arrive. By the end of the refractory period, V has returned to the resting potential $V_o$. Consecutive firing reduces the firing rate, and this is modeled by increasing the threshold h:

$$dh/dt = \Delta_h\,\delta(t) - C_h (h - h_o), \tag{2.2}$$
where $\delta(t) = 1$ at the time t when the neuron begins to deliver a spike and 0 otherwise, and where $\Delta_h$, $C_h$, and $h_o$ are, respectively, the increment per spike, the coefficient of recovery, and the initial value of h. This equation implies that the firing threshold h rises by $\Delta_h$ at every firing and returns to the initial value $h_o$ at the rate $C_h$.

2.3 Input-Output Variables of Neuronal Populations. The nervous system is considered to encode information in the ratios of firing neurons in neuronal populations. These ratios are averages across neurons, not over time. Let x(t) denote the ratio of firing neurons to all the neurons of a population at time t. Neurons producing spikes are counted as the firing neurons. The conductance $G_k$ of a neuron receiving projections from the population is written in the form

$$G_k(t) = [W_k x(t) + 1] G_{ok}, \tag{2.3}$$
where $G_k(t)$ and $G_{ok}$ denote the values of $G_k$ at time t and at x(t) = 0, respectively, and $W_k$ is a coefficient representing the effectiveness of synaptic transmission on the change of conductance $G_k$. The value of $W_k$ is derived from the conductance changes produced by single spikes (for details, see Nakamura, 1993b).
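The following sketch shows one way to implement equations 2.1 through 2.3 for a population of such neurons. All parameter values and names are illustrative assumptions, not the values given in the article's appendix C.

```python
import numpy as np

class ModelNeuron:
    """Single-compartment neuron of equations 2.1-2.3 (illustrative values)."""
    def __init__(self):
        self.C = 1.0
        self.G_ok = {"leak": 0.05, "exc": 0.10}   # conductances at x(t) = 0
        self.E = {"leak": 0.0, "exc": 5.0}        # electromotive forces E_k
        self.W = {"leak": 0.0, "exc": 20.0}       # synaptic coefficients W_k
        self.V, self.V0 = 0.0, 0.0
        self.h, self.h0, self.dh, self.Ch = 1.0, 1.0, 0.5, 0.01
        self.tau_s, self.tau_r = 1.0, 2.0         # spike and refractory times
        self.spiking_until = -np.inf
        self.refractory_until = -np.inf

    def step(self, t, dt, x):
        """Advance the membrane state by dt for presynaptic firing ratio x."""
        # Threshold recovery term of equation 2.2.
        self.h += -self.Ch * (self.h - self.h0) * dt
        if t < self.refractory_until:             # no response while refractory
            self.V = self.V0
            return
        # Equation 2.3: conductances modulated by the firing ratio x(t).
        G = {k: (self.W[k] * x + 1.0) * self.G_ok[k] for k in self.G_ok}
        # Equation 2.1: single-compartment membrane dynamics.
        dV = (-sum(G.values()) * self.V
              + sum(G[k] * self.E[k] for k in G)) / self.C
        self.V += dV * dt
        if self.V > self.h:                       # spike generation
            self.h += self.dh                     # jump term of equation 2.2
            self.spiking_until = t + self.tau_s
            self.refractory_until = self.spiking_until + self.tau_r
            self.V = self.V0

# Output firing ratio of a 100-neuron population (section 2.3): the
# fraction of neurons currently emitting a spike.
pop = [ModelNeuron() for _ in range(100)]
dt = 0.1
for k in range(300):
    t = k * dt
    for n in pop:
        n.step(t, dt, x=0.03)
    x_out = np.mean([t < n.spiking_until for n in pop])
```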
2.4 Synaptic Plasticity. Synaptic plasticity of corticocortical connections is assumed, and the synapses of these connections are reinforced when the following three conditions are satisfied simultaneously: (1) the connections deliver spikes, (2) the postsynaptic neurons fire, and (3) the projections from the brain stem and limbic system are activated. The first and third conditions were introduced according to findings concerning the plasticity of ocular dominance in the cat visual cortex (Karil, Dubin, Scott, & Stark, 1986; Gordon, Allen, & Trombley, 1988). Although this plasticity was found during development and in a specific part of the cortex, it is assumed here to function even in adults and all over the neocortex. The second condition is based on the characteristics of Ca²⁺, which is involved in switching the biochemical processes of plasticity. Sufficient Ca²⁺ influx is necessary for these processes. Since Ca²⁺ channels are voltage dependent, a large increase in membrane potential caused by firing evokes the influx. The projections from the brain stem and limbic system may convey signals related to attention and drive. Their activation instructs the nervous system that the current sensory stimuli should be memorized.

3 Model Analysis of Latency Competition Mechanism

The model describes how cortical processing that occurs in the millisecond time range is performed and is based on the work of Nakamura and Ichikawa (1989) and Nakamura (1993a). A sensory stimulus activates certain neuronal populations of the sensory cortex, and these activated populations deliver spikes to the next area, where, as shown in Figure 2A, the populations receiving the strongest connections are the first to fire. The first spikes excite populations of inhibitory interneurons (see Figure 1B) that laterally inhibit the other populations before they fire. Consequently, only the populations receiving the strongest connections are allowed to produce spikes. This mechanism is described by the following theorems. First, the response latency of neuronal populations decreases with strong activation:

Theorem 1. When x(t) is a step input, the response latency of a neuronal population decreases with large synaptic input $W_k x(t) G_{ok}$ under the following conditions: (1) synaptic input is excitatory and acts only on a single ion conductance $G_{ka}$, and (2) increments in $W_{ka} x(t)$ are nearly the same for all the neurons of the population.

For the proof, see appendix A. The assumption of a step input x(t) will be discussed in section 6.3. The conditions were introduced for analytical simplicity, but they also have some biological plausibility. Condition 1 is satisfied if the corticocortical connections deliver the same transmitter and it acts on a single ion conductance. Condition 2 holds if the input increments are produced by an increase in x(t).
Figure 2: Neural mechanism of latency competition. (A) Responses of model neuronal populations receiving stepwise activations of different strengths. The abscissa and the ordinate, respectively, indicate the time after onset of the activations and the ratios of firing neurons in the populations. (B) Responses of neuronal populations in a single cortical area of the model (see Figure 1B). Five rows show changes in ratios of firing neurons in the populations receiving the same activations as in A. (C) Schematic illustration of connectivity reinforcement. Bold arrows denote reinforced corticocortical connections. Filled circles show activated neuronal populations, and open circles show unactivated neuronal populations.
The lateral inhibition restricts firing to the most strongly activated populations:

Theorem 2. Assume the same conditions as in Theorem 1. Then only the most strongly activated populations fire if the projections of pyramidal cells are strong enough to fire the inhibitory interneurons and if the resultant inhibition is strong enough to cancel out the excitation of pyramidal cells and lasts longer than the refractory period of pyramidal cells.

A quantitative description and proof are given in appendix B. Figure 2B shows a simulation result. Since the interneurons inhibit pyramidal cells of their own populations as well, excitation of the inhibitory neurons is suppressed.
When the circuit is released from the inhibition, it repeats the same competitive process. As a result, the most strongly activated populations fire in an oscillatory manner. Note that the competitive mechanism detects the strongest inputs with the first spikes, whose latency is as short as a few milliseconds. If processing in each cortical area is performed by this mechanism, cortical processing by way of multiple areas can be done in a few hundred milliseconds. And no more than one spike is needed from each neuron during the cortical processing. The short processing time and the small number of spikes required for the processing meet the biological constraints already noted.

Consider how cortical processing by way of many areas is performed in reinforced circuits. Assume that the synaptic weights of corticocortical fibers are initially weak and set at random values. Suppose a sensory stimulus excites some populations of the sensory cortex. It follows from Theorem 2 that only the populations receiving the strongest connections produce spikes in the first area of association cortex. While the sensory stimulus is presented, every activation of the projections from the brain stem and limbic system selectively reinforces the same connections between the firing populations and makes the initially strong connections stronger. This makes the specific populations of the area strongly responsive to the sensory stimulus (see Figure 2C).¹ Connections to subsequent areas are reinforced in the same way, and strong pathways running through those reinforced connections are formed. Consequently, we have the following theorem:

Theorem 3. Assume the same conditions as in Theorem 2 and that corticocortical connections are initially set at random strengths and are so weak that only strong activation of some populations in the projecting areas may weakly excite populations in the target areas. When the projections from the brain stem and limbic system are repeatedly activated under a sensory stimulus, certain pathways of the model cortex are selectively reinforced, and neuronal populations along these pathways become strongly responsive to the sensory stimulus.

If the projections from the brain stem and limbic system convey reward signals, connections with the motor cortex are reinforced only if activation of their postsynaptic neurons produces rewarded motor responses. This is because firing of those neurons leads to activation of the projections and evokes the synaptic change (Nakamura, 1993). After reinforced pathways to the motor cortex are established, sensory signals travel first along them, and animals respond with rewarded actions.

¹ Reward signals from the brain stem and limbic system are delivered after the motor response, and sensory stimuli may therefore change before reward signals arrive. This time delay can be offset by reverberation in the corticohippocampal circuits. A detailed discussion of this has been published in Nakamura (1993b), which also discusses how the neural circuit starts or stops learning.
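A toy numerical illustration of Theorems 1 and 2 follows: a leaky integrator driven by a stronger step input crosses the firing threshold earlier, so lateral inhibition triggered by its first spike can silence a more weakly driven rival. The integrator form and all constants are illustrative simplifications of the population model above, not the article's formulation.

```python
def first_crossing(g, E=5.0, h=1.0, leak=0.05, dt=0.01, T=50.0):
    """Latency of the first threshold crossing for a step input g."""
    V = 0.0
    for k in range(int(T / dt)):
        V += (g * (E - V) - leak * V) * dt    # leaky integration toward E
        if V > h:
            return k * dt
    return None                               # never reaches threshold

lat_strong = first_crossing(0.30)             # strongly activated population
lat_weak = first_crossing(0.22)               # weakly activated population
print(lat_strong, lat_weak)                   # stronger input, shorter latency
assert lat_strong < lat_weak
# Theorem 2 then follows if the inhibition evoked by the first spike
# arrives before lat_weak and outlasts the weak population's approach to
# threshold: only the strongly activated population ever fires.
```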
4 Simulation

The model was used to simulate the experiment done by Oram and Perrett (1992).

4.1 Model of Temporal Lobe. The experimental data were obtained in response to the presentation of visual form (head views), so the model cortex was that of the ventral pathway of the temporal lobe. It consists of seven areas: V1, V2/V3, V4, PIT, CIT, AIT, and STPa. They are connected in series, and the V1, V4, and PIT areas also send direct projections to the V4, CIT, and AIT areas (see Figure 3A). Each cortical area is a 5 × 5 array of neuronal populations (see Figure 3B), and each cell population contains 100 stellate cells, 100 pyramidal cells, and 100 inhibitory interneurons. These neurons are connected as shown in Figure 1. The simulator includes 100 × 3 × 5 × 5 × 7 (= 52,500) neurons in total. The strength of corticocortical connections becomes weaker as the length of the connections increases, as shown in Figure 3C. Lateral inhibition also becomes exponentially weaker with distance. The synaptic plasticity is formulated as

$$dW_k/dt = C_r x(t) I(t) r(t) W_k - C_d W_k, \tag{4.1}$$
where I(t) is the firing function of the postsynaptic neuron (I(t) = 1 when the neuron fires and 0 otherwise), r(t) is the firing ratio of the projections from the brain stem and limbic system, and $C_r$ and $C_d$ are coefficients of reinforcement and decay, respectively. The values of $W_k$ are assumed not to exceed an upper limit $\hat{W}_k$. The parameter values of equations 2.1 to 4.1 are given in appendix C.

4.2 Stimuli. Stimuli to the model cortex were activations of neuronal populations of the area V1. They were activated in the six patterns shown in Figure 4. For the populations indicated by filled circles, the firing ratios of the afferent connections were increased from 0.009 (spontaneous activity) to 0.03. Since the duration of spikes was set to 1 ms, the ratio of 0.03 corresponds to a firing rate of 30 Hz in single neurons. Spontaneous activity was also introduced to all neurons: the potential V was randomly raised above the firing threshold $h_o$ at the rate of spontaneous activity (S/A). The rate was set at 9 Hz according to the empirical data (Oram & Perrett, 1992).
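A sketch of the weight dynamics of equation 4.1, together with the upper bound $\hat{W}_k$ and the firing-ratio encoding just described, is given below. The coefficient values are illustrative assumptions; the article's values appear in its appendix C.

```python
def update_W(W, x, I, r, Cr=50.0, Cd=0.01, W_hat=50.0, dt=1.0):
    """Equation 4.1: dW/dt = Cr x(t) I(t) r(t) W - Cd W, bounded by W_hat."""
    dW = (Cr * x * I * r - Cd) * W
    return min(W + dW * dt, W_hat)

W = 0.5                        # initially weak corticocortical connection
for _ in range(1000):
    x = 0.03                   # presynaptic firing ratio (30 Hz, 1-ms spikes)
    I = 1.0                    # postsynaptic neuron fired in this interval
    r = 0.03                   # brain stem / limbic projections active
    W = update_W(W, x, I, r)
print(W)                       # grows toward W_hat while all conditions hold

# With any of the three conditions absent (x, I, or r equal to 0), only
# the decay term -Cd W remains and the connection weakens.
```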
Figure 3: (A) Corticocortical connections of the ventral pathway of the temporal lobe. (B) A 5 × 5 array of neuronal populations in a single cortical area of the model simulator. (C) Change in synaptic efficacy of lateral and corticocortical connections with increasing distance between neuronal populations. The efficacy decreases exponentially as the distance increases.
Figure 4: Six patterns of activation of the area V1. Filled circles indicate activated neuronal populations, and open circles are unactivated populations.
4.3 Procedure. This study assumed that the monkeys Oram and Perrett (1992) used had learned the head views before the experiment, and the learning process was therefore simulated first. Initially, the values $W_k$ of the corticocortical connections were set at small random values, and the neurons of STPa made no response to any stimulus. Learning was then conducted: the projections from the brain stem and limbic system were kept activated—that is, r(t) was increased to 0.03—and the six stimuli were sequentially provided. The learning was continued until the activity of neurons in the STPa was significantly different from the S/A. The time required for the learning depended on the values of the reinforcement and decay parameters $C_r$ and $C_d$. A large value of $C_r$ reduces computational time but makes learning unstable. After learning, cell responses of the STPa were examined, just as in the experiment. The six test stimuli and a control stimulus (no activation of the area V1) were provided five times. The timing of the S/A spikes differed from stimulus to stimulus. During the examination, it was assumed that reinforcement was negligible, that is, that $C_r = 0$.

4.4 Data Analysis. First, the same analysis that Oram and Perrett (1992) had done was conducted. Pyramidal cells were classified according to the mean cell response (spikes per second) over a period of 250 ms starting at stimulus onset. In the simulation, the stimulus onset was the activation of area V1, whereas in the experiment, it was the presentation of the visual stimuli (Oram & Perrett, 1992). Therefore, the former follows the latter by a few tens of milliseconds. Cells were selected when analysis of variance (ANOVA) showed that at least one test stimulus elicited activity significantly (p < 0.025) greater than that observed after the control stimulus. For each cell, responses were categorized into three groups: Best (response to the most effective stimuli), Worst (response to the least effective stimuli), and Mid (response between the former two). The three categories were defined as 20% ranges of the full range of responses to different stimuli: Worst, 0–20%; Mid, 40–60%; and Best, 80–100%. Of the selected cells, those that had responses in all three categories were picked out. The responses obtained in five trials were averaged within each response category for each cell on a bin-by-bin basis (bin size was 5 ms). This yielded a peristimulus time histogram (PSTH) with 120 bins. The latency was taken as the first of three consecutive time bins in which the mean response was in excess of the 95% confidence interval of the S/A. If different categories gave different latencies, the shortest was taken as the cell response latency. The mean firing rate of the cells during the first, second, and fifth 100 ms of the response after the estimated cell response latency was also calculated.
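The latency criterion just described can be stated compactly in code. The sketch below applies it to synthetic spike counts (a placeholder for the simulated PSTHs); estimating the S/A confidence interval from prestimulus bins is an assumption made here for the sake of a self-contained example.

```python
import numpy as np

rng = np.random.default_rng(2)
bins, trials = 120, 5                       # 5-ms bins, five trials
sa_rate = 9.0 / 200.0                       # 9 Hz -> expected spikes per 5 ms

# Synthetic PSTH: spontaneous activity everywhere, response from bin 22 on.
counts = rng.poisson(sa_rate, size=(trials, bins)).astype(float)
counts[:, 22:40] += rng.poisson(0.5, size=(trials, 18))
psth = counts.mean(axis=0)                  # trial-averaged PSTH

# 95% upper confidence bound of the S/A, estimated from prestimulus bins.
sa = psth[:20]
upper = sa.mean() + 1.96 * sa.std(ddof=1) / np.sqrt(sa.size)

# First of three consecutive bins exceeding the bound gives the latency.
above = psth > upper
latency_bin = next((i for i in range(bins - 2)
                    if above[i] and above[i + 1] and above[i + 2]), None)
print(None if latency_bin is None else latency_bin * 5, "ms")
```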
Two response measurements introduced by Oram and Perrett (1992) were calculated: the population response and the average cell response. The population response consists of PSTH profiles calculated in the three response categories for the entire population of selected cells. They were obtained by three procedures: (1) normalizing the response magnitude of each cell to the magnitude of the difference between the S/A and the peak response of the Best category, then (2) averaging the response rate in each time bin across all cells, and finally (3) renormalizing as in procedure 1. The average cell response was calculated from PSTHs, each of which was shifted by the cell's estimated response latency. The shifted PSTHs were processed according to the three procedures just described. Oram and Perrett introduced this synchronization of response onsets because the response onsets of the monkey cells were fairly widely distributed, and it was necessary to examine the difference between cell responses following response onset. To estimate the efficiency of discrimination between different response categories, the responses in each time bin were subjected to a two-way ANOVA with response category as a fixed factor and cell as a random factor. Firing rates above S/A were used without normalizing the magnitude of responses.

The simulation provided data concerning concurrent activity in multiple cells, which was not recorded in the experiment. Concurrent cell responses to one activation pattern were examined at the model STPa. The activation pattern was that of Figure 4B. The concurrent cell response was calculated as follows: for each of the three response categories, 50 cells that made responses of the corresponding category were selected at random. The responses of the 50 cells were averaged within each response category for each trial on a bin-by-bin basis. The difference in average concurrent responses between the response categories was estimated in each bin for each of the five trials with the use of ANOVA. For each category, the mean firing rates of the corresponding 50 cells were used.

The change in the synaptic coefficients $W_k$ caused by learning was also calculated. After the learning period, the values of $W_k$ were averaged within each postsynaptic neuronal population to obtain the average synaptic efficacy between neuronal populations. Increments in the average synaptic efficacy from their initial values were calculated as percentages of the initial values. The same calculation was also performed after the first half of the learning period.

5 Results

At the end of learning, 2469 cells showed statistically significant responses to some stimuli; they were 98.7% of the pyramidal cells (5 × 5 × 100 cells) of the model STPa. The numbers of cells responding to the six activation patterns are listed in Table 1 for each of the three response categories. Among them, 399 cells made responses in all of the Best, Mid, and Worst categories. In the experiment (Oram & Perrett, 1992), 44 cells were suitable for the data analysis. For comparison, 50 of the 399 cells were sampled at random. The change in firing rates of those cells is listed in Table 2. The PSTHs of the cells were also calculated. A PSTH of one cell for the Best response category is illustrated in Figure 5A. The response showed a very rapid rise followed by a slower decline in firing rate, as did the monkey cells. The rate of rise depends on the membrane parameters C, $G_k$, and $E_k$, which were set according to physiological data, and on the synaptic efficacy $W_k$, which was adjusted by learning. The decline was produced by the increasing firing threshold of the model neurons.
Table 1: Distribution of Cells Responding to the Six Input Activations.

                        Response Category
Activation Pattern    Best     Mid     Worst
A                        1       0       398
B                      210      99        90
C                        0      89       310
D                      139     160        49
E                       49       0       325
F                        0     112       250

Note: Activation pattern labels correspond to those in Figure 4.
Figure 5: Peristimulus time histograms of the Best (A) and Worst (B) response categories of one cell.
by adjusting the parameters Δh and Ch.² A PSTH of a cell for the Worst response category is shown in Figure 5B. The firing rate showed a small rise followed by a rapid and complete decay. Means of the estimated response latency for the Best, Mid, and Worst response categories were, respectively, 41.2, 49.6, and 62.4 ms. Increments in the mean latency from that of the Best category are shown in Figure 6.
² There was still some disparity: firing rates for the Best and Mid categories in the model were larger than those in the experiment, and for the Worst category they were smaller. Reducing the firing rates of the former categories made the response of the Worst category statistically indistinguishable from the S/A. Since this model was intended to analyze latency competition, which is related to response onset, only the decline in firing rate was modeled.
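The PSTH construction and latency rule of section 4.4 can be made concrete with a short sketch. This is an illustration only, not code from the article; the function names and the form in which the spike data and the S/A confidence limit are supplied are assumptions.

```python
import numpy as np

def psth(trial_spike_times, bin_ms=5, n_bins=120):
    """Mean firing rate (spikes/s) per 5-ms bin, averaged over trials.
    trial_spike_times: one array of spike times (ms from stimulus onset) per trial."""
    edges = np.arange(n_bins + 1) * bin_ms
    counts = np.mean([np.histogram(t, bins=edges)[0] for t in trial_spike_times], axis=0)
    return counts * 1000.0 / bin_ms

def response_latency(rate, sa_upper_ci, bin_ms=5):
    """Latency rule of section 4.4: the first of three consecutive bins in which
    the mean response exceeds the upper 95% confidence limit of the S/A."""
    above = rate > sa_upper_ci
    for b in range(len(above) - 2):
        if above[b] and above[b + 1] and above[b + 2]:
            return b * bin_ms
    return None  # no defined response onset for this category
```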
Table 2: Modeled and Measured Firing Rates (spikes per second).

                       A. Model                     B. Experiment
Period             Best    Mid    Worst         Best    Mid    Worst
Peak              187.3  136.8    36.3         115.8   82.4    51.2
First 100 ms       94.6   59.8    17.1          66.1   43.6    24.3
Second 100 ms      82.7   34.1    12.0          45.1   31.8    16.0
Fifth 100 ms       32.9   14.4     9.6          27.7   19.7    13.1

Source: Oram & Perrett, 1992.
Figure 6: Increment in mean response latency versus the response category. Stimuli of the Best category fire cells at the greatest mean rate during the poststimulus period of 250 ms, and those of the Worst category fire cells at the smallest mean rate. Open circles indicate data of the model simulation, and filled circles indicate the empirical data of Oram and Perrett (1992).
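The three-step normalization behind the population profiles of Figure 7 (procedures 1–3 in section 4.4) can be sketched as follows; the array shapes and names are illustrative assumptions, not taken from the article.

```python
import numpy as np

def population_profiles(psths, sa, best_peak):
    """Population PSTH profiles per section 4.4.
    psths: dict mapping 'Best'/'Mid'/'Worst' to (n_cells, n_bins) rate arrays;
    sa: per-cell spontaneous rate; best_peak: per-cell peak rate of the Best PSTH."""
    scale_cell = (best_peak - sa)[:, None]
    avg = {}
    for cat, p in psths.items():
        normalized = 100.0 * (p - sa[:, None]) / scale_cell   # procedure 1
        avg[cat] = normalized.mean(axis=0)                     # procedure 2
    renorm = 100.0 / avg["Best"].max()                         # procedure 3
    return {cat: a * renorm for cat, a in avg.items()}
```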
Weaker responses occur at longer latencies in both the model simulation and the experiment. The mean response latency of the monkey cells was 111.2 ms for the Best response category. Assuming that visual signals took 70 (111.2 − 41.2) ms to reach area V1, Figure 6 shows that the response latencies of the model STPa are in good accordance with those of the monkey STPa.

5.1 Population Response. The amplitude-normalized responses (S/A, 0%; peak from Best, 100%) were averaged by response category to produce the population PSTH profiles (see Figure 7A). They reproduce those of the experiment (see Figure 7B). Although the firing rates of model cells tend to
Figure 7: Cell population responses and discrimination between the responses. (A, B) PSTH response profiles in the model simulation and in the experiment (Oram & Perrett, 1992). Clear bars show the Best category, hatched bars show the Mid category, and solid bars show the Worst category. Firing rate is expressed as a percentage of peak response in the Best category. (C, D) The results of statistical evaluation of response discrimination in the simulation and the experiment: F-ratio computed for each time bin across the three categories. Discrimination reaches significance (P < 0.05) 45 ms after stimulus onset (activation of V1) in the simulation and 116 ms after stimulus onset (presentation of visual stimuli) in the experiment.
be higher than those of the monkey (see Table 2), the simulation results are fairly close to the empirical data. A two-way ANOVA (fixed factor, response category; random factor, the 50 cells) was performed between the three response categories for each of the 120 time bins. Values of the F-ratio are plotted against time in Figure 7C. The discrimination between stimuli reaches a statistically significant level
45 ms after stimulus onset (activation of V1) and is very reliable, as in the experiment (see Figure 7D). The mean response latency after stimulus onset of the selected cells is 37.6 ms. In the experiment, the time needed for the discrimination measure to reach a significant level was 116 ms, and the response latency was 90 ms. Note that the stimuli in the experiment were stimuli to the retina and that those times should therefore be longer in the empirical data than in the simulation. Significant discrimination occurred 7.4 ms after response onset in the model simulation and 26 ms after response onset in the experiment.

5.2 Average Cell Response. Some cells did not have a clear response onset for some categories. Nineteen cells had a defined response onset for all the response categories 20 ms after stimulus onset. Responses of those cells were synchronized at 20 ms after stimulus onset; that is, the cell responses of the three categories were shifted by the difference between the cell's estimated response latency and 20 ms of poststimulus time. For those 19 cells, PSTH profiles were calculated after the synchronization of response onsets (see Figure 8A). The synchronization makes the Best, Mid, and Worst responses begin at the same latency, though responses of different categories could occur at different latencies for each cell. Discrimination between stimuli was again analyzed by ANOVA of firing rate in different time bins across the 19 cells. Changes in the F-ratio values with time are shown in Figure 8C. The discrimination reaches a statistically significant level within the first 5 ms of response onset, as it does in the empirical data (see Figure 8D).

5.3 Concurrent Cell Response. For each of five trials, the concurrent responses of the cells were averaged within each response category (see Figure 9A). Responses of the Best category rose first in every trial. They were followed by responses of the Mid category. Some trials did not produce clear responses of the Worst category. All the responses were considerably oscillatory, and cells of the same response categories fired synchronously. This would correspond to the synfire patterns proposed by Abeles (Abeles et al., 1993). The oscillation did not appear in the population PSTH profiles because the responses there were averaged across the trials. Discrimination among responses of the Best, Mid, and Worst categories was analyzed by ANOVA of firing rate in different time bins. The statistical evaluation is illustrated in Figure 9B. Statistically significant discrimination (p < 0.005) occurred at 50 ms in all five trials, and the timing agreed with the onsets of the concurrent responses of the Best category.

5.4 Reinforcement of Synaptic Connectivity. Increments in the average synaptic efficacy were calculated after the learning period and after the first half of the learning period. Figure 10 shows the values for connections between neuronal populations in the second rows of the population matrices
Figure 8: Average cell responses and discrimination between the responses. (A, B) PSTH response profiles in the model simulation and in the experiment (Oram & Perrett, 1992). Response latencies of contributing cells were synchronized to 20 ms (poststimulus). Best, Mid, and Worst categories are, respectively, denoted by clear, hatched, and solid bars. (C, D) The results of statistical evaluation of response discrimination in the simulation and the experiment: F-ratio computed for each time bin across the three categories. Discrimination reaches significance (P < 0.05) within 5 ms of response onset in both the simulation and the experiment.
of cortical areas. Dashed arrows indicate connections whose increments in average synaptic efficacy were between 3% and 30% of their initial values. Solid arrows indicate connections whose average synaptic efficacy increased by more than 30% of the initial values. As learning proceeded, pathways of reinforced connections extended from area V1 to area STPa.
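The per-bin discrimination statistic used throughout section 5 (two-way ANOVA with category fixed and cell random) reduces, with one rate value per cell and category in each bin, to testing the category effect against the cell-by-category interaction. A minimal sketch under that assumption; names and array shapes are illustrative:

```python
import numpy as np

def f_ratio(rates):
    """F-ratio for one time bin. rates: (n_cells, n_categories) array of firing
    rates above S/A, one value per cell and response category (Best/Mid/Worst).
    With a single observation per cell x category, the category mean square is
    tested against the cell-by-category interaction (randomized-block design)."""
    n, k = rates.shape
    grand = rates.mean()
    ss_cat = n * ((rates.mean(axis=0) - grand) ** 2).sum()
    ss_cell = k * ((rates.mean(axis=1) - grand) ** 2).sum()
    ss_int = ((rates - grand) ** 2).sum() - ss_cat - ss_cell
    return (ss_cat / (k - 1)) / (ss_int / ((k - 1) * (n - 1)))

def f_ratio_timecourse(binned_rates):
    """One F value per PSTH bin; binned_rates: (n_cells, n_categories, n_bins)."""
    return np.array([f_ratio(binned_rates[:, :, b])
                     for b in range(binned_rates.shape[2])])
```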
Figure 9: Average concurrent responses of multiple cells and discrimination between responses. (A) Responses averaged across 50 cells for five trials. Clear bars show the Best category, hatched bars show the Mid category, and solid bars show the Worst category. Firing rate is expressed as a percentage of peak response in the Best category (300 spikes/s). (B) Statistical evaluation of response discrimination: F-ratio computed for each time bin across the three categories.
Figure 10: Reinforcement of corticocortical connections. Increments in the average synaptic efficacy after the first (A) and second (B) halves of the learning period. Circles represent neuronal populations of the second rows in the population matrices of cortical areas. Dashed arrows indicate connections whose increment in average synaptic efficacy was between 3% and 30% of their initial values. Solid arrows indicate connections whose average synaptic efficacy increased by more than 30% of the initial values.
6 Discussion

6.1 Latency and Response Discrimination. The mechanism of latency competition transmits the strongest activation first, so strong responses should occur at short latencies; unless the latency of strong responses is shorter, the hypothesis presented here does not hold. This requirement was verified in both the simulation results and the empirical data (see Figure 6). Differences in latency between response categories in the simulation were close to those in the empirical data. This suggests that the number of synaptic relays in the shortest pathways for the discrimination in the monkey STPa is nearly the same as that in the model circuit and therefore that the pathways of the monkey might be included in the ventral pathway of the temporal lobe of the model cortex.
Oram and Perrett (1992) calculated a measure of discrimination between the Best and Worst categories. The discrimination was expressed as 100 × (RB − RW)/RB, where RB and RW are the mean firing rate levels above S/A in the Best and Worst categories during the first 100 ms of the response after the estimated cell response latency. They found no correlation between the discrimination measure and the cell's response latency. The reason might be that the cell's response latency principally depends on the length of the pathways transmitting the first spikes and is not related to the above discrimination measure. Instead, our model predicts that for each cell, responses of the Best category should come earlier than those of the other categories, no matter how long the cell's response latency is. This was verified by the data shown in Figure 6.

6.2 Robustness of Populational Behavior Against Noise. Neural mechanisms must work well in the noisy environment of the brain. Our model assumes that cortical circuits encode information in the firing ratios of neuronal populations. The simulation has shown that in all the trials, the average concurrent response of neuronal populations began at the shortest latency to stimuli of the Best response category even though spontaneous activity (9 Hz) provided random spikes (see Figure 9A). This enables the mechanism of latency competition to work under random spontaneous activity. In every trial, discrimination between stimuli of the three response categories reached a statistically significant level within 5 ms of response onset (see Figure 9B).

6.3 Robustness Against Latency Variability of Input Signals. One might suspect that the mechanism of latency competition would not work correctly if input activations start after different latencies. The firing latency of retinal ganglion cells varies by up to 30 ms with changes in contrast. Robustness of the latency competition against this variability was examined for two typical cases. The first is that every neuronal population of area V1 receives projections from cells with different firing latencies. Figures 11A and 11B show the result of a simulation in which the circuit of Figure 1B receives inputs, each of which is produced by 100 cells with different firing latencies. The latencies of the first spikes are distributed in a gaussian fashion: mean ± SD = 15 ± 5 ms. Firing ratios of afferents increase not stepwise but gradually (see Figure 11A). Even in response to gradually rising inputs, the circuit allowed only the most strongly activated populations to fire (see Figure 11B). That is, the latency competition correctly detected the strongest activations.

The second case is that each population of area V1 receives projections from cells of the same latency and the latency varies from population to population. Neuronal populations of area V1 receive stepwise inputs of different latencies (see Figure 11C). If strong inputs start early, they certainly win the latency competition. The simulation therefore investigated the opposite case: weak inputs start early. In the simulation, every interval between input on-
Figure 11: Robustness of latency competition against input latency variability. The abscissa and the ordinate, respectively, indicate time and the ratios of firing neurons in the populations. (A) Input activations, each of which is produced by 100 cells with different firing latencies. Mean firing ratio tends to 0.05 for the bottom row, and limit values increase to 0.25 for the top row. (B) Responses of the circuit of cortical area shown in Figure 1B to inputs of A. (C) Input activations, each of which is produced by 100 cells with the same firing latency. The latency is the shortest for the bottom row, increases by 7 ms, and is the longest for the top row. Firing ratio is 0.05 for the bottom row and increases to 0.25 for the top row. (D) Responses of the circuit of cortical area shown in Figure 1B to inputs of C. Duration of inhibition is 4 ms. (E) Responses of the circuit of cortical area shown in Figure 1B to inputs of C. Duration of inhibition is 25 ms.
sets was 7 ms, and the weakest input began 28 ms earlier than the strongest input. Figure 11D shows the responses. The weakest input was too weak to fire any cell. The second and third weakest inputs activated some cells and laterally inhibited all the populations. At release from the inhibition,
the fourth weakest (second strongest) input had begun, and the four inputs competed. The two strongest excited their target populations and again inhibited all the populations. When the second inhibition ended, the strongest input had been on, and all the inputs participated in the competition. The result was that the two strongest inputs produced oscillatory responses. As above, the latency competition resulted in the same response that was produced by simultaneous inputs after all the inputs had been on. In the early stage, weaker inputs fired some cells. The irrelevant response would be reduced, however, if the inhibition were of long duration. If the oscillation of 40 Hz at area V1 was produced by lateral inhibition, its estimated duration would be 25 ms. Figure 11E shows a simulation result where inhibition lasts 25 ms. Although small, irrelevant responses were made in the first competition, the circuit correctly detected the strongest activations after the second competition. We see that except in the first competition, the latency competition makes relevant responses even to inputs with an onset variability of nearly 30 ms.

In the discussion so far, input activations were not oscillatory. This is plausible for inputs to area V1, but activation of the subsequent areas may be oscillatory. We will see here that neuronal populations receive nonoscillatory inputs even if neuronal populations of the preceding area deliver oscillatory activations. Every neuronal population receives projections from many populations in the preceding area (see Figure 12A). Therefore, the activation the population receives is a superposition of activations from the multiple populations. Suppose each of the activations is oscillatory. The intervals of the oscillation are determined by the interval of inhibition. If we assume that the interval differs among the inhibitory neuronal populations, the superposition of oscillatory activations becomes less oscillatory. Figure 12B shows a superposition of three oscillatory activations. It is less oscillatory than the original three. A superposition of more activations is expected to be close to a step input.

The discussion also suggests a possible way of integrating signals conveyed by different pathways. The signals could arrive after different latencies at areas where they are integrated. For example, auditory signals arrive at the STS cells about 100 ms earlier than visual signals (Bruce, Desimone, & Gross, 1981). Despite the latency difference, animals can integrate those signals to respond in a few hundred milliseconds. Suppose that a cortical area receives input along two cortical pathways (see Figure 12C) and that the first pathway delivers activation earlier than the second pathway. Also assume that the first pathway activates the two neuronal populations more strongly than the other three populations, but activation of both pathways provides more synaptic input to the latter three populations than to the former two. Figure 12D shows the behavior of the model cortical area. The responses of the former and the latter populations are, respectively, shown in the top two rows and the bottom three rows. Before activation of the second pathway, the two neuronal populations strongly activated by
Figure 12: Integration of signals from different cortical pathways. (A) Corticocortical connections in a single pathway. Three activated neuronal populations project to stellate cells of a population in the next area. (B) Oscillatory activations of the three corticocortical projections in A (top three rows) and the superposition of them (bottom row). The refractory periods of inhibitory neurons in the model cortical areas differ from neuron to neuron; the distribution of lengths of the refractory periods is gaussian (mean ± SD = 10.0 ± 3.33 ms). (C) Two cortical pathways converging at a cortical area. The first pathway activates the two neuronal populations more strongly than the other three populations, but activation of both the pathways provides more synaptic input to the latter three populations than to the former two. (D) Responses of neuronal populations of a cortical area that receives activations of two pathways at different onset times. Open and solid triangles, respectively, indicate onset times of activations of the first and the second pathways. The responses of the populations strongly activated by the first pathway are shown in the top two rows, and the responses of the populations strongly activated under activation of both the pathways are shown in the bottom three rows.
the first pathway won the competition and made oscillatory responses. After the second pathway also began to fire, the other three populations won the competition and produced oscillatory firing. Again, the mechanism of latency competition correctly detected the strong activation from the integrated signals, except in the first few competitions. If the motor system needs to accumulate some neural activation in order to move, the irrelevant responses of the first competitions could be neglected, and a relevant motor response would be made.

From the above considerations, we see that the latency competition is not only performed by the first spikes but is repeated in oscillation by the subsequent spikes. This makes the competition mechanism robust against some latency variability of neural inputs.

6.4 Feedback Connections. There are many feedback connections in the cortical circuitry, though the model presented here does not include them. This study focused on processing in the millisecond time range, which biological time constraints suggest should be performed without using feedback connections (Oram & Perrett, 1994). They might, however, accelerate processing. Auditory signals activate cells of the STS earlier than visual signals, and feedback connections could transmit this activation to earlier stages of the visual pathways such as the AIT and MST. This might increase the excitability of those areas and make them respond faster to visual stimuli. It is expected that feedback connections play different roles in different aspects of cortical processing. How the mechanism of latency competition might cooperate with other cortical mechanisms is an issue for future research.

Appendix A

Suppose $x(t) = x_0$ for $t < 0$ and $x(t) = x$ for $t \geq 0$. Let $A(x(t)) = \sum_k (W_k x(t) + 1) G^o_k / C$ and $V(x(t)) = \sum_k (W_k x(t) + 1) G^o_k E_k / (C A(x(t)))$. Assuming $V$ is constant for $t < 0$, we obtain from equation 2.1

$$V = \begin{cases} V(x_0) & \text{if } t < 0 \\ V(x) + (V(x_0) - V(x)) \exp(-A(x)t) & \text{if } t \geq 0 \end{cases} \tag{A.1}$$

Let $T_0$ be the time interval required for the neuron to fire the first time. Since the synaptic input is excitatory, $V(x) > h$. Setting $V = h$ in equation A.1 yields

$$T_0 = A(x)^{-1} \log \frac{V(x) - V(x_0)}{V(x) - h}. \tag{A.2}$$
Let $T$ be the time interval required to fire after the first firing. It is given by

$$T = A(x)^{-1} \log \frac{V(x) - V_o}{V(x) - h}. \tag{A.3}$$
Neural Processing in the Subsecond Time Range
591
It follows that the $u$th firing begins at

$$t_u = T_0 + (u - 1)(\tau_h + \tau_r + T), \qquad u = 1, 2, \ldots \tag{A.4}$$
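Equations A.1–A.4 are directly computable. The following is a minimal numerical sketch, assuming a suprathreshold excitatory step input; the default parameter values echo appendix C but are otherwise illustrative:

```python
import numpy as np

def firing_times(W, G, E, x, x0=0.0, C=1.0, h=-40.0, V_o=-70.0,
                 tau_h=1.0, tau_r=4.0, t_max=250.0):
    """Firing onsets t_u (equation A.4) for a step input x(t): x0 -> x at t = 0,
    using T0 (equation A.2) and T (equation A.3). W, G, E are arrays of the
    synaptic coefficients W_k, resting conductances G_k^o, and electromotive
    forces E_k. Assumes the input is excitatory and suprathreshold, so
    V(x) > h, as in the text."""
    def A(xv):
        return np.sum((W * xv + 1.0) * G) / C
    def V(xv):
        return np.sum((W * xv + 1.0) * G * E) / (C * A(xv))
    T0 = np.log((V(x) - V(x0)) / (V(x) - h)) / A(x)   # eq. A.2
    T = np.log((V(x) - V_o) / (V(x) - h)) / A(x)      # eq. A.3
    times, t = [], T0
    while t <= t_max:
        times.append(t)                                # eq. A.4
        t += tau_h + tau_r + T
    return np.array(times)
```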
Suppose that the neuron is producing the $u$th spike at time $t$; that is, $t_u \leq t$ and $t_u > \max[0, t - \tau_h]$. Equations A.2, A.3, and A.4 indicate that $t_u$ is a function of $(W_k)$. Condition 1 reduces $\sum_k W_k x(t)$ to $W_{k_a} x(t)$ in $V(x(t))$ and $A(x(t))$. It follows that $t_u$ depends only on $W_{k_a}$ among the $W_k$. By counting neurons that are producing their $u$th spikes at time $t$, we can calculate the ratio of firing neurons in the population, which is denoted by $y(t)$:

$$y(t) = \sum_{u=1}^{u_e - 1} \int_{\{W_{k_a} \mid t_u \leq t,\ t_u \geq t - \tau_h\}} f(W_{k_a})\, dW_{k_a} + \int_{\{W_{k_a} \mid t_{u_e} \leq t,\ t_{u_e} \geq \max[0,\, t - \tau_h]\}} f(W_{k_a})\, dW_{k_a}, \tag{A.5}$$
where $f(W_{k_a})$ denotes the probability density of neurons receiving connections with coefficient $W_{k_a}$ in the populations, and $u_e$ is the maximum integer not more than $t/(\tau_h + \tau_r + T) + 1$. Let $y_1(t)$ and $f_1(W_{k_a})$ denote the values of $y(t)$ and $f(W_{k_a})$ when $W_{k_a}$ is larger by $\Delta W_{k_a}$. Condition 2 implies that $f_1(W_{k_a}) = f(W_{k_a} - \Delta W_{k_a})$. Assuming that $\Delta W_{k_a}$ is small, we have

$$y_1(t) \simeq y(t) - \Delta W_{k_a} \frac{dy(t)}{dW_{k_a}} = y(t) - \Delta W_{k_a} \frac{dy(t)}{dt} \frac{dt}{dt_u} \frac{dt_u}{dW_{k_a}}. \tag{A.6}$$
Let us define the response latency of neuronal populations, denoted by $t^*_1$, as the time when the value of $y(t)$ reaches a certain value $y^*$ for the first time. Since $y(t)$ is not decreasing at $t^*_1$, $dy(t)/dt \geq 0$ at $t^*_1$, and $dt/dt_u = 1$. From equation A.4, $dt_u/dW_{k_a} = dT_0/dW_{k_a} + (u - 1)\, dT/dW_{k_a}$, where

$$\frac{dT_0}{dW_{k_a}} = \frac{-1}{G^o_{k_a} A^2(x)} \left[ \frac{G^o_{k_a} (h - V(x_0))(E_{k_a} - V(x))}{C (V(x) - V(x_0))(V(x) - h)} + \frac{G^o_{k_a} A(x)(E_{k_a} - V(x_0))}{C A(x_0)(V(x) - V(x_0))} + \frac{G^o_{k_a}}{C} \log \frac{V(x) - V(x_0)}{V(x) - h} \right]. \tag{A.7}$$
Since the afferents are excitatory, $E_{k_a} \geq V(x) \geq h > V(x_0)$. It follows from equation A.7 that $dT_0/dW_{k_a} < 0$. Similarly, $dT/dW_{k_a} < 0$ because $V(x_0) \geq V_o$. From these, we have $dt_u/dW_{k_a} < 0$. It follows from equation A.6 that $y_1(t^*_1) \geq y(t^*_1)$. Consequently, the value of $y_1(t)$ reaches $y^*$ not later than $t^*_1$. Let $t^*_2$ denote the time when the value of $y_1(t)$ reaches $y^*$ for the first time.
$t^*_2 \leq t^*_1$. Let $y_2(t)$ and $f_2(W_{k_a})$ denote the values of $y(t)$ and $f(W_{k_a})$ when $W_{k_a}$ is larger by $2\Delta W_{k_a}$. The same consideration results in $y_2(t)$ reaching $y^*$ not later than $t^*_2$ and so not later than $t^*_1$. Similarly, whenever $W_{k_a}$ is larger by any value, the value of $y(t)$ with the larger $W_{k_a}$ reaches $y^*$ not later than $t^*_1$.

Appendix B

Theorem 1 ensures that neuronal populations receiving the strongest activation produce spikes first. The first spikes excite the inhibitory interneurons if the $W_{k_a}$ of the connections with the inhibitory neurons is larger than the following value. Consider the weakest activation, that is, that only one spike is delivered. Then the input is $x = 1/n$, where $n$ denotes the number of neurons of the projecting population. If $T$ given by equation A.3 is shorter than the spike duration $\tau_s$, the neuron is certain to fire. Let $T(1/n) = A(1/n)^{-1} \log [V(1/n) - V_o]/[V(1/n) - h]$. $T(1/n)$ decreases as $W_{k_a}$ increases (see appendix A). It follows that there is some value of $W_{k_a}$ that makes $T(1/n) = \tau_s$. If $W_{k_a}$ is larger than that value, the inhibitory neuron is certain to fire. If the input stays larger than $x$ for a period of $\tau_e$, the condition $T(1/n) = \tau_s$ is replaced by $T(x) = \tau_e$.

Assume the inhibitory spikes act only on the $k_i$th ion conductance of pyramidal cells. This appendix will show that firing of the inhibitory neurons reduces the membrane potential of pyramidal cells close to the inhibitory electromotive force $E_{k_i}$ ($< V_o$). Let $V_1$ denote the value of $V$ at the end of the first inhibitory spike. From equation A.1,

$$V_1 = V(\hat{x}_{k_a}, 1/n_i) + (V^*_1 - V(\hat{x}_{k_a}, 1/n_i)) \exp(-A(\hat{x}_{k_a}, 1/n_i)\tau_s), \tag{B.1}$$

where $V(\hat{x}_{k_a}, 1/n_i)$ and $A(\hat{x}_{k_a}, 1/n_i)$, respectively, denote $\sum_k (W_k x_k + 1) G^o_k E_k / (C A(x_{k_a}, x_{k_i}))$ and $\sum_k (W_k x_k + 1) G^o_k / C$, where $x_{k_a} = \hat{x}_{k_a}$, $x_{k_i} = 1/n_i$, and $x_k = 0$ for the other $k$, and where $V^*_1$ denotes the value of $V$ at the onset of the first spike. Then

$$\frac{dV_1}{dW_{k_i}} = \frac{G^o_{k_i}(E_{k_i} - V(\hat{x}_{k_a}, 1/n_i))}{n_i C A(\hat{x}_{k_a}, 1/n_i)} \left(1 - \exp(-A(\hat{x}_{k_a}, 1/n_i)\tau_s)\right) + (V(\hat{x}_{k_a}, 1/n_i) - V^*_1) \frac{G^o_{k_i}}{n_i C}\, \tau_s \exp(-A(\hat{x}_{k_a}, 1/n_i)\tau_s). \tag{B.2}$$
The $k_i$th input being inhibitory means that $V(\hat{x}_{k_a}, 1/n_i)$ decreases to $E_{k_i}$ as $W_{k_i}$ increases, and therefore $E_{k_i} < V(\hat{x}_{k_a}, 1/n_i)$. If $V^*_1 > E_{k_i}$, there is some value of $W_{k_i}$ such that $V(\hat{x}_{k_a}, 1/n_i) < V^*_1$. It follows from these that there is some value of $W_{k_i}$ such that $dV_1/dW_{k_i} < 0$. Consequently, if $W_{k_i}$ is above that value, $V_1$ decreases as $W_{k_i}$ increases. Let $V^*_2$ denote the value of $V$ at the onset of the second spike. From equation A.1, we have

$$V^*_2 \leq V(\hat{x}_{k_a}, 0) + (V_1 - V(\hat{x}_{k_a}, 0)) \exp(-A(\hat{x}_{k_a}, 0)(\tau_i - \tau_s)). \tag{B.3}$$
Only $V_1$ depends on $W_{k_i}$ in equation B.3, and it decreases as $W_{k_i}$ increases. There is some value of $W_{k_i}$ such that $V^*_2 < V^*_1$. That value ensures that the neuron never fires before the onset of the second spike. Similarly, let $V_j$ and $V^*_j$, respectively, denote the values of $V$ at the ends and the onsets of the $j$th spikes, $j = 1, 2, \ldots, n_t$. There are some values of $W_{k_i}$ such that $V_j$ decreases as $W_{k_i}$ increases and $V_j > V^*_{j+1}$. And from equation B.1, we know that $V_j$ tends to $E_{k_i}$ as $W_{k_i}$ increases. It follows that there is some value of $W_{k_i}$ such that the neuron does not fire during the spike train and $V_{n_t}$ is close to $E_{k_i}$ ($< V_o$).

From the assumption that the length of the spike trains, $n_t \times \tau_i + \tau_s$, is longer than the refractory period of pyramidal cells, $\tau_r$, it follows that at release from the inhibition, pyramidal cells that produced spikes have recovered from the refractory state. Since those cells receive inhibition after the recovery, their membrane potentials are not more than the resting potential $V_o$. And they are larger than those of the unfired cells because the fired cells are assumed to neglect any input (including inhibitory input) during the refractory period and so do not receive inhibition for as long as the unfired cells. After release from the inhibition, the second competition takes place, in which the previously fired cells fire more quickly than the previously unfired cells. Again, only the most strongly activated populations are fired.

Appendix C

The model neuron is assumed to have Na+, K+, and Cl− channels. Values of $G^o_k$ are 0.888888 (MΩ cm²)⁻¹ for Na+, 44.4444 (MΩ cm²)⁻¹ for K+, and 154.666 (MΩ cm²)⁻¹ for Cl−. Values of $E_k$ are 60.0 mV for Na+, −90.0 mV for K+, and −65.0 mV for Cl−. The initial threshold potential for firing, $h_o$, is −40.0 mV. $\Delta h$ = 0.1 mV/spike. $C_h = 6.0 \times 10^{-4}$. $V_o$ = −70.0 mV. $C$ = 1.0 µF. $\tau_s$ = 1.0 ms. At every firing, excitatory neurons produce a single spike, and inhibitory neurons produce a train of spikes (three spikes in this simulation). The interspike interval in the train is 1.5 ms. $\tau_r$ = 4.0 ms for excitatory neurons, and the $\tau_r$ of inhibitory neurons varies from neuron to neuron with a normal distribution. The mean is 10.0 ms, and the standard deviation (SD) is one-third of the mean. The probability that the variable falls below the mean − 3 × SD is 0.2%, and each neuronal population has 100 neurons. It follows that the neuronal populations rarely include neurons with negative $\tau_r$.
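For reference, the constants just listed can be gathered into a single configuration object. This is only a restatement of appendix C in code form; the dictionary layout and key names are choices made here, not part of the article.

```python
# Appendix C constants restated as a configuration dictionary (units as in
# the text). Key names are chosen here for readability.
MODEL_PARAMS = {
    "G_o": {"Na": 0.888888, "K": 44.4444, "Cl": 154.666},  # (MOhm cm^2)^-1
    "E":   {"Na": 60.0, "K": -90.0, "Cl": -65.0},          # mV
    "h_o": -40.0,             # initial firing threshold, mV
    "delta_h": 0.1,           # threshold increment, mV per spike
    "C_h": 6.0e-4,            # threshold-decay coefficient
    "V_o": -70.0,             # resting potential, mV
    "C": 1.0,                 # membrane capacitance, uF
    "tau_s": 1.0,             # spike duration, ms
    "inh_spikes_per_train": 3,
    "inh_interspike_ms": 1.5,
    "tau_r_excitatory": 4.0,                                # refractory, ms
    "tau_r_inhibitory": {"mean": 10.0, "sd": 10.0 / 3.0},   # normal, ms
    "neurons_per_population": 100,
}
```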
Synaptic efficacy should differ from neuron to neuron, so the values of $W_k$ were assumed to be normally distributed. The mean values were chosen so that neurons would fire at physiologically plausible rates. The means were 2500 for projections of stellate cells to pyramidal cells and for projections of pyramidal cells to inhibitory cells. Projections with synaptic efficacies of those mean values fire about 3% of the neurons in their target populations when the projections are activated at the ratio $x(t) = 0.03$. Lateral connections of inhibitory neurons diverge exponentially: the means of $W_k$ decline at a rate of $\exp(-C_i(l_x^2 + l_y^2))$, where $C_i$ is a coefficient, $l_x$ is the distance between the projecting population and the target population on the x-axis, and $l_y$ is that distance on the y-axis. The mean of $W_k$ of the lateral connections is 2 at $l_x = l_y = 0$, and $C_i = 0.5$. Corticocortical connections between areas were initially so weak that no cortical cells would respond to any activation of area V1. They also diverge exponentially, at a rate given by $\exp(-C_c(l_x^2 + l_y^2))$, where $C_c = 1.0$ and the mean of $W_k$ of the corticocortical connections is 250 at $l_x = l_y = 0$. Projections to area V1 are strong enough to fire stellate cells of area V1: the means of $W_k$ are 3000, and all SDs are one-third of the means. The $\hat{W}_k$ are two times as large as the means. The time needed for spikes to travel along the connections is 0.2 ms inside the cortical areas, 6.0 ms between sequential cortical areas, and 9.0 ms along the shortcut connections. The coefficient of reinforcement is $C_r = 10.0$, and the coefficient of decay is $C_d = 2.43 \times 10^{-5}$. The upper limit of synaptic efficacy $\hat{W}_k$ is twice the initial mean value.

Acknowledgments

I thank M. W. Oram and D. I. Perrett for sending copies of their empirical data. This research was supported in part by Grants in Aid for Scientific Research, Ministry of Education, Science and Culture of Japan, and the Research for the Future Program, Japan Society for the Promotion of Science.

References

Abeles, M., Vaadia, E., Bergman, H., Prut, Y., Haalman, I., & Slovin, H. (1993). Dynamics of neuronal interactions in the frontal cortex of behaving monkeys. Concepts in Neurosci., 4, 131–158.
Bruce, C., Desimone, R., & Gross, C. G. (1981). Visual properties of neurons in a polysensory area in superior temporal sulcus of the Macaque. J. Neurophysiol., 46, 369–384.
Coultrip, R., Granger, R., & Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5, 47–54.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern., 36, 193–202.
Gordon, B., Allen, E. E., & Trombley, P. Q. (1988). The role of norepinephrine in
plasticity of visual cortex. Prog. Neurobiol., 30, 171–191.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognition, 11, 23–63.
Kalaska, J. F., & Crammond, D. J. (1992). Cerebral cortical mechanisms of reaching movements. Science, 255, 1517–1523.
Kang, Y., Kaneko, T., Ohnishi, H., Endo, K., & Araki, T. (1994). Spatiotemporally differential inhibition of pyramidal cells in the cat motor cortex. J. Neurophysiol., 71, 280–293.
Kalil, R. E., Dubin, M. W., Scott, G., & Stark, L. A. (1986). Elimination of action potentials blocks the structural development of retino-geniculate synapses. Nature, 323, 156–158.
Kawaguchi, Y. (1995). Physiological subgroups of nonpyramidal cells with specific morphological characteristics in layer II/III of rat frontal cortex. J. Neurosci., 15, 2638–2655.
Kobatake, E., Tanaka, K., Wang, G., & Tamori, Y. (1993). Effects of adult learning on the stimulus selectivity of cells in the inferotemporal cortex. Soc. Neurosci. Abstr., 19, 975.
Miyashita, Y. (1990). Associative representation of the visual long-term memory in the neurons of the primate temporal cortex. In E. Iwai & M. Mishkin (Eds.), Vision, memory and the temporal lobe (pp. 75–87). New York: Elsevier.
Miyashita, Y., & Chang, H.-S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Nakamura, K. (1992). Neuropopulational mechanism of parallel cortical processing in the millisecond range. Soc. Neurosci. Abstr., 18, 1210.
Nakamura, K. (1993a). Temporal competition as an optimal parallel processing of the cerebrohypothalamic system. Proc. IEEE Int. Conf. on Neural Networks, 1, 64–69.
Nakamura, K. (1993b). A theory of cerebral learning regulated by the reward system I: Hypotheses and mathematical description. Biol. Cybern., 68, 491–498.
Nakamura, K., & Ichikawa, A. (1989). Timing information in transient behavior of neuropopulations. IEEE Trans. on Syst., Man, and Cybern., 19, 1188–1193.
Opara, R., & Wörgötter, F. (1996). Using visual latencies to improve image segmentation. Neural Computation, 8, 1493–1520.
Oram, M. W., & Perrett, D. I. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68, 70–84.
Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7, 945–972.
Rumelhart, D. E., & Zipser, D. (1986). Feature discovery by competitive learning. In J. L. McClelland, D. E. Rumelhart, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition: Foundations (Vol. 1, pp. 151–193). Cambridge, MA: MIT Press.
Sakai, K., & Miyashita, Y. (1991). Neural organization for the long-term memory of paired associates. Nature, 354, 152–155.

Received October 15, 1996; accepted July 15, 1997.
LETTER
Communicated by Sebastian Seung and Miguel Nicolelis
Temporal-Code to Rate-Code Conversion by Neuronal Phase-Locked Loops
Ehud Ahissar
Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel
Peripheral sensory activity follows the temporal structure of input signals. Central sensory processing also uses rate coding, and motor outputs appear to be primarily encoded by rate. I propose here a simple, efficient structure that converts temporal coding to rate coding by neuronal phase-locked loops (PLL). The simplest form of a PLL includes a phase detector (that is, a neuronally plausible version of an ideal coincidence detector) and a controllable local oscillator that are connected in a negative feedback loop. The phase detector compares the firing times of the local oscillator and the input and provides an output whose firing rate is monotonically related to the time difference. The output rate is fed back to the local oscillator and forces it to phase-lock to the input. Every temporal interval at the input is associated with a specific pair of output rate and time difference values; the higher the output rate, the further the local oscillator is driven from its intrinsic frequency. Sequences of input intervals, which by definition encode input information, are thus represented by sequences of firing rates at the PLL's output. The most plausible implementation of PLL circuits is by thalamocortical loops in which populations of thalamic “relay” neurons function as phase detectors that compare the timings of cortical oscillators and sensory signals. The output in this case is encoded by the thalamic population rate. This article presents and analyzes the algorithmic and implementation levels of the proposed PLL model and describes the application of the PLL model to the primate tactile system.

1 Introduction

The distinction between rate and temporal coding is not always clear (Theunissen & Miller, 1995). For example, temporal coding is sometimes regarded as rate coding with a fine time resolution. In this article, temporal coding will refer to coding in which the exact time of every spike is informative. Rate coding will be associated here with a temporal window, the rate bin, within which the exact temporal information is not informative and the information is carried by the average firing rate over the entire temporal window. The rate bin is usually determined by the integration times of the readout mechanisms. A rate-encoded signal can thus be described by a series of numbers, each of which represents the average firing rate in a
single rate bin (see appendix A.1). Fluctuations in the average firing rate of a neuron over different rate bins are considered here as fluctuations of rate-encoded information, and not as temporal coding, as has been considered previously (Richmond & Optican, 1987). A temporally encoded signal is described by a series of numbers, each of which represents either the timing of a single spike or a single interspike interval (ISI; see appendix A.1). The information contained in the spiking times can be presented in different ways; two of them are depicted in Figure 1: M(n) describes the deviations of the actual train from an imaginary, ideally periodic, “carrier” train, and I(n) describes the ISIs. Figure 1 also demonstrates the distinction between temporal and rate coding; the spike train in this example carries a significant amount of information if temporal coding is assumed (see Figure 1a), but almost no information if rate coding with a particular rate bin is assumed (see Figure 1b). Practically, this distinction is important for reading out the information of the spike train. A readout mechanism based on rate will lose more and more information as its integration time increases. To read out temporally encoded information, a rate-based mechanism needs to employ integration times shorter than half of the input temporal resolution, an implementation that is both inefficient and, with fine input resolution, not practical for neurons. The other alternative is to utilize preprocessing by time-sensitive mechanisms—mechanisms that produce populations of spikes, where the number of spikes in a population directly represents the ISI at the input and the exact times of these output spikes are not important.

Another important distinction is between peripheral and central temporal encodings (Perkel & Bullock, 1968). When a stimulus is temporally encoded at the periphery, the peripheral ISIs directly describe stimulus features such as spatial periods (Darian-Smith & Oke, 1980), whereas when stimuli are temporally encoded centrally, the centrally generated temporal structures are usually not directly related to the stimulus (Engel, Konig, Kreiter, Schillen, & Singer, 1992; Abeles, Bergman, Margalit, & Vaadia, 1993). This article deals with central decoding of peripherally generated temporal encodings. Using the term decoding in neuronal contexts should not imply that the original signals are reconstructed, but rather that the encoded information is extracted (Perkel & Bullock, 1968).

In mammals, sensory information is encoded by both rate and temporal coding (Sejnowski, 1995; Carr, 1993; Middlebrooks & Green, 1991; Wang, Merzenich, Beitel, & Schreiner, 1995; Johansson & Vallbo, 1983; Berkley, 1978). Whereas spatial static information is usually encoded by rate, dynamic information, generated during movements of either the stimulus or the sensory organ, is encoded also by temporal cues (see, for example, encoding of spatial intervals by ISIs of tactile [Darian-Smith & Oke, 1980] and visual [Shadlen & Newsome, 1994] neurons). In contrast, motor control is assumed to utilize rate coding predominantly (Georgopoulos, 1986; Fetz, 1993; Wise, 1993), even at the early stage of motor planning (Boussaoud & Wise, 1993). Thus, information carried by the sensory temporal
Figure 1: Encoding by spike trains. (a) Temporal encoding. Two possible presentations of the information contained in the spiking times: a series of ISIs [I(n)] and a series of absolute modulations [M(n)]. M(n) is the difference between the actual timing of the nth spike and the timing expected by an imaginary, ideally periodic, “carrier” train that has the same average periodicity and no modulation (see appendix A.1). (b) Rate encoding. The spike train is divided into several rate bins (four in this case), and the total number of spikes in each bin determines the value of the rate function r(t) for that bin. Insets: S(t0), a spike triggered at t0 = 0; R(t0), a step function with a unity gain and a duration of one rate bin, Tr.
components is probably translated, by neuronal circuits in the brain, to rate-encoded signals that are “readable” by the motor system. If such a translation occurs early in a sensory pathway, it would also facilitate the integration of temporally encoded information with other, rate-encoded sensory information. This necessity for translation was elegantly demonstrated by Mountcastle and his colleagues (Talbot, Darian-Smith, Ko-
rnhuber, & Mountcastle, 1968; Mountcastle, 1993) over the past few decades.

A mechanism that utilizes neuronal delay lines to transform temporal coding to rate coding was suggested by Jeffress (1948). Such delay lines exist in the electric sensory system of electric fishes and in the subcortical auditory systems of birds and mammals (reviewed in Carr, 1993). These delay lines are probably utilized to decode temporal disparities, which in the submillisecond and low millisecond ranges can determine interaural time differences and echo delays, respectively. As the delay increases above a few hundred microseconds, implementations of delay lines require multiple neuronal elements, and the accuracy decreases (Carr, 1993). A mechanism that uses synaptic time constants appears more suitable for decoding temporally encoded information in the millisecond range (Buonomano & Merzenich, 1995). Both of these mechanisms describe “passive,” open-loop decoding schemes that are based on classification of different ISIs according to their interaction with predetermined neuronal temporal features.

In this article, I suggest an “active,” closed-loop decoding mechanism, which dynamically adapts its working parameters to match the incoming signal. This phase-locked loop (PLL) model is based on a local oscillator that “measures” the instantaneous temporal period of the input by comparing it to its own period. During decoding, the local oscillator updates its period according to the result of the comparison, such that it remains tuned to the changing input. The PLL is a well-known mechanism in electrical engineering, where it is often used for the decoding of phase-modulated signals. The algorithm presented here was adopted from that of continuous-time electronic PLLs (Gardner, 1979) and modified to fit discrete-time neuronal PLLs. This approach uses a small neuronal network as a PLL, unlike a previous approach that described a single neuron as a PLL (Hoppensteadt, 1986). Neuronal PLLs appear suitable for decoding temporally encoded information in the range of a few to a few hundred milliseconds. However, decoding by a single neuronal PLL is usually limited to phase modulations that are on the order of its intrinsic period. Thus, decoding in different frequency ranges requires different PLL circuits, and decoding of large modulations requires an ensemble of several PLLs. In this article, the mechanism of a single PLL is described in detail, whereas only the principles of operation are described for the postulated ensemble.

Depending on the parameters of a stimulus, sensory firing could take different temporal forms. For example, the peripheral tactile response to a moving grating can be one spike per bar or a burst of variable length per bar (Darian-Smith & Oke, 1980; Morley & Goodwin, 1987). For clarity, simple temporal forms will be assumed here. Sensory firing with bursts does not affect the principles of decoding described here (S. Serulnik and E. Ahissar, unpublished observations), although it affects the decoding details.
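The encodings of Figure 1 are easy to state in code. In this sketch, the “carrier” of M(n) is anchored at the first spike, which is one of several possible conventions; the names and conventions are assumptions of the sketch, not of the article.

```python
import numpy as np

def temporal_code(spike_times, carrier_period):
    """I(n): interspike intervals; M(n): deviation of each spike from an
    ideally periodic 'carrier' train with the same average periodicity.
    Here the carrier is anchored at the first spike (an assumed convention)."""
    spike_times = np.asarray(spike_times, dtype=float)
    I = np.diff(spike_times)
    carrier = spike_times[0] + carrier_period * np.arange(len(spike_times))
    M = spike_times - carrier
    return I, M

def rate_code(spike_times, rate_bin, t_end):
    """r(t): average firing rate within each rate bin of width Tr = rate_bin."""
    edges = np.arange(0.0, t_end + rate_bin, rate_bin)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts / rate_bin

# Example: the same train described temporally (exact ISIs) and by rate bins.
train = [0.0, 31.0, 58.0, 92.0, 121.0]               # spike times, ms
isis, mods = temporal_code(train, carrier_period=30.0)
rates = rate_code(train, rate_bin=60.0, t_end=180.0)
```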
2 The PLL Model

2.1 The PLL Algorithm. The simplest version of a first-order PLL (Gardner, 1979) is adopted here. A first-order neuronal PLL is composed of two elements (see Figure 2a): a phase detector (PD) and a rate-controlled oscillator (RCO). The RCO is a local oscillator whose output frequency, and thus the timing of its output spikes, is controlled by the firing rate of its input. If the input is zero, the RCO will fire at its intrinsic frequency. The more excitatory the input, the higher the RCO's output frequency, and the more inhibitory the input, the lower the RCO's frequency. The PD compares the phase—the time of arrival—of each of the spikes of a repetitive input against the phase of the RCO spikes and produces an output that is a “measure” of (i.e., its firing rate is proportional to) that phase difference. The RCO can be regarded as a rate-to-temporal code converter and the PD as a temporal-to-rate converter. The PD's output (Rd) is fed into the RCO's input and changes the RCO's firing phase in the direction that will cancel the phase difference (in fact, cancel any deviation from some constant phase difference), establishing a negative feedback loop (see appendix A.2). Note that in the following description, phase difference and temporal difference are interchangeable terms, both expressed in time units.

The PLL is considered locked when the RCO's instantaneous frequency equals the input's instantaneous frequency. The phase difference in the locked state depends on the difference between the input frequency and the RCO's intrinsic frequency (see appendices A.3 and A.4). While locked, the RCO generates one and only one oscillation cycle for each input cycle. For simplicity, assume that a single spike represents a single cycle, even though a short burst or an ensemble of single spikes over a cell population is also possible. In the absence of noise and with ideal PLLs (see appendix A.3), the RCO's output spike train is a perfect replica of the input spike train, but with a delay of one cycle plus a constant phase shift. This is because, with ideal PLLs, any deviation of the input from the expected ISI is followed by an identical deviation of the RCO's ISI at the next cycle. The decoding (or recoding) of the input information is based on this delayed internal replica of the input spike train. As long as the PLL is locked, the RCO's ISI has to be modulated by the same information that modulates the input ISI. Thus, the input information is represented by the rate-encoded signal that drives the RCO (see appendix A.4). This signal is the PLL's output.

The same decoding mechanism can be described differently, at least for ideal PLLs: every input ISI is “stored” as the RCO's next ISI. Thus, at each cycle, an input ISI is compared with the input's previous ISI, and the change, which is the encoded information, is detected by the PD and presented as a rate-encoded signal (see appendix A.4). While the PD's (= PLL's) output affects the RCO at every cycle, it can be integrated over several cycles by a potential readout mechanism. The readout integration time, or rate bin, determines the maximal rate of information that can be represented inter-
Figure 2: The first-order neuronal PLL algorithm. (a) Schematic diagram. The output of the phase detector (PD) is proportional to the difference between the timing of its two inputs. The output frequency of the rate-controlled oscillator (RCO) is modulated by the firing rate of its input (see loop equations at bottom and appendix A.2). (b) Schematic examples of transfer functions. For every Ii within the working range, Rd(n + 1) increases as a function of the timing difference [ηo(n) − ηi(n)], and the timing difference decreases, via decrement of Io(n), and thus of ηo(n), as a function of Rd(n). The crossing (working) point is approximately (exactly for linear systems) at (Δφ, Rdc).
nally; higher rates require shorter integration periods. Note that the phase of the input is not lost but rather preserved by the firing phase of the PLL's output, which is phase-locked to the input (see, for example, Figure 6b). Thus, the output of the PLL is a rate-encoded signal proportional to the difference between the RCO's intrinsic period and the input's instantaneous ISI. This signal can be decomposed into two components: a DC component
(Rdc), which represents the difference between the RCO's intrinsic period and the average input ISI, and an AC component (Rac), which represents the dynamic input information (see appendices A.2 and A.4).

An ideal PLL should be able to track any change in the input ISI within one cycle. Practical PLLs, however, are limited in both their working ranges—the ranges of trackable input frequencies (see appendix A.2.3)—and their lock-in times—the time required for moving into a new locked state. The lock-in dynamics, which is mainly determined by the loop gain (see appendix A.2.3), limits the maximal rate of change in the input frequency that a given PLL can track and decode. PLLs of higher than first order have low-pass filters between the PD and the RCO. Such filters improve the loop performance, especially in noisy conditions (Gardner, 1979; Viterbi, 1966). Low-pass filtering, also referred to as input integration, is an elementary feature of nerve cells. It is assumed that the RCO uses such filtering in its input stage. However, for simplicity, higher-order circuits will not be discussed here, since the first-order version is sufficient for code translation.

2.2 Implementations of PLLs. There are two main families of PLL implementations: excitatory PLLs (ePLLs) are those implementations in which the PD excites the RCO (see Figure 3a, dashed lines), and inhibitory PLLs (iPLLs) are those in which the PD inhibits the RCO (see Figure 3a, dotted lines). Here, only two specific implementations of these families—the AND-like and NAND-like implementations—are described in detail. Thus, ePLL will refer to an AND-like and iPLL to a NAND-like implementation, unless otherwise noted. In the following descriptions, only essential components are included, a case that probably does not occur in the brain. The described implementations should thus be regarded as building blocks that can be used separately or in combination in the brain. Accordingly, the principles of, not the exact, operation of such hypothetical PLL circuits are discussed.

The basic ePLL is a straightforward implementation of the PLL algorithm (see Figure 2) and involves two sets of neurons: the PD and RCO sets (see Figure 3a). The basic iPLL, in addition to these two sets of neurons, involves a set of inhibitory cells (INH). In both the ePLL and the iPLL, every component is implemented by a set of neurons similar to each other. These sets of neurons are interconnected by “diverging/converging” pathways: every cell in the projecting set sends axons to many cells in the target set, and every cell in the target set receives synapses from many cells in the projecting set. The set of RCOs of a given PLL is regarded as a set of coupled oscillators that oscillate at the same frequency. The redundancy of the RCO and INH cells has no specific role in the presented implementations beyond improving robustness. However, the efficiency of phase detection by a PD composed of a population of cells is significantly better than that of a single coincidence detector. The number of coincidence-detecting neurons that
Figure 3: Possible implementations of a single PLL. (a) Connection schemes. For clarity, the width (w) was arbitrarily set at 7. Dotted lines indicate inhibitory PLL (iPLL); dashed lines, excitatory PLL (ePLL); and INH, inhibitory neurons. (b) Schematics of two possible PLL implementations where the PD operates in an AND-like manner.
compose the PD set defines the “width” (w) of a single PLL. Arbitrarily, it is assumed that the other neuronal sets (INH and RCO) have the same width. A reasonable estimate of the minimal value of w can be derived from the number of peripheral fibers activated by a “point” stimulus. In the tactile case, for example, this number is around 20 (Johansson & Vallbo, 1980).

2.2.1 Implementations of PDs. In principle, each neuron can operate as a degenerated PD. When functioning as a “coincidence detector” (Goldberg & Brown, 1969; Abeles, 1982), a cell will fire only if a certain number of its inputs are synchronously active—that is, a single neuron detects a zero- or near-zero phase difference among its inputs. Thus, to serve as a PD, the neuron's inputs should be predominantly organized into two groups, with
the inputs always being temporally coordinated within each group. The neuron will function as an AND-like zero-phase detector if its threshold is set such that neither of these two groups of inputs is able by itself to activate the cell, but there is a high probability that synchronous operation of both inputs will activate it. Coincidence detection, however, is not sufficient for a PD. A usable PD should have a range within which its output is a monotonically increasing or a monotonically decreasing function of the phase difference (see appendix A.2).

2.2.1.1 A single-neuron AND-like PD. The PD capacity of a single neuron is due to the nonzero time constants of its inputs. If a neuron receives two major inputs, the range of delays that it will be able to resolve (i.e., its working range) will depend on the amplitude and time constants of the two inputs. For example, suppose the excitatory postsynaptic potentials (EPSPs) of both inputs, when measured at the axon hillock, exhibit short rising times, long decays, and relatively strong amplitudes (see Figure 4a). The longer the delay between the inputs—the phase difference—the shorter the period in which the membrane potential will be above threshold, and thus the fewer the spikes that will be emitted. Thus, the output rate of a single neuronal PD is generally a monotonically decreasing function of the input phase difference. For frequencies near 30 Hz, for example, the working range of an appropriately tuned neuron covers about half a cycle (see appendix A.5), which is satisfactory for a PD (Gardner, 1979). However, the refractoriness of a single neuron results in a poor output resolution—usually poorer than 2 ms. For example, tracking a frequency of 30 Hz with an error of less than 1 Hz requires the RCO to be informed about deviations as small as 1 ms in the input spike train. A single neuron with a refractory period of 2 ms or more cannot provide this resolution. In addition, single-cell reliability is limited, and noise will significantly influence the single-neuron PD's performance.

2.2.1.2 A population AND-like PD. To increase a PD's resolution, a number of single cells, say w, can be arranged in parallel such that all receive the same input, but with different delays (see Figure 4b). Let Two denote the effective width (see appendix A.5) of the RCO's output and Twi the effective width of the input. The most efficient phase detection occurs with Two = Twi. In this case, every phase difference between 0 and Two produces a different population sum (see appendix A.1) at the PD's output. Since the population sum is directly related to the overlap period, this dependency is monotonic. If the input delays are generated by constant and reliable delay lines, the phase differences will also be coded by the PD's population vector (see appendix A.1). Both “sum PD” and “vector PD” are valid PD implementations. Schematically, the two input signals to the PD can be described as square waves (see Figure 4c) whose duty cycles are determined by their effective
widths; an input will be considered as “on” at all time points in which, had the other input been considered “on,” summation of the generated EPSPs would be suprathreshold in at least one of the PD neurons. If the PD’s output is a linear function of the inputs’ overlap time, then the transfer function would take the form described in Figure 4d (see appendix A.5).
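In the square-wave idealization of Figure 4c, the PD's population sum reduces to the overlap of the input and RCO “on” windows, sampled by the w delayed copies of the input. A minimal sketch follows; the pulse-window reading and w ≈ 20 come from the text, while the uniform tiling of the delays and the function name are assumptions:

```python
import numpy as np

def and_pd_population_sum(eta_i, eta_o, T_wi, T_wo, w=20):
    """Spike count of a population AND-like PD for one cycle.
    The input is 'on' during [eta_i, eta_i + T_wi) and the RCO signal during
    [eta_o, eta_o + T_wo); each of the w PD cells sees the input after a
    different delay tiling T_wi and fires iff its delayed copy falls inside
    the RCO window. The count is proportional to the overlap of the two
    pulses, hence monotonically decreasing with the timing difference."""
    arrivals = eta_i + np.linspace(0.0, T_wi, w, endpoint=False)
    fires = (arrivals >= eta_o) & (arrivals < eta_o + T_wo)
    return int(fires.sum())

# With T_wo = T_wi, every timing difference between 0 and T_wo gives a
# distinct population sum (the most efficient detection):
for diff in (0.0, 3.0, 6.0, 9.0):
    print(diff, and_pd_population_sum(0.0, diff, T_wi=12.0, T_wo=12.0))
```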
Since gd is required to be monotonic, it is clear that a PLL can function only in limited ranges of phase differences: either within one of the monotonically increasing ranges (see Figure 4d, ePLL) or within one of the monotonically decreasing ranges (see Figure 4d, iPLL). In the AND-like ePLL, the RCO's output leads the input (ηo − ηi < 0), and in the AND-like iPLL, the RCO's output lags the input (see Figure 4d). Each of these implementations requires a different circuit to achieve the negative feedback (see appendix A.5). In the ePLL, the PD directly excites the RCO, and in the iPLL, the PD excites inhibitory interneurons (INH), which, in turn, inhibit the RCO (see Figure 3 and appendix A.5.2). Note that the PD transfer function is periodic. Thus, large, instantaneous input modulations can move the PLL from one working range to another, producing only instantaneous tracking errors—that is, losing or “filling in” one or more input cycles.
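A discrete-time caricature of the first-order iPLL can be built directly from these pieces: a linearly decreasing PD transfer function over its working range and an RCO whose period is lengthened by the PD-driven inhibition. The transfer functions and gains below are illustrative choices, not the article's equations; note that the input period exceeds the intrinsic period Tc, as the iPLL working range requires (Figure 4d), and that with the gain chosen so that g_o · R_max = Two, the loop cancels an input deviation in one cycle, as an ideal PLL should.

```python
import numpy as np

def simulate_ipll(input_times, Tc=25.0, Two=15.0, R_max=1.0, g_o=15.0):
    """Discrete-time caricature of a first-order iPLL (section 2.1).
    PD: Rd = R_max * (1 - lag/Two), a decreasing function of the RCO's lag
    behind the input over the iPLL working range [0, Two].
    RCO: next period = Tc + g_o * Rd; the PD-driven inhibition lengthens the
    period, so a larger lag -> smaller Rd -> less inhibition -> shorter next
    period: negative feedback. With g_o * R_max = Two, a deviation of the
    input ISI is corrected within one cycle."""
    eta_o = input_times[0] + 0.5 * Two      # start inside the working range
    output_rates, rco_times = [], []
    for eta_i in input_times:
        lag = float(np.clip(eta_o - eta_i, 0.0, Two))
        Rd = R_max * (1.0 - lag / Two)      # PLL output: rate-encoded signal
        eta_o += Tc + g_o * Rd              # next RCO spike time
        output_rates.append(Rd)
        rco_times.append(eta_o)
    return np.array(output_rates), np.array(rco_times)

# A temporally modulated input around a 32-ms carrier: once locked, the
# sequence of Rd values tracks the input ISI modulations one cycle later.
isis = 32.0 + np.array([0, 0, 2, -3, 1, 0, 4, -2, 0, 0], dtype=float)
rates, _ = simulate_ipll(np.cumsum(isis))
```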
Figure 4: Facing page. Implementations of neuronal AND-like PDs. (a) A single-cell PD. The two inputs, which are massive, generate two giant EPSPs with exponential decays (A = 0.9, τ = 10 ms). More spikes are emitted when the overlap time is larger (i.e., when there is a smaller time difference between the inputs). Vr, resting voltage; Vth, threshold voltage. (b) A population PD. A population of cells, each of which receives a version of the Input delayed by a different amount. The RCO signal decays more slowly (τ = 33 ms) than the Input and arrives simultaneously at all cells. As the time difference between the arrival of the RCO signal and the Input increases, fewer cells will be activated. (c) A schematic description of a population AND-like PD. The population signal of the Input is a pulse function, which is the "envelope" of all the delayed versions of the Input, ignoring fluctuations due to EPSP shapes. At any time when both the Input and the RCO signal are "on," it is assumed that at least one of the PD cells will be activated (see section 2.2.1.2). The RCO signal is described as a pulse function, where the pulse duration is defined by the part of the RCO-driven EPSP in which adding an EPSP of the Input (in any of the PD cells) will drive the membrane voltage above threshold. If the time difference between the two inputs decreases (dashed line of the RCO), the PD's output becomes stronger (two additional, dashed, spikes). (d) An example of a linear AND-like PD transfer function (gd). The output is stronger for time delays [ηo(n) − ηi(n)] that have smaller absolute values (larger overlap) and decreases monotonically in response to larger time differences. The exact form of the periodic transfer function depends on the input parameters (dashed lines; see section 2.2.1.2). The working range of the iPLL includes ISIs that are longer than the intrinsic period (Tc), while the working range of the ePLL includes ISIs that are shorter than the intrinsic period.
2.2.1.3 Other PD implementations. The AND-like implementation adopted here is not the only possible one. PDs could implement an OR function, in which the PD fires when either of its two inputs is active, or an AND-NOT function, in which the PD fires only when the input is active and the RCO is silent (E. Ahissar and M. Zacksenhouse, unpublished observations). Variations of these three basic mechanisms are also possible. For example, each input can activate the PD by itself, whereas a synchronous activation augments the PD's output (an AOR operation). All of these mechanisms can be implemented in either ePLL or iPLL configurations. Since the transfer function of the RCO is probably always a decreasing one, the potential working ranges for each implementation are those in which the PD function is increasing (see appendix A.2).

2.2.2 Implementations of RCOs. Almost any single neuron can be regarded as a voltage-controlled oscillator (VCO or VCON; Hoppensteadt, 1986). However, the PLL circuit presented here requires that the RCO exhibit an explicit periodic output activity. One possible model for a neuronal intrinsic oscillator is Calvin's regenerative firing mode (Calvin, 1975). According to this model, each spike is followed by a strong afterhyperpolarization (AHP), which recovers at some rate until the membrane potential reaches threshold again, generates a new spike, and restarts the process. The average ISI, To, is determined by the depth of the AHP and the average input to the neuron. Modulations of the input produce modulations of the RCO's ISI: inhibition extends Io(t), whereas excitation shortens it (Calvin, 1975; Perkel, Schulman, Bullock, Moore, & Segundo, 1964; Hoppensteadt, 1986; see Figure 5). In another possible model, the RCO has intrinsically generated subthreshold oscillations that become suprathreshold with an appropriate DC input. The frequency of such oscillations is often controlled by the input (Llinas, Grace, & Yarom, 1991). Both subthreshold and suprathreshold intrinsic oscillations often exhibit close-to-linear input-output (current-to-frequency) transfer functions (Calvin, 1975; Llinas et al., 1991; Silva, Amitai, & Connors, 1991). This implies that the input-rate to output-frequency transfer function of these oscillators is close to linear, since the amount of input current accumulated during a cycle is directly related to the rate of synaptic activation.

Three different frequencies are associated with an RCO. The intrinsic frequency (fc = 1/Tc) is the RCO's frequency when the input to the RCO is quiescent. The local frequency is the RCO's frequency when the input to the PLL is quiescent, a situation that may include spontaneous activity within the loop. The working frequency (fo = 1/To) is the RCO's average frequency during the decoding of a specific input.
Figure 5: A neuronal RCO mechanism. (a) Output signal. The thick traces describe the membrane voltage of an RCO with no modulating input—when the RCO oscillates at its intrinsic frequency. Additional excitation or less inhibition will increase the depolarization slope, and thus increase the frequency of the RCO’s output (top, thin trace). Additional inhibition or less excitation will decrease the slope, and hence decrease the frequency (bottom, thin trace). (b) A schematic transfer function of the RCO plotted as in Figure 2b. As the input (Rd ) increases, the ISI (Io ) decreases.
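For illustration, a minimal Python sketch of an RCO of the regenerative (Calvin-type) kind follows: the membrane ramps from an AHP reset back to threshold, and the net input scales the ramp slope, so excitation shortens the ISI and inhibition lengthens it, as in Figure 5. All parameter values and names are illustrative assumptions, chosen only to give an intrinsic period near 33 ms (about 30 Hz).

# Sketch of a regenerative (Calvin-type) RCO: after each spike, an AHP
# resets the membrane, which then ramps linearly back to threshold. The
# ramp slope grows with the net input, so excitation raises the firing
# frequency and inhibition lowers it.
def rco_isi(net_input, ahp_depth_mv=10.0, base_slope_mv_per_ms=0.3, gain=0.1):
    slope = base_slope_mv_per_ms + gain * net_input   # depolarization slope
    if slope <= 0.0:
        return float("inf")        # strong inhibition silences the oscillator
    return ahp_depth_mv / slope    # time to climb from AHP trough to threshold

print("intrinsic ISI (Tc):", round(rco_isi(0.0), 1), "ms")   # ~33 ms, ~30 Hz
print("with excitation   :", round(rco_isi(2.0), 1), "ms")   # shorter ISI
print("with inhibition   :", round(rco_isi(-1.0), 1), "ms")  # longer ISI

Note that in this sketch the output frequency, slope/ahp_depth, is linear in the net input, consistent with the close-to-linear current-to-frequency transfer functions cited above.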
2.3 Simulation. Validation of the basic idea of a neuronal PLL circuit and a demonstration of such a circuit's operation are provided by a simulation of a simple circuit that includes only the essential elements of the iPLL (see Figure 6a). The simulation was performed on a DEC 3100 workstation using Genesis, a general-purpose neuronal simulator (Wilson & Bower, 1989). Neurons were represented by two compartments: one that represented an excitable soma that obeyed Hodgkin-Huxley kinetics and another that represented the dendrites. Three types of synapses were simulated: (1) fast-excitatory, non-NMDA-like synapses with a conductance time constant (τ) of 1 ms; (2) slow-excitatory, NMDA-like synapses with τ = 20 ms; and (3) slow-inhibitory, GABAB-like synapses with τ = 20 ms. Axons were simulated as delay lines that conducted action potentials.
Figure 6: Computer simulation of a neuronal PLL circuit. (a) Simulated circuit and spike trains. The circuit was composed of one input cell (I), 20 PD neurons (PD), 20 different delay lines from the input to the PD neurons, and one RCO neuron (RCO) receiving an inhibitory input from each of the 20 PD neurons. The timings of the input spikes and the membrane voltage of the RCO are presented at the bottom. (b) The PLL’s output, which is the population output of the PD. The spike trains of the 20 PD neurons are depicted. Each line represents, as a function of time, the membrane voltage of one PD neuron. (c) The RCO’s integrated input—the total inhibitory conductance caused by synaptic input to the RCO neuron. (d) The instantaneous ISIs of the input I and the RCO are described as a function of time. After a lock-in stage, the two curves essentially merge.
Intrinsic oscillations were simulated by increasing the maximal sodium conductance by approximately 50% (Alonso & Llinas, 1989; Llinas et al., 1991; Silva et al., 1991). The intrinsic oscillating frequency of the simulation was set by tuning the membrane capacitance. The width of the loop was set to w = 20. The input was simulated by a single input cell (I) whose output was conveyed to the PD neurons via 20 axons, whose delays to the 20 PD neurons were uniformly distributed between 14 and 20 ms and which formed fast-excitatory synapses on the PD neurons. All 20 PD neurons converged on a single RCO neuron via slow-inhibitory synapses. For simplicity, the INH neurons were discarded and replaced by direct inhibitory connections from the PD to the RCO. A single RCO neuron represented the hypothesized 20 RCO neurons. This RCO neuron fed back, via slow-excitatory synapses, to each of the PD neurons.
The ability of this simplistic simulated PLL circuit to decode periodic modulations of periodic input signals was tested by "injecting" excitatory intracellular currents into the input cell's soma. Figures 6b–d depict the results of one simulation. The input signal was a 1 Hz modulation of a carrier frequency (35 Hz), with a modulation depth of 20%. The RCO frequency locks to the input frequency (see Figure 6d); the instantaneous ISIs of both the input and the RCO are described by the two curves. After a lock-in stage, the two curves merge, which indicates the frequency locking. In the locked condition, the input modulation of 1 Hz is decoded by the PLL and approximated by a 1 Hz population signal (both population vector and population sum) at the PD's output (see Figure 6b). At any given time, both the population vector (the actual firing neurons) and the population sum (total spikes across the population) represent the input ISI (within the PD resolution limits). The integrated inhibition (see Figure 6c) modulates the RCO's frequency. This integrated signal is an integration of the PLL's output, and it provides a measure of the population sum. This simple simulated circuit was able to decode modulations of up to 2 Hz with a 20% modulation depth. However, one cannot learn about the decoding limitations of the PLL from this simulation, since only a specific, limited circuit was simulated. For example, the resolution of PLL decoding depends on the number of elements, and the range of decodable modulation depths and frequencies, as well as the lock-in dynamics, depends on the loop gain. This simulation mainly demonstrates how PLL neuronal signals should look in principle.

To demonstrate the dependency of lock-in dynamics on the loop gain, I performed a MATLAB simulation of the iPLL, using equations A.4, A.9, and A.15 (see Figure 2) and a periodic PD function with the profile depicted in Figure 4d. The results are shown in Figure 7. All time variables are expressed in units of Tc (the RCO's intrinsic period; see appendix A.2). For an input period (Ti) of 1.2 Tc and an initial phase difference [ηo(0) − ηi(0)] of 0.3 Tc, the lock-in time was one cycle for a loop gain (G) of −1 (see Figure 7a). When G was too small in absolute value (lowermost trace), the RCO could not approach the input period. The reason was that with such gains, the phase difference that was required to follow Ti exceeded the PD's working range (Tc/2; see Figure 4d). Thus, with this specific PD function, the working range of the iPLL was Tc < Ti ≤ Tc(1 + |G|/2). iPLLs with G < −2 (uppermost, oscillating trace) were not stable (see appendix A.2, equation A.17). Between these two limits of G, iPLLs could lock in to the input, with lock-in times increasing with increased deviation of G from −1. However, lock-in times also depended on the initial phase difference (see Figure 7b). Thus, even with ideal PLLs, having G = −1, lock-in to the onset of an input train might take more than one cycle, due to the phase difference. A single-cycle lock-in is guaranteed only when the PLL is already locked to the input and a sudden change in the input periodicity is introduced, as demonstrated in Figure 7c. Here, after four cycles of
1.1 Tc the input ISI was changed to 1.4 Tc and from then on was modulated around 1.25 Tc with a modulation period of 10 Ti and a modulation depth of 0.4 (peak-to-peak). Six iPLLs with 0.5 ≤ |G| ≤ 1.75 were simulated. Lock-in to the input onset was not immediate, due to the nonoptimal initial phase difference (0.3 Tc). However, after the PLLs were locked, those that could track the maximal input period (those with G ≤ −1; see working ranges above) tracked it more or less smoothly. The tracking errors are plotted in Figure 7d. It can be seen that the iPLL with G = −1 (x's) tracked the input modulations with no errors, while the other iPLLs exhibited tracking errors, as expected (see section 2.1 and appendix A.3). The output rate produced by the iPLL with G = −1 is depicted in Figure 7e. Finally, the PLLs that could reach the maximal Ti (those with G ≤ −1 in this case) were able to follow the highest possible rate of input modulations, one modulation cycle per two input cycles (see Figure 7f). Note that the maximal rate of input modulations trackable by PLLs does not indicate the maximal resolution of temporal decoding by PLLs. The latter is set by the smallest deviation of the input ISI that a PLL can detect, which in turn is determined by the resolution of the PD's population output (e.g., the resolution of the y-axis in Figure 7e). Thus, PLLs can detect (and represent by rate) temporal changes with a resolution that can be much higher than the maximal rate of temporal modulations that they can track. For example, a PLL whose working range is 100 ms < Ti ≤ 150 ms can, with enough PD resolution, distinguish between inputs of 110 ms and 111 ms, although it cannot track a 1 kHz modulation.
Figure 7: Dependence of lock-in dynamics on input parameters. The iPLL algorithm was simulated in MATLAB (see text). Simulation steps (n) were counted from 1. (a) Input (stars and dotted line): A steady input period at Ti = 1.2Tc. |G| values (traces from bottom up at n = 2): 0.2, 0.6, 1.0, 1.4, 1.8, 2.2. Initial phase difference [Δη(1)] = 0.3Tc. (b) Input: As in a. G = −1. Δη(1) = 0.15, 0.30, 0.45, 0.60, 0.75, 0.90 Tc. (c) Input: Four cycles of 1.1Tc followed by Ii(n) = (1.25 + 0.25 sin(2π(n − 4)/10))Tc, for n > 4. |G| values (traces from bottom up at n = 2): 0.5, 0.75, 1.0, 1.25, 1.5, 1.75. iPLLs with G > −1 are plotted without symbols. Δη(1) = 0.3Tc. (d) Tracking errors for the simulation in c were computed as (Io(n) − Ii(n − 1))/Tc, for n > 1. Only G ≤ −1 are shown. G = −1 is plotted with x's. (e) The PLL's output rate (Rd), in arbitrary units, for the simulation in c. (f) Input: Ti = 1.25 Tc, modulation rate 1/(2Ti), modulation depth 40%. |G| values as in c. Δη(1) = 0.1Tc.
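The lock-in behavior of Figure 7 can be reproduced qualitatively with a few lines of Python. This is a hedged sketch, not the MATLAB code or equations A.4, A.9, and A.15: it assumes a once-per-cycle update, normalized units (Tc = 1), and a triangular periodic AND-like PD whose output lengthens the RCO's ISI (the inhibitory loop).

# Hedged sketch of the discrete iPLL iteration (units normalized to Tc = 1).
def g_d(dphi):
    """Triangular periodic AND-like PD: 1 at zero phase difference,
    falling linearly to 0 at half a cycle (period 1)."""
    x = (dphi + 0.5) % 1.0 - 0.5          # wrap to [-0.5, 0.5)
    return 1.0 - 2.0 * abs(x)

def ipll(Ti, G, dphi0=0.3, n_steps=8):
    """Iterate the loop. The inhibitory PD output extends the RCO's ISI:
    Io = Tc + (|G|/2) * g_d(dphi), giving the working range
    Tc < Ti <= Tc(1 + |G|/2) stated in the text."""
    dphi, isis = dphi0, []
    for _ in range(n_steps):
        Io = 1.0 + 0.5 * abs(G) * g_d(dphi)   # inhibition extends the ISI
        dphi += Io - Ti                        # phase-difference update
        isis.append(Io)
    return isis

for G in (-0.2, -1.0, -1.8, -2.2):
    isis = ipll(Ti=1.2, G=G)
    print(f"G = {G:+.1f}:", " ".join(f"{x:.3f}" for x in isis))
# G = -0.2 cannot reach Ti = 1.2 (outside its working range); G = -1 locks
# in a single cycle; G = -1.8 converges with damped oscillations; G = -2.2
# oscillates without converging, as in Figure 7a.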
3 Tactile PLLs

The mammalian tactile system contains the neuronal elements required for the function of thalamocortical PLLs. Following is a proposal for a plausible implementation of PLLs by the primate tactile system. The peripheral tactile system, which acquires sensations during exploration of textures, has been described in detail over the past three decades. The tactile system includes the three following subsystems, which are classified according to the temporal nature of their responses: slowly adapting (SA) receptors and neurons, which respond optimally over the low range (∼0–20 Hz) of stimulus frequencies; rapidly adapting (RA) receptors and neurons, which respond best over frequencies of medium range (∼20–40 Hz); and Pacinian (PC) receptors and neurons, which mainly transfer information at high frequencies (>80 Hz) (Talbot et al., 1968; Freeman & Johnson, 1982; Johansson, Landstrom, & Lundstrom, 1982; Goodwin, John, Sathian, & Darian-Smith, 1989). The glabrous fingertip is innervated mainly by RA receptors, by lower numbers of SA receptors, and by only a small number of PC receptors (Johansson & Vallbo, 1979; Darian-Smith & Kenins, 1980). Less is known about the central mechanisms underlying tactile decoding and processing. The tactile pathways from the periphery to the cortex preserve the phase of the stimulus (Darian-Smith & Oke, 1980; Mountcastle, Talbot, Sakata, & Hyvärinen, 1969; Ferrington & Rowe, 1980; Burton & Sinclair, 1991; Gardner, Palmer, Hamalainen, & Warren, 1992). However, the degree of phase locking gradually decreases along the afferent pathways, with the largest reduction probably occurring at the transition from the thalamus to the cortex, a transition that is also accompanied by an increased complexity of response (Darian-Smith, Sugitani, Heywood, Karita, & Goodwin, 1982; Sinclair & Burton, 1988; Burton & Sinclair, 1994). This increased complexity could be due to significant processing that occurs already at the thalamocortical level (Gottschaldt, Vahle-Hinz, & Hicks, 1983).

3.1 Temporal Encoding of Textures. I consider here only textures (of variable patterns and heights) on flat surfaces (e.g., textures of sandpapers, clothes, woods, artificial gratings, or braille pages). A finger traversing these types of surfaces usually moves along sections of approximately straight lines (see Figure 8, top). The information contained in these textures, which consist of collections of ridges, is expressed by three variables: amplitude, average spatial period (where period is 1/frequency), and local spatial modulations. The information carried by the average spatial period (Xi) is called here the roughness of the surface, and the information carried by local spatial modulations (pi) the pattern of the surface. I use italics to distinguish this specific stimulus-defined roughness from the more general roughness percept. As I will show below, decoding roughness information can contribute to the roughness percept. When a surface is traversed by fingertips, the spatial information is encoded in two ways:

1. Spatial encoding: Across the contact area (∼0.6 cm2 in humans), at any given moment, the spatial features are reflected by the corresponding skin deformations, leading to a spatially encoded response of the relevant receptor population.

2. Temporal encoding: At any given skin location, receptors respond to the fluctuations of the indentation amplitude produced by the movement (see section A.6).

Spatial encoding is probably best mediated by the SA receptors (Phillips, Johansson, & Johnson, 1990) and needs to be decoded by mechanisms utilizing spatial comparisons. Temporal encoding is probably best mediated by RA (and, to a lesser degree, PC) receptors, which respond reliably to temporal modulations (Darian-Smith & Oke, 1980; Morley & Goodwin, 1987). Such temporally encoded signals could be efficiently decoded by thalamocortical PLLs. However, since point skin indentations are modulated by both amplitude (due to vertical surface fluctuations) and time (due to horizontal interval fluctuations; see Figure 8, top), the interpretation of the decoded signals could be ambiguous. Electronic implementations of PLLs, facing similar problems, always include an amplitude limiter at the input
stage (Gardner, 1979). Interestingly, the RA peripheral system employs a similar mechanism. Responses of RA mechanoreceptive fibers to vibratory stimuli, or moving spatial gratings, with amplitudes ranging from less than 10 to hundreds of microns, are often of a 1:1 type; they fire one and only one spike per vibratory or grating cycle, regardless of the amplitude (Talbot et al., 1968; Darian-Smith & Oke, 1980; Goodwin & Morley, 1987; Gardner & Palmer, 1989). Every RA fiber exhibits a 1:1 response within a specific range of amplitudes (termed the "plateau" range; Talbot et al., 1968) and temporal frequencies (Darian-Smith & Oke, 1980). Outside these ranges, fibers respond with bursts of variable lengths per cycle (Darian-Smith & Oke, 1980; Morley & Goodwin, 1987), depending on the force and frequency of the stimulus (Darian-Smith & Oke, 1980; Goodwin et al., 1989). Thus, in the case of the RA system and with a constant finger velocity, the encoding of the horizontal features of textures is straightforward. The horizontal (or temporal) modulations of the periodic indentation profile, as a function of x (or t), can be described by the two methods used above to describe temporal periodic signals (see Figure 1 and appendix A.1): either with respect to an imaginary "carrier" signal (see Figure 8a, Pi(n)) or with respect to the spatial intervals themselves (see Figure 8a, pi(n)) (see appendix A.6). During scanning, the timing of the nth mechanoreceptive spike is uniquely determined by the location of the nth ridge (see Figure 8b). If the response type is 1:1, the RA mechanoreceptive fibers should fire one and only one spike per ridge in the surface, and the sensory transformation takes a simple form: the horizontal spatial structure is directly represented by the temporal structure of the RA spike trains (see appendix A.6). With different ranges of finger forces, the 1:1 response becomes a 1:n response, and the transformation is more complex. However, as long as the duration of the bursts is small relative to the input average cycle (which is usually the case; Darian-Smith & Oke, 1980; Morley & Goodwin, 1987), decoding efficiency should hardly be affected, since input onset times, which are the important parameters for the decoding, are not affected. Yet the increased input intensity and duration caused by the bursts should be compensated for by a proper tuning of the PLL's loop parameters (S. Serulnik & E. Ahissar, unpublished observations). Thus, for optimal performance, PLL parameters should be tuned according to the expected form of input bursts.

3.2 Decoding by Thalamocortical PLLs. As I will show, the decoding of tactile signals by PLLs requires an additional feedback loop. Thus, the postulated temporal tactile decoder, as one module within a global tactile texture decoder, includes many parallel PLLs embedded within a sensorimotor feedback loop (see Figure 9). According to the model, the movement of fingers across a surface activates skin mechanoreceptors (MR), which convert the spatial details into temporal signals. The RA and PC mechanoreceptors at the fingertip include amplitude limiters (L), which eliminate amplitude modulations. The parallel array of input filters (IF; mechanoreceptors and
their fibers) transfers the filtered signal to an array of somatotopically organized PLLs, each specifically tuned to a particular frequency restricted to one of the tactile submodalities (SA, RA, or PC). Therefore, every point on the skin drives a set of PLLs, each tuned to a particular frequency (see Dykes,
1983). The output of all the PLLs is fed to two readout networks, IP and IR, for pattern and roughness evaluation, respectively. The IR's output drives the velocity controller (VC), which closes the loop by controlling the finger velocity. Each PLL thus processes information about different spatial frequencies of the explored surface. How does the brain know which PLLs provide relevant information about the actual surface being explored, and how can it focus on these PLLs? If the PDs employ nonperiodic, sigmoid-like transfer functions, the answer to the first question would be simple. Only PLLs that are tuned to the relevant (i.e., informative) temporal frequencies should present modulated output signals. All other PLLs should produce outputs that are saturated at either the highest or the lowest possible values. Thus, the modulation depth of the AC output component, or a related measure such as |Rac(t)|, should provide a reliable measure of the amount of information contained in each PLL's output. This criterion appears to be valid also for periodic PDs (like the one in Figure 4d). As the input average frequency moves away from the center of the working range, the probability that the instantaneous frequencies will exceed the bounds of this range increases. Once the input frequency exceeds one of these bounds, the PLL's output is closer to its average value, and its modulation depth decreases. Thus, local maxima of |Rac(t)| represent maximal information. Furthermore, it is most likely that these local maxima will be graded among different PLLs, and a global maximum will also be available.
Figure 8: Facing page. Illustration of a temporal encoding of spatial features. The movement of the hand (arrow) across a surface generates skin displacements at the zone of contact. This series of displacements can be described as a spatial signal [ui(x)] that represents the texture in this one-dimensional direction of movement. (a) Decomposition of the spatial signal. ui(x) can be decomposed into vertical [Ai(n)] and horizontal [either an imaginary "carrier" + Pi(n), or pi(n)] components. The similarity between Pi(n) and pi(n) in this example is due to the regularity of the pattern. Gi(n) is the interridge interval. (b) Receptor transformation. Assuming a 1:1 response of mechanoreceptive fibers and a constant velocity, ui(x) is converted to a temporal signal described by si(t). The temporal signal, which is carried by the mechanoreceptive fibers, can also be decomposed into subcomponents. However, due to the 1:1 response, which neglects amplitude changes, the amplitude component is constant and equal to 1, and thus is ignored. Ii(n) is the ISI. Refer to appendix A.1 for the definitions of other terms. (c) Decoding by an ideal PLL. The phase-locking mechanism of the PLL forces the RCO's output [so(t)] to track the peripheral input [si(t)] with a delay of one cycle (Ti) and a constant phase difference (Δφ). As a result, the brain can extract the modulation (Mo(n) or mo(n)) that describes the pattern and the average interval (To) that describes the roughness. See appendices A.1, A.3, and A.6 for explanation of other symbols.
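Given the 1:1 transduction just described, the surface-to-spike-train mapping of Figure 8b can be stated in a few lines. The following Python sketch is illustrative only: the grating parameters, the sinusoidal "pattern" term, and the function names are hypothetical; only the relation spike time = ridge location / velocity follows the text.

import math

# 1:1 temporal encoding (section 3.1): one spike per ridge, with spike
# timing determined by ridge location and a constant finger velocity.
def ridge_locations_mm(n_ridges=8, avg_period_mm=1.0, modulation_mm=0.15):
    """A regular grating (roughness) with small local displacements
    (the pattern)."""
    return [n * avg_period_mm + modulation_mm * math.sin(2.0 * math.pi * n / 4.0)
            for n in range(n_ridges)]

def spike_times_ms(ridges_mm, velocity_mm_per_s=30.0):
    """1:1 response: the nth spike occurs when the finger reaches the
    nth ridge (transduction delays ignored)."""
    return [1000.0 * x / velocity_mm_per_s for x in ridges_mm]

spikes = spike_times_ms(ridge_locations_mm())
isis = [b - a for a, b in zip(spikes, spikes[1:])]
print("ISIs (ms):", [round(i, 1) for i in isis])
# The mean ISI (~33 ms here, i.e., ~30 Hz) encodes roughness; the
# fluctuations around it encode the pattern.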
Figure 9: PLLs within a global texture decoder. Many topographically organized PLLs reside in parallel; only six are shown in the figure. Each PLL is tuned to a specific frequency range within a specific submodality: SA, RA, or PC. The input of each PLL is received from a set of mechanoreceptive fibers through a set of dorsal column nuclei relay neurons, which together comprise the input filter (IF). Most of the mechanoreceptors (MR) include amplitude limiters (L). IP and IR are readout networks that produce pattern- and roughness-related information, respectively. A hypothetical spatial decoder illustrates the operation of additional mechanisms in parallel.
The reason is that each PLL can obtain a larger working range if it tunes the slope of its PD function according to its average frequency—smaller slopes for lower frequencies. In this case, the global maximum will indicate the PLL whose working range is fully exploited. Since neuronal excitation is often sensitive to the variability at the input (Aertsen, Erb, & Palm, 1994), circuits that detect maximal variabilities can be implemented. If such circuits are included in the PLLs' readout networks (e.g., IP in Figure 9), they can assist in the selection of one of the submodalities (SA, RA, or PC) and of the specific PLLs within that submodality that are most informative. Other factors affecting this selection probably include visual, cognitive, and additional tactile information, such as that obtained by spatial decoders (see Figure 9). According to this selection, the finger velocity is deliberately set to be in the range that will generate temporal frequencies appropriate for the chosen PLLs. By setting the finger velocity, the system focuses on the selected PLLs, since they will generate the most informative output. To keep this focus steady, an automatic
feedback system is required to carry out the fine-tuning of the velocity. Such a feedback system could be tuned to maximize the amount of output information from the selected set of PLLs, using |Rac(t)| as a measure. Although such an operation makes sense, I propose that if it is implemented, it is implemented as a higher-order feedback loop. For maintaining the input frequencies around a selected PLL's working point, such a feedback system could simply operate on fluctuations of the averaged PLL output, Rdc. Let us represent each selected group of PLLs by a single PLL. Once a particular PLL is selected, the sensorimotor circuit responsible for temporal decoding can be described by two loops (see Figure 10a): the inner loop is the selected PLL, which extracts the input temporally encoded information (see Figure 8) and recodes it by rate (see appendix A.7), and the outer loop is an automatic velocity control (AVC), which keeps the input frequency of the PLL centered around the PLL's working frequency. IP and IR are reduced in this description to single filters, assumed to produce outputs related mainly to the selected PLL. The other inputs to IP and IR are assumed to be averaged out. The general case of processing, in which the average input frequency can change (albeit slowly) over time, even beyond the working range of the PLL, is assumed here. Thus, both the average input ISI and the average PLL output are functions of time (Ti(t) and Rdc(t), respectively).

3.2.1 The automatic velocity control (AVC). When the PLL is locked, the RCO's average ISI is approximately equal to the average input ISI (To(t) ≈ Ti(t); see appendix A.3). If either the average input spatial period (Xi(t)) or the finger velocity (V(t)) is changed, the PLL will move to a new working point at which To(t) ≈ Ti(t). Such a new working point will be associated with a new average output rate (Rdc(t)) of the PLL. However, if the working range of the PLL is limited, as is the case for any practical implementation, this adaptive process is also limited, and consistent drifts in the input average frequency can eventually lead to a loss of locking as the PLL leaves its working range. From the point of view of the sensorimotor system, there are two possible solutions to this problem: it can either have many PLL circuits, each tuned to a different working range (the open-loop approach), or it can actively maintain the input temporal frequency within a working window (the closed-loop approach). The closed-loop approach, whose operation is postulated here, requires that if To(t) is driven from the center of the PLL's working range toward its limits, an action be taken to bring Ti(t) back toward its original value via control of the finger velocity, V(t) (see appendix A.8).
Figure 10: The algorithm of the AVC loop. (a) Loop components. MR, multiplier that includes an amplitude limiter; PLL, one of the PLL circuits in Figure 9 that is selected for optimization; BBF, base-band filter; LPF, low-pass filter; VC, velocity controller. ξi(n), location of the nth input ridge; Xi(x), average interridge interval; ηi(n), timing of the nth input spike; Ti(t), average input ISI; Rd(t), firing rate of the PLL's output; Rac(t), the integrated signal representing the pattern; Rdc(t), the integrated signal representing changes in roughness; V(t), the finger velocity. The loop equations (inset) are explained in appendix A.8. (b) Schematic examples of transfer functions. The crossing point (Tcw, Rcw) is the working point of the AVC, which, optimally, fits the desired working point of the selected PLL.
The algorithm for the tactile AVC is composed of five elements (see Figure 10a): (1) a multiplier (MR), which multiplies the finger velocity by the spatial frequency of the texture; (2) a PLL circuit, which converts the roughness and pattern information to the DC and AC components of a rate signal; (3) a base-band filter (BBF), which transfers only the frequencies related to the pattern; (4) a low-pass filter (LPF), which transfers only the frequencies related to changes in the roughness; and (5) a velocity controller (VC), which controls the finger velocity. The negative feedback nature of the AVC maintains the PLL's mean input ISI close to the PLL's desired working point (Tcw, Rcw), which is the center of the working range of the selected PLL. An increase in either the average spatial frequency or the finger velocity will result in the input's average ISI decreasing, Rdc(t) increasing, and V(t) decreasing (see Figure 10 and section A.8). As a result, Ti(t) will be driven back toward Tcw, with dynamics that depend on the actual loop transfer functions. An opposite reaction occurs when either the average spatial frequency or the finger velocity decreases. Note that this servo operation holds for any given transfer functions, provided that they establish a negative feedback. Thus, the dependence of tactile inputs on motor outputs (Chapman, 1994; Nelson, 1996) should affect the details of the AVC operation but not its principles.
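A compact sketch of this servo, with illustrative linear forms, is given below. Only the signs of the dependencies follow the text (the rate decreases as the input ISI grows; the velocity decreases when the rate exceeds its working-point value); the specific functions, gains, and numbers are assumptions.

# Hedged sketch of the AVC negative feedback (Figure 10a): the velocity
# controller nudges finger velocity so that the PLL's average input ISI
# stays near the working point (Tcw, Rcw).
def rdc(Ti_ms, Tcw_ms=33.0, Rcw=0.5, slope=0.02):
    """Average PLL output rate: decreases as the average input ISI grows."""
    return Rcw - slope * (Ti_ms - Tcw_ms)

def avc(Xi_mm, V0_mm_per_s=30.0, k=20.0, n_steps=10, Tcw_ms=33.0, Rcw=0.5):
    """Iterate: velocity -> input ISI -> output rate -> velocity correction."""
    V = V0_mm_per_s
    for _ in range(n_steps):
        Ti = 1000.0 * Xi_mm / V        # average input ISI (MR stage)
        R = rdc(Ti, Tcw_ms, Rcw)       # PLL + LPF output (roughness channel)
        V = V - k * (R - Rcw)          # VC: rate above setpoint -> slow down
    return V, Ti

# Coarser textures settle at higher velocities, keeping Ti near Tcw = 33 ms.
for Xi in (0.8, 1.0, 1.3):
    V, Ti = avc(Xi)
    print(f"Xi = {Xi:.1f} mm -> V = {V:.1f} mm/s, Ti = {Ti:.1f} ms")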
3.3 Implementations of Tactile PLLs.

3.3.1 Implementations of tactile PDs. The tactile RA system appears to have evolved such that thalamic RA "relay" cells can be used as efficient phase detectors. The main features of the RA system contributing to this efficiency are (1) a rectangular-like distribution of the conduction velocities of RA fibers (Talbot et al., 1968; Darian-Smith & Kenins, 1980); (2) a close-to-uniform receptor sensitivity across the receptive field (Johansson & Vallbo, 1983; Gardner & Palmer, 1989); and (3) slow (long-duration) cortical-to-relay-neuron excitatory postsynaptic potentials (EPSPs) (Deschenes, Paradis, Roy, & Steriade, 1984). In general, the first two of these features are also typical of the PC, but not of the SA, subsystem; the third is probably common to all three of the tactile subsystems. If organized correctly, the lemniscal input to the thalamic relay neurons can implement a square-wave-like signal, like that in Figure 4c. Given features 1 and 2 of the RA system, the lemniscal input contains subpopulations in which, for a given point stimulus, different subsets of the input are active at different times. A "point" stimulus—an abrupt indentation at a single skin location—will generate a uniform response across all RA receptors that include that point in their receptive field, due to the uniform sensitivity of each receptor across its receptive field. When light touch is used, the skin indentation is assumed to be within a plateau range of amplitudes, where the response has the form of one spike per point stimulus (Talbot et al., 1968; Darian-Smith & Oke, 1980). If each subpopulation of fibers that share a skin location contains fibers with different conduction velocities, these activations will arrive at the thalamic relay neurons at different times for each fiber, as in Figure 4b. In this case, each of the fibers can be considered a delay line generating a specific delay from skin activation to the firing of a lemniscal fiber. For the conduction velocities (Talbot et al., 1968; Darian-Smith & Kenins, 1980) and hand length (∼50 cm) of monkeys, the spread of lemniscal firings probably contains mainly latencies between 7 and 14 ms, not including the duration of input bursts. This range corresponds to about one-fourth of a cycle of 30 Hz oscillations and is a reasonable range for a PD (see section 2.2.1). However, different spreads of the afferent signal are optimal for different PLL working frequencies. Thus, it is expected that channels conveying lower frequencies will employ larger temporal spreads.
3.3.2 Implementations of tactile RCOs. Obvious candidates for RCOs are the posterior SII local oscillators (Ahissar & Vaadia, 1990). Many of the neurons in this area display oscillatory patterns; however, not all of them can be considered local oscillators. At least 15% of the neurons in that area probably oscillate due to local mechanisms (Ahissar & Vaadia, 1990). The rest of the oscillating neurons (about 30% of the population) are either externally driven by the local oscillators, or their local oscillations are masked by a significant amount of noncorrelated input. The local oscillators can either directly drive thalamocortical neurons, if they project to the thalamus, or drive corticothalamic neurons. Note that these single-cell oscillations do not merely reflect sleeplike thalamocortical spindles (Steriade, McCormick, & Sejnowski, 1993), since they appear in wakefulness, include mainly gamma frequencies, and are not correlated among neighboring neurons. There is no direct evidence yet that indicates an RCO-like operation of the SII oscillating neurons. However, these neurons lose their oscillatory patterns when stimulated with nonperiodic tactile stimuli (Ahissar & Vaadia, 1990). This finding is consistent with the cortical oscillators trying to track the nonperiodic input. More important, the distribution of oscillating frequencies of these oscillators matches the peripheral distribution of best frequencies (see Figure 11). More direct evidence was obtained for SI oscillators in monkeys, employing RA frequencies (∼30 Hz; Lebedev & Nelson, 1995), and in rodents, employing whisking frequencies (∼10 Hz; Ahissar, Alkon, Zacksenhouse, & Haidarliu, 1996). These oscillators can be entrained by tactile periodic stimuli near their spontaneous frequencies, but usually not by significantly higher or lower frequencies. Thus, PLL circuits might exist in parallel in thalamocortical loops involving the SI and SII cortices.

3.3.3 Implementations of readout networks. Each of the two readout networks, IP and IR, should implement at least two functions. The simpler one, which is required for the AVC operation, is filtering out the unnecessary information. Both low-pass and bandpass filters are easy to implement by neuronal networks, utilizing synaptic integrations and decays. In addition, these networks should probably include circuits that compute and compare input variabilities (IP) and input averages (IR). Detailed implementations of these filters are beyond the scope of this article. Although the PLLs' outputs are described as converging to the readout networks (see Figure 9), they do not necessarily have to converge. The readout networks can utilize parallel processing and produce population outputs. Accordingly, the single lines standing for the outputs of the two filters in Figure 9 denote the unity of information conveyed by their outputs rather than the outputs' physical widths.
Figure 11: Cortical oscillating frequencies and peripheral frequency tunings. (a) Distribution of oscillating frequencies of cortical (posterior SII) single-cell oscillators (104 frequencies observed in 76 neurons, of which 18 exhibited more than a single frequency). Only frequencies larger than 2.8 Hz were included, for compatibility with the peripheral data. (Modified from Ahissar & Vaadia, 1990.) (b–d) Estimated distributions of peripheral tuning to the frequency of sinusoidal skin displacements. Ordinates depict estimations, for each input frequency, of the number of mechanoreceptive fibers that innervate the skin contact area and are tuned to that frequency. The distribution of tuning to vibratory frequencies among the input fibers was estimated here by calculating the "equivalent number of fibers" tuned to each frequency. The equivalent number of fibers per submodality is the fraction of the average response of that submodality at each frequency multiplied by the average number of fibers of the same submodality that innervate the stimulated area of skin. During light touch, the contact areas of skin for humans and monkeys are ∼0.6 cm2 (Lamb, 1983) and ∼0.2 cm2 (Goodwin & Morley, 1987), respectively. (b) Mean responses of mechanoreceptive fibers were obtained from data published for humans (Johansson et al., 1982) and innervation densities from data published for monkeys (Darian-Smith & Kenins, 1980). Skin contact area was assumed to be 0.2 cm2. Peak-to-peak indentation amplitude is 16 µm. (c) Same as b, except that the indentation amplitude is 32 µm. (d) Same as b, except that the innervation densities were obtained from data published for humans (Johansson & Vallbo, 1979) and the skin contact area was assumed to be 0.6 cm2.
4 Discussion

4.1 Advantages and Limitations of PLLs. The PLL algorithm is used extensively in electrical engineering for decoding of phase- and frequency-modulated signals, frequency synthesis, and pulse synchronization. When utilized as a phase demodulator, a PLL exhibits excellent noise immunity due to its adaptive narrowband filtering (Gardner, 1979). This narrowband filtering is achieved by comparing the input against a specific internal frequency and becomes adaptive because of the feedback control of the internal frequency. In principle, temporally encoded neuronal signals (see Figure 1) are phase-modulated signals; therefore, utilization by the brain of a PLL mechanism to decode temporally encoded signals should be advantageous. However, there are limitations inherent in the PLL mechanism that the brain would have to compensate for. One limitation arises from the adaptive behavior of the PLL, which limits the PLL's capacity to track rapid changes in the input. As with any other negative feedback loop, a few input cycles may be needed before the PLL can lock in to a new input and efficient decoding can commence. Nevertheless, learning-induced fine tuning of the loop parameters can reduce the number of lock-in cycles to a minimum (down to one cycle). Another limitation of PLLs stems from the limited working ranges of their implementations. A PLL cannot track, although it can detect, input modulations that are larger than its working range. The working ranges of basic neuronal PLLs are usually around half a cycle, with the upper limit probably being one cycle (see section 2.2). Thus, a typical, "nonsophisticated" neuronal PLL is limited to inputs with modulation depths of less than 50%. If an RCO cannot produce the required frequencies, the PLL's working range will be even more limited. This limitation can be circumvented by having several PLL circuits in parallel, each tuned to a different frequency range and decoding a different segment of the input information. In addition, "sophisticated" implementations of PDs can extend working ranges and reduce lock-in times. A significant advantage of neuronal PDs is that transitions from one implementation to another can occur within a given anatomical circuit by changing cellular parameters. For example, at low excitability levels, a PD neuron can implement an AND-like function; at high excitability levels, an OR-like function; and at intermediate excitability levels, an AOR-like function (see section 2.2.1.3). Thus, neuronal PLLs can dynamically change their loop parameters, including gain and working range, to accommodate global sensory changes or requirements. For example, a full-cycle working range can be implemented by asymmetrical PDs that employ an AND-like function for negative phase differences [ηo(n) − ηi(n) < 0] and an OR-like function for positive phase differences. In such an asymmetrical AOR-like PD, the order of input activation determines the sensitivity of the PD neurons. The periodic PD transfer function of such asymmetrical PDs has the shape of a sawtooth instead of the triangular shape of the symmetrical PDs (see Figure 4d). The advantages of a sawtooth PD function are that the
working ranges are larger and that, with very large input modulations, the PLL immediately shifts to another valid working range. Other options that are probably available for neuronal PLLs are dynamic tuning of the RCO's local frequency, asymmetric RCO transfer functions, and combined excitatory-inhibitory implementations of PLLs (see Figure 3a) with asymmetric or dynamically shifted relative weights.

4.2 Plausible Sites for PLLs. Neuronal circuits that contain local oscillators can probably function as PLLs without any specific tuning. In principle, the feedback connections and the basic phase detection function of any neuron (see section 2.2.1.1) establish the essential requirements of the loop. Nevertheless, efficient operation at a specific frequency range requires additional tuning of the cellular and circuit parameters (see section 2.2). Some neuronal systems have at least some of the required parameters for efficient PLLs. Following is a summary of the requirements on the circuitry and local oscillators and a review of plausible sites.

4.2.1 Potential Circuits. The sensory thalamocortical loops are attractive candidates for PLL circuits, since PLL performance improves when the RCO-to-PD connections are massive, and having PLLs early in a processing stream would be advantageous for facilitating sensory-sensory integration. Nevertheless, feedback circuits within or between cortical areas could function as PLLs as well. Within thalamocortical loops, the natural implementation would be for the thalamic relay neurons to function as PDs and corticothalamic neurons at deep cortical layers to function as RCOs. Natural candidates for INH neurons in iPLLs are the cortical inhibitory interneurons in layer 4 (White & Keller, 1987; Agmon & Connors, 1992; Swadlow, 1995). However, other combinations, including inhibitory neurons of the reticular nucleus of the thalamus, or oscillatory neurons in superficial layers that drive the corticothalamic neurons, are also possible, as long as the loop transfer functions establish a stable negative feedback loop (see appendix A.2). Within thalamocortical systems, many PLLs are expected to function in parallel, each tuned to a different combination of receptive field and working range.

4.2.2 Local oscillators. In a PLL, a local oscillator should function as an RCO; that is, its output frequency should be controllable by the input. The RCO's oscillations can be sub- or suprathreshold during spontaneous activity, as long as the oscillations become suprathreshold when decoding starts. With single cells or groups of cells that oscillate due to intrinsic mechanisms, control of the frequency of oscillations by the input is expected to obey the simple neuronal rules required by the PLL: excitatory inputs should increase the frequency of oscillations, whereas inhibitory inputs should decrease it (see section 2.2.2). Thus, single-cell oscillators are excellent candidates for RCOs. In contrast, oscillations generated outside the
processing network are not expected to be affected by the excitation levels of the network and thus cannot function as RCOs. The effect of neuronal input on a network that oscillates due to specific connectivity of excitatory and inhibitory cells is not obvious and depends on the exact connectivity and the exact input. Thus, neuronal ensembles that exhibit network oscillations will be considered here not as candidates for RCOs but rather as circuits that can be driven by RCOs. Brain activity contains oscillations in a wide range of frequencies, from circadian to millisecond ranges. However, only frequency ranges compatible with perceptual time scales, during which sensory temporal codes should be converted to motor rate codes, will be discussed here. Emphasis will be placed on the 10–100 Hz range, although lower and higher frequencies can probably also be used for perceptual processing by PLLs.
4.2.3 The primate somatosensory system. Candidates for somatosensory RCOs were presented in section 3.3.2. If PLLs indeed operate in somatosensory thalamocortical circuits, the network oscillations that occur in the primary somatosensory and motor areas (reviewed in Fetz, 1993) during the performance of tactile tasks, or during periods of global excitation, could be due to propagation of the working frequencies from the PD neurons to the sensorimotor areas. If input modulations are not overly strong, the working frequency is preserved in the synchronous PD firings, even though the firing rate of the whole population might be modulated (see Figure 6).
4.2.4 The rodent vibrissal system. Many rodents achieve tactile sensory acquisition with an active process in which their whiskers move back and forth in a sinusoidal-like manner at frequencies near 10 Hz (Welker, 1964; Simons, 1995). Rodents use such whisking to localize (Welker, 1964) and identify objects, with strategies and resolution capabilities comparable to those that primates achieve by applying manual active touch (Simons, 1995). During active whisking, the vibrissal pathway of the rat oscillates synchronously at around 10 Hz (Nicolelis, Baccala, Lin, & Chapin, 1995). These oscillations probably originate in the cortex, but in every cycle, the peripheral neurons fire first, followed by the firing of cortical ones, which in turn is followed by the firing of thalamic neurons. These observations are fully consistent with PLL circuits of 10 Hz operating in the thalamocortical vibrissal system of the rat. Recently we observed that cortical oscillators in the somatosensory cortices of anesthetized rats and guinea pigs exhibit three modalities of oscillating frequencies, at roughly 1, 10, and 100 Hz (Ahissar et al., 1996). It is possible that the ∼10 Hz oscillators are utilized in PLLs that detect the location of external objects and that the ∼100 Hz oscillators are utilized in PLLs that decode the texture of these objects (see Carvell & Simons, 1995).
4.2.5 The auditory system. Under normal conditions, most of the single-cell oscillations in the auditory cortex have frequencies below 14 Hz (Ahissar & Vaadia, 1990). Thus, if auditory PLLs exist, they probably decode low-frequency information derived from relatively slow processes such as speech or movements of sound sources (Ahissar, Ahissar, Bergman, & Vaadia, 1992).

4.2.6 The olfactory system. Network oscillations occurring in the olfactory system (Freeman, 1975) are probably utilized to enhance cortical processing or to encode sensory information (Hopfield, 1995), but not to decode temporally encoded sensory information, which is not conjectured, in the sense used here, for this system.

4.2.7 The visual system. During stimulations, the visual pathway often exhibits synchronous, wide-band oscillations (Neuenschwander & Singer, 1996; Engel et al., 1992; Eckhorn, 1994). Whether these oscillations are utilized for the decoding of temporally encoded information or for recoding spatially encoded information is not yet clear. The fact that the internal frequencies are usually much higher than the so-called temporal frequencies of the stimulus (i.e., the frequencies at which single receptors are stimulated) is not indicative in this case. The temporal structure of the retinal output depends also on the frequency of the sequential activation of neighboring receptors since, at least in cats, several receptors usually converge onto single ganglion cells. The direct dependency of cortical frequencies on stimulus velocities (Gray, Engel, Konig, & Singer, 1990; Eckhorn, Frien, Bauer, Woelbern, & Kehr, 1993) supports the direct coupling between peripheral and cortical oscillations. Visual temporal decoding by PLLs could rely on single-cell oscillators in the thalamus or the cortex. Neurons in the lateral geniculate nucleus (LGN) exhibit spontaneous stable oscillations (Ghose & Freeman, 1992) that are disturbed once visual patterns are presented, which is consistent with these oscillators trying to track modulated temporal structures. Neurons in superficial layers of the cortex exhibit intrinsic oscillations during stimulations (Gray & McCormick, 1996). These neurons could probably function as RCOs only after initial sensory or internal preparatory excitation.

4.2.8 Summary. The perceptual mechanisms of the two sensory (tactile and visual) systems that apparently can use PLL circuits involve motion of the sense organs during sensory acquisition. Such movements result in encoding of spatial information in temporal firing patterns, information that can be efficiently decoded by PLL circuits. Hand movements can easily be measured with a resolution higher than that of the tactile receptive fields, while such measurements are difficult for eye movements (Carpenter, 1988). Thus, accurate testing of the PLL's predictions in visual systems is limited. PLLs might be implemented differently in these two systems. For example, tactile RCOs are expected to be cortical, whereas visual ones could be thalamic.
According to the PLL model, the network oscillations observed in the visual and sensorimotor cortices reflect oscillatory activities in either the output or the readout stages of PLL circuits. After the cessation of oscillatory sensory stimulation in both the visual and somatosensory modalities, the brain persists in emitting synchronized oscillations at the same frequency as the stimulus (Narici et al., 1987). This "oscillatory memory" requires a closed-loop operation, at either the cellular or the circuit level, as suggested by the PLL model. Testing of this phenomenon at low frequencies revealed that the best resonating frequencies were 6 and 8 Hz for the somatosensory modality and 10 Hz for the visual modality. It will be interesting to see whether similar phenomena occur with oscillations around 30 Hz in the somatosensory system and 40 to 100 Hz in the visual system.
4.3 Experimental Evidence for Thalamocortical Tactile PLLs. Current physiological and anatomical data are compatible with PLLs being implemented within and across the thalamic ventrobasal nuclei (VB), SI, and SII areas. Neurons that can be considered local oscillators in the SA, PC, and mostly RA ranges exist in the SI and SII areas of primates (see section 3.3.2). The high percentage of posterior SII oscillators and the grouping of oscillators with frequencies that correspond to the three submodalities (see Figure 11) suggest an important role for SII in temporal decoding of textures. In fact, lesions in SII of primates significantly impair tactile texture decoding (Murray & Mishkin, 1984; Carlson, 1990). Furthermore, the direct motor connections of SII to the primary motor cortex (MI) (Jones, 1986; Burton, 1986) would facilitate participation of SII in a basic sensorimotor loop, such as the one described by the AVC loop. The input tactile channels are evidently not fully segregated; sensory information is probably shared by different frequency channels and even between different submodalities. Thus, the decoding details cannot be as simple as described here. However, if channel segregation holds to a certain degree, PLL-like decoding could occur, and in this case the decoding principles outlined in this article should hold. Note that although anatomical continuity within input channels is required for input pathways running up to the cortex and back to the thalamus, physiological consistency of response type is required only up to the thalamus. In fact, the PLL model suggests that a significant code transformation occurs at the thalamocortical level. Thus, the findings that cortical response types are not correlated with peripheral ones (e.g., Tremblay, Ageranioti-Belanger, & Chapman, 1996) are not in conflict with the PLL model. Below are presented data that are consistent with (i.e., can be explained by) the PLL model and data that support the model (i.e., that are more consistent with the PLL model than with other models). Since no other specific mechanism has yet been suggested for texture decoding at the circuit
level, the PLL will be compared with the open-loop model of local oscillators (Ahissar, 1995) and with "non-PLL" mechanisms in general.

4.3.1 Data consistent with PLLs functioning in the tactile thalamocortical system.

• The RA system employs amplitude limiting, uniform receptive fields, and temporal dispersions, all required for efficient thalamic phase detection (see sections 3.1 and 3.3.1).

• The RA pathway exhibits a high degree of phase locking that preserves the temporal information up to the thalamus (see the introduction to section 3).

• The mechanoreceptors and their fibers act as bandpass filters; they emphasize a certain range of the input frequency spectrum (Johansson et al., 1982; Freeman & Johnson, 1982; Goodwin et al., 1989), as required for efficient PLL decoding.

• The reciprocal connections between the thalamus and cortex are modality and somatic specific (Jones, 1986; Doetsch, Standage, Johnston, & Lin, 1988; Hoogland, Welker, & Van der Loos, 1987).

• The circuitry required for the function of thalamocortical ePLLs and iPLLs exists in mammals (Jones, 1986; White & Keller, 1987; Agmon & Connors, 1992).

• Thalamic relay neurons are activated with short EPSPs from afferent projections and long EPSPs from cortical inputs (Deschenes et al., 1984), an arrangement that facilitates thalamic PD operation (see section 3.3.1).

• Oscillating frequencies of SI neurons can be controlled locally (Silva et al., 1991; Amitai, 1994; Ahissar et al., 1996).

• Two successive stimuli to the same location on the skin are not differentiable for delays between 0 and 15–40 ms (Rosner, 1961), consistent with a PLL-like mechanism that "samples" the input using RA-range frequencies.

• Talbot et al. (1968) suggested the existence of a central mechanism that "alters its own activity [which 'measures'] the dominant period in the input train of impulses." The PLL, by altering its own activity (the RCO's frequency), can "measure" the dominant input period.

4.3.2 Data that support PLLs in the tactile thalamocortical system.

• Local oscillators in SI of monkeys (Lebedev & Nelson, 1995) and of anesthetized rats and guinea pigs (Ahissar et al., 1996) can be entrained by oscillatory tactile stimuli when the input frequency is close to the local frequency.
• When vibratory stimuli are applied within series of decreasing intensities, minimal detection thresholds are consistently slightly lower than during series of increasing intensities (Talbot et al., 1968). Whereas this result cannot be explained by neural mechanisms involving adaptation or habituation, it is easily explained by PLL-like mechanisms: A minimal input intensity is required to lock in the PLL (increasing series), but once the PLL is locked (decreasing series), less input intensity is necessary to keep it locked, since the local oscillators already fire in phase with the input.

• A qualitative coding transformation, from temporally oriented at the thalamus (Sinclair, Sathian, & Burton, 1991) to rate oriented at the cortex (Sinclair & Burton, 1991), appears to occur at the thalamocortical level of monkeys performing a texture discrimination task. The gradual nature of the cortical responses is more consistent with the PLL than with alternative open-loop mechanisms producing labeled-line coding (Ahissar, 1995).

4.3.3 Data consistent with inhibitory PLLs.

• There is evidence "that presumed inhibitory interneurons in the cat SI could be activated first by thalamic inputs among cortical neurons and set to inhibit the output cells" (Yamamoto, Samejima, & Oka, 1988, p. 199).

• Activities of local oscillators in SI of the behaving monkey are often inhibited by vibrotactile stimuli (Lebedev & Nelson, 1995).

• Local oscillators in SI of the rat receive strong inhibitory input (Chagnac-Amitai & Connors, 1989).

• With SII neurons of cats, firing in phase with a vibratory stimulus is impaired when GABA receptors are blocked (Alloway, Sinclair, & Burton, 1988). This is consistent with cortical phase locking being achieved by inhibitory PLL circuits.

• In response to thalamic (VB) stimulation, corticothalamic neurons in SI of cats exhibit inhibitory postsynaptic potentials (Landry & Dykes, 1985). Synaptic excitation is also observed in some of these neurons, which suggests a combination of ePLLs and iPLLs.

• Somatosensory cortical neurons of rats have been classified according to whether they are coactivated with fast (∼20 Hz) electroencephalogram waves (CoE cells) or not (CoI cells) (Angel, 1983). CoE neurons exhibit rhythmic firing around 20 Hz, dominate the electroencephalogram when CoI neurons are quiet, respond to peripheral inputs with longer latencies than CoI neurons, and activate thalamic (reticular) cells with a shorter latency than CoI neurons do. All of these
phenomena are consistent with CoE functioning as RCO neurons and with CoI functioning as inhibitory interneurons (INH).

• About 25% of the SI neurons of the monkey exhibit a sharp, positive sigmoidal dependency on the spatial period (Sinclair & Burton, 1991; see also Darian-Smith et al., 1982), as expected for iPLLs (see Figure 6). Negative sigmoidal dependency, as predicted by ePLLs, was not observed.

4.3.4 Data that seem inconsistent with PLLs in tactile thalamocortical systems.

• Although the primary function of the proposed PLLs in tactile thalamocortical systems would be in perception of patterns, these PLLs should, using only temporal input information, be able to convey information about the roughness of the scanned texture. However, in some cases, estimation of roughness is independent of the temporal parameters of the peripheral input (Lederman, 1981). This would imply that even if PLL circuits exist in the somatosensory system, their contribution to the perception of roughness is negligible. However, so far, only a partial range of possible conditions has been studied—for example, only relatively high spatial frequencies (groove widths of 0.175–1 mm; Lederman, 1981). Also, the involvement of temporal information was tested only after subjects were trained to estimate the roughness of different textures at different velocities. Since the nervous system is capable of developing perceptual constancies over many parameters, one of which is probably finger velocity, naive rather than trained subjects should have been used. When naive subjects are required to identify forms or discriminate gratings, perception indeed depends on finger velocity (Vega-Bermudez, Johnson, & Hsiao, 1991; Ahissar & Gamzu, 1995). Furthermore, during training with a difficult discrimination task, subjects developed scanning strategies that were based on maximizing differences between temporal frequencies by controlling the scanning hand velocities (Gamzu, Haidarliu, & Ahissar, 1994).

• As the spatial frequency of the stimulus decreases, the SA and RA mechanoreceptive fibers in the hand of the monkey fire more spikes per second, even if the peak temporal frequency of the stimulus is kept constant (Goodwin & Morley, 1987). This seems to contradict a basic assumption of the tactile PLL that peripheral firing depicts in a 1:1 manner the existence of texture ridges. These experiments were conducted with indentations (1 mm) well above the peripheral threshold (tens of microns; Talbot et al., 1968), which probably forced the peripheral fibers to function outside their plateau range (see Goodwin et al., 1989). Nevertheless, the spatial features were still represented by the peripheral temporal structure, though with a 1:n ratio (Morley
& Goodwin, 1987; Goodwin et al., 1989). In principle, PLLs can decode 1:n input ratios, and the decoding efficiency depends on the parameters. For example, if the bursts increase the input spread (Twi; see Figure 4) beyond Ti/2, performance could be impaired. However, if they bring Twi closer to Ti/2, performance should improve. Since the length of these bursts increases as the spatial frequency decreases (Goodwin et al., 1989), they might indeed improve the efficiency of the putative PLLs. Note, however, that the peripheral burst lengths are not necessarily preserved at the outputs of the dorsal column nuclei.

• The PLL model for the tactile thalamocortical system predicts that the temporal information of the input will be represented by the firing rates of cortical populations. In contrast, Burton and Sinclair (1994) concluded that the cortex probably encodes spatial features of the surface independent of temporal factors. However, since only correlations between average values of cortical rates and hand velocities were computed, velocity was not systematically varied, and "velocity was not well controlled" (Sinclair & Burton, 1991, p. 165), these results cannot rule out representations of input temporal structures in cortical rates. Indeed, Chapman and colleagues recently found, applying systematic variations of input velocities, that the firing rates of 66% of SI neurons are directly related to the stimulus velocity (Tremblay et al., 1996).

• Johnson and Lamb (1981) argued that the temporal dispersion caused by a wide distribution of conduction velocities of mechanoreceptive fibers, especially those of RA fibers, can contribute to a spatial dispersion (blurring) of the image of the scanned texture. This would indeed be the case with a central mechanism that blindly integrates input signals from all input fibers. However, a PLL-like mechanism can actually benefit from such dispersions, which convert input "click trains" to lemniscal square waves (see Figure 4c) and enable the PD operation (see sections 2.2.1 and 3.3.1).

• Connor and Johnson (1992) have compared spatial and temporal encoding schemes as possible candidates to underlie tactile roughness estimation and showed that spatial variations have the closest correlation with roughness estimations. However, in some spacing ranges, temporal variations are better than spatial variations in predicting the subjects' reports (compare Figures 9 and 10 in Connor & Johnson, 1992). A view consistent with this and other (e.g., Ahissar & Gamzu, 1995) studies is that different perceptual mechanisms are emphasized as a function of the task at hand and the range of the spatial frequencies being explored. Roughness estimation tasks and low-spatial-frequency pattern discrimination tasks are probably primarily dealt with by spatial mechanisms (Connor & Johnson, 1992, and Ahissar & Gamzu,
1995, respectively), while high-spatial-frequency pattern discrimination tasks are primarily dealt with by temporal mechanisms (Ahissar & Gamzu, 1995), such as the PLL. This is consistent with the finding that superimposed vibrations improve stationary grating resolution in a range of high spatial frequencies and decrease performance in a range of lower frequencies (Johnson & Phillips, 1981).

4.3.5 Summary of experimental evidence. Experimental data indicate that under certain conditions, operation of a PLL-like mechanism is feasible at the thalamocortical level of mammalian tactile systems. In this system, current data favor the existence of iPLLs over ePLLs, although combined operation of both implementations has been indicated. If such PLLs exist, they should function in parallel to other, nontemporal, decoding mechanisms.

4.4 Interactions with Other Tactile Mechanisms. If PLL circuits do exist in the brain, it is likely that they do not exist as isolated circuits but rather operate in parallel with other temporal and nontemporal decoding mechanisms. In fact, Johnson, Phillips, and colleagues have shown that some spatial features are most efficiently resolved by the SA system (Johnson & Lamb, 1981; Phillips & Johnson, 1981; Phillips et al., 1990), and this resolution appears to occur without significant dependency on temporal parameters (Phillips, Johnson, & Hsiao, 1988; Phillips, Johansson, & Johnson, 1992). They suggested that both "spatial" (involving the SA system) and "nonspatial" (involving the RA system) mechanisms underlie texture perception; the RA system probably encodes the microscopic dimensions and the SA the macroscopic dimensions of the texture (Johnson & Phillips, 1984). Similarly, I suggest that PLL circuits are embedded in and intermingled with other circuits and that, as a whole, these circuits function as a texture analyzer (Taylor, Lederman, & Gibson, 1973). Within such embedded and intermingled circuits, operations that obey PLL principles should occur in parallel to other operations that obey other principles. Other possible operations could be purely spatial, such as spatial variation detection (Connor & Johnson, 1992), or spatiotemporal, such as cross-coincidence detection among parallel phase-preserved input signals. In real time, the adaptive brain can emphasize one or another operation, according to the task at hand and previous experience. Thus, PLL circuits, which probably occur predominantly in the RA system, can decode temporal information related to the pattern—to the fine details of the surface—while spatial mechanisms (e.g., Bankman, Hsiao, & Johnson, 1990), which predominantly use the SA system, can decode rate-encoded information related to the macroscopic details (e.g., the roughness or shape) of textures. In addition, SA-based intensity mechanisms can refine pattern perception by using detailed spatial information, and PLL circuits can refine roughness perception by using fine temporal information.
4.5 Predictions of the Tactile PLL Model. The predictions derived directly from the algorithm are considered critical; a rejection of any one of them results in a rejection of the model or, at least, a major modification of it. A rejection of an implementation-specific prediction results in the rejection of only that specific implementation. The electrophysiological predictions require a distinction between two types of neurons: RCO neurons, which are local oscillators, and PD (or PD-driven) neurons, whose oscillating activity is externally driven. A partial list of both types of predictions follows. The implementation-specific predictions are detailed only for the AND-like, vector PDs (see section 2.2.1.2). For these implementations, the thalamocortical relay neurons can function as PDs only if the lemniscal input is subthreshold; therefore, the related predictions apply only for light touch, such as that used for texture discrimination.

Algorithm-Derived ("critical") Predictions.

AVC predictions.

a1. During a difficult identification of a patterned texture, the exploring velocities are expected to maintain the average temporal frequency of the input within one of the three ranges that correspond to the trimodal distribution of cortical oscillating frequencies (see Figure 11), with the RA range being preferred.

PLL predictions.

a2. RCO neurons are expected to track, within a range around their spontaneous oscillating frequency, variations in the frequency of a vibratory stimulus.

a3. When a periodic stimulus is applied at a frequency that matches the frequency of the RCO, the PD neurons are expected to be phase locked with both the stimulus and the RCO neurons, and, during phase locking, the spikes of the PD neurons should usually (and in AND-like implementations always) lag those of the RCO neurons.

a4. When the loop is locked, the net excitatory input to the RCO should be a monotonic increasing function of the input frequency. This is because in order to follow a higher frequency, a neuronal RCO needs to be excited further.

a5. While the PLL is locked, as the frequency of the stimulus is increased, the delay between the input and the RCO neurons, (ηo − ηi), becomes more positive (see Figure 4d).

a6. Within the PLL's working range, the response of the PD population should be monotonic with the input frequency. The polarity of this relationship depends on the implementation (see Figure 3) and the measurement point (e.g., before or after an inhibitory stage).
Implementation-Specific ("Noncritical") Predictions for AND-Like Vector PDs.

Thalamocortical Implementations.

t1. Within groups (or "rods"; Jones, 1986) of thalamocortical relay neurons that share the same receptor modality, receptive field location, and cortical projecting area, different neurons will have different response latencies (phase shifts) that preferentially cover a range of several ms.

Excitatory PLLs (ePLLs).

e1. Usually an abrupt and strong peripheral stimulus should increase the instantaneous frequency of an RCO.

e2. The higher the input frequency, the higher the fraction of PD neurons that should respond to the input.

e3. Entrainment of single PD neurons should exhibit a steplike or sigmoidal dependency on input frequency. They should not be entrained to low frequencies and should start to respond once the input frequency becomes higher than a certain threshold.

e4. At low input frequencies, only PD neurons that respond with short latencies should respond. As the input frequency increases, additional PD neurons, which have incrementally longer response latencies, should be recruited. Thus, although all PD neurons can maintain phase locking at high input frequencies, PD neurons with shorter response latencies should maintain phase locking to stimuli of lower frequencies.

Inhibitory PLLs (iPLLs).

i1. Usually an abrupt and strong peripheral stimulus should decrease the instantaneous frequency of an RCO.

i2. The higher the input frequency, the lower the fraction of PD neurons that should respond to the input.

i3. Entrainment of single PD neurons should exhibit a steplike or sigmoidal dependency on input frequency. They should be entrained to low frequencies and should stop responding once the input frequency becomes higher than a certain threshold.

i4. At high input frequencies, only PD neurons that respond with long latencies should respond. As the input frequency decreases, additional PD neurons, which have decrementally shorter response latencies, should be recruited. Thus, although all PD neurons can maintain phase locking at low input frequencies, a PD neuron with a longer response latency should maintain phase locking to stimuli of higher frequencies.
Appendix

A.1 Temporally- and Rate-Encoded Neuronal Signals. Any spike train that consists of N spikes of a single neuron can be described as (see Figure 1a)

s(t) = Σ_{n=0}^{N−1} S(t − η(n)),   (A.1)

where S(t′) describes a single spike triggered at t′ = 0 (see Figure 1a, inset), and η(n) describes the series of spike timings,

η(n) = η(0) + nT + M(n) = η(0) + nT + Σ_{j=1}^{n} m(j),   n > 0,   (A.2)

where T is the average ISI; M(n) is the "absolute" modulation of T for the nth spike and represents the deviation of the timing of the nth spike from the expected timing of the nth spike of the equivalent ideal oscillator having the same T; and m(n) is the "cycle modulation" and represents the deviation of the nth ISI from T. By definition, the total modulation over the whole spike train should be zero:

M(N) = Σ_{j=1}^{N} m(j) = 0.
For an ideal oscillator, M(n) = m(n) = 0 for every n. The instantaneous ISI is (see Figure 1a):

I(n) = η(n) − η(n − 1) = T + m(n),   n > 0,   (A.3)

η(n) = η(0) + Σ_{j=1}^{n} I(j),   n > 0.   (A.4)
It is assumed, as a convention, that the spike train was not modulated prior to n = 0; therefore:

I(0) = T;   m(0) = 0;   m(j) = 0,   j < 0.   (A.5)
The information carried by the spike train is described by T and m. Generally the information carried by T and m could be referred to as rate-encoded and temporally-encoded, respectively, since T is a measure of the average firing rate over the whole period and m is a measure of the fine temporal modulations within that period.
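The decomposition in equations A.1 through A.5 can be checked numerically. The following sketch (Python; the variable names and parameter values are illustrative choices, not taken from the article) builds spike timings from a chosen T and cycle modulations m(n), and then recovers both components from the timings alone:

```python
import numpy as np

# Sketch of the spike-train decomposition in equations A.1-A.5
# (variable names and values are illustrative, not from the article).
rng = np.random.default_rng(0)

T, N = 25.0, 40                    # average ISI (ms) and number of spikes
m = np.zeros(N)                    # cycle modulations; m(0) = 0 by convention (A.5)
m[1:] = rng.normal(0.0, 1.0, N - 1)
m[1:] -= m[1:].mean()              # total modulation M(N) = sum of m(j) = 0

eta = T * np.arange(N) + np.cumsum(m)   # eta(n) = eta(0) + n*T + M(n), with eta(0) = 0

# Recover both components from the timings alone:
I = np.diff(eta)                   # instantaneous ISIs, I(n) = T + m(n)  (A.3)
T_hat = I.mean()                   # rate-encoded component
m_hat = I - T_hat                  # temporally encoded component
print(np.isclose(T_hat, T), np.allclose(m_hat, m[1:]))   # True True
```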
A description of a signal by rate requires the division of a spike train into rate bins, with each rate bin being represented by a single number. Each of these single numbers can be evaluated by a variety of functions, ranging from a simple spike count to a weighted average that uses a postsynaptic filter function. Here, rate-encoded signals will be represented by simple spike counts over each rate bin. If other measurements are required, the spike count terms (e.g., A(k) in equation A.6) should simply be replaced with other terms. Thus, a spike train can be described by a series of counts of spikes, where each count corresponds to a single rate bin (see Figure 1b):

Rx(t) = Σ_{k=0}^{Nr−1} A(k) R(t − kTr),   (A.6)
where Tr is the rate bin, R(t′) is a pulse function that equals 1 for 0 ≤ t′ < Tr and 0 otherwise (see Figure 1b, inset), A(k) is the spike count of the neuron at the kth rate bin, and Nr is the number of rate bins in the spike train. For simplicity, negative "firing rates" will be ascribed to inhibitory inputs. Thus, a rate signal is defined as the difference between the count of spikes leading to EPSPs and the count of spikes leading to IPSPs. For example, a single excitatory cell can produce only positive rate values, and a single inhibitory cell can produce only negative rate values. Two kinds of population rate coding are considered here: population sum, which, per each rate bin k, is the sum of all Ai(k), and population vector, which, per each rate bin k, is the array of all Ai(k).

A.2 Phase-Locked Loop.

A.2.1 Rate-controlled oscillator. The RCO's output signal is:

so(t) = Σ_{n=0}^{N−1} S(t − ηo(n)),   (A.7)

where

ηo(n) = ηo(0) + nTc + Mc(n).   (A.8)
Tc is the RCO's intrinsic period—its ISI when it receives no input—and Mc(n) is the nth spike's absolute modulation (see equations A.2 and A.3 for other related definitions). The ISI (the "cycle") of the RCO is controlled by its input in the following way:

Io(n) = Tc + go(Rd(n)),   (A.9)
where go , in the general case, is a monotonic decreasing or a monotonic increasing function, go (0) = 0, and Rd (n) is the input to the RCO integrated over the interval preceding spike n, during Io (n) (see Figure 1 and
section 2.2.2). In neuronal implementations, go will probably always be a decreasing function in which the more positive (excitatory) the oscillator's input is, the sooner the oscillator will fire its next spike, and vice versa for more negative (inhibitory) inputs. The average value of Rd(n) is not necessarily, and usually will not be, 0. Therefore, Tc will not necessarily equal the average ISI of the RCO. To be consistent with equation A.3, for any given decoding period, Rd(n) will be described as being composed of two components: a DC component (the average value, Rdc) and an AC component (the residual modulations, Rac(n)):

Rd(n) = Rdc + Rac(n),   (A.10)
and the average ISI of the RCO, To, will be:

To = Tc + ⟨go(Rd(n))⟩,   (A.11)
where ⟨x⟩ is the average value of x over the described decoding period. Thus, the output timings of the RCO can be rewritten as:

ηo(n) = ηo(0) + nTo + Mo(n) = ηo(0) + nTo + Σ_{j=1}^{n} mo(j),   n > 0.   (A.12)
For a linear go we get:

⟨go(Rd(n))⟩ = go(Rdc),   (A.13)

and the instantaneous ISI is (from equations A.3, A.9, and A.11),

Io(n) = To + go(Rac(n)).   (A.14)
A.2.2 Phase detector. The PD's output, Rd(n), is a rate-encoded signal, which is a function of the difference between the arrival times of the PD's two inputs,

Rd(n + 1) = gd(ηo(n) − ηi(n)),   (A.15)
where gd is a monotonic increasing or a monotonic decreasing function. The difference ηo (n) − ηi (n) is simply the difference between the times of appearance of the nth spikes of the RCO and the PLL’s input, where n is counted only within a locked state, when the RCO’s and the input’s spikes are paired. With neuronal implementations, gd probably cannot achieve a strict monotonic shape but rather will assume a staircase-like form. There will be ranges of phase difference within which the PD will produce a constant output. The size of these ranges determines the PD’s resolution (see section 2.2.1) and, hence, also the PLL’s resolution.
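Equations A.9 and A.15 are sufficient to simulate a minimal PLL. The sketch below is an illustrative Python rendering, not the article's implementation: linear go and gd with opposite slopes are chosen so that the loop gain defined in section A.2.3 below is G = −1, and the loop then locks onto an unmodulated periodic input within one cycle:

```python
import numpy as np

# Minimal discrete-time PLL built from equations A.9 and A.15. The linear
# transfer functions, parameter values, and variable names are illustrative
# choices, not the article's; their opposite slopes give a loop gain
# G = g'o * g'd = -ko * kd = -1, so a perturbation is canceled in one cycle.
Tc = 30.0                                  # RCO intrinsic period (ms)
Ti = 25.0                                  # input period (ms)
kd, ko = 0.5, 2.0                          # slopes of gd (> 0) and of -go

gd = lambda dphase: kd * dphase            # phase detector, equation A.15
go = lambda rd: -ko * rd                   # RCO period control, equation A.9 (decreasing)

eta_i = Ti * np.arange(40)                 # periodic, unmodulated input spike times
eta_o = np.zeros_like(eta_i)               # RCO spike times
for n in range(len(eta_i) - 1):
    Rd = gd(eta_o[n] - eta_i[n])           # Rd(n+1), from the nth spike pair
    eta_o[n + 1] = eta_o[n] + Tc + go(Rd)  # Io(n+1) = Tc + go(Rd(n+1))

print(eta_o - eta_i)                       # settles at delta_phi = Tc - Ti = 5 ms
```

With −1 < G < 0, the same loop converges geometrically over several cycles rather than in one, consistent with the stability range given by equation A.17 below.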
A.2.3 The loop gain. The ability of the PLL to be locked to the input depends on the loop's functioning as a negative feedback loop. In such a negative feedback loop, any deviation of the input from its expected frequency will produce an error signal (Rac) that will drive the RCO's frequency in the direction that will reduce the error—in the same direction as the input's deviation. To provide a negative feedback, the gain along the loop during one cycle, referred to as the loop gain, should be negative. The loop gain, G, is computed at a working point (e.g., the crossing point in Figure 2b), assuming a constant input. It is equal to the gain of a small perturbation from the working point that is obtained after one cycle and is approximately (exactly for a linear system)

G ≈ g′o g′d,   (A.16)

where, with continuous go and gd, g′o and g′d are the derivatives of go and gd, respectively, at the working point. With discrete go or gd, g′x equals Δy/Δx, where Δy is the output change generated by a minimal input change (Δx). The PLL will be stable only if any deviation from a working point, generated within the loop while the input is constant, is attenuated at the next cycle. A perturbation will be canceled at the next cycle if G = −1, will be attenuated if −1 < G < 0, and will be inverted and attenuated if −2 < G < −1. Outside this range, any perturbation will increase in absolute magnitude with each successive cycle. Thus, a necessary, although not sufficient, requirement for a stable PLL is

−2 < G < 0.   (A.17)
Therefore, to keep a PLL stable, gd and go must have opposite slopes around the working point (see, for example, Figure 2b). A range within which input modulations can be decoded is defined as a working range of the PLL. This range is determined by equation A.17, the dynamic range of the RCO, and the input average frequency.

A.3 Tracking. For clarity, let us consider an ideal PLL, in which the derivatives of gd and go are constant (equal to kd and ko, respectively), G = −1, the RCO fires single spikes per cycle, and there is no noise. Suppose the input to the PLL (hereafter "the Input") is:

si(t) = Σ_{n=0}^{N−1} S(t − ηi(n)),   Ii(n) = Ti + mi(n), n > 0,   and mi(n) = 0, n ≤ 0.   (A.18)
When the loop is locked, there is one and only one RCO spike per each Input spike. If the Input is not modulated (i.e., it is perfectly periodic), the timings of the RCO's spikes will differ from the Input spikes only by a constant time delay (phase shift),

ηo(n) = ηi(n) + Δφ,   (A.19)

where Δφ is a constant time difference and n is the index of the Input cycle. And

Io(n) = Ii(n).   (A.20)
When the Input is temporally modulated, the modulation is detected by the PD, which detects the difference between the expected ηi(n) ("stored" as ηo(n)) and the actual timing. The detected difference corrects the RCO's frequency so that the latter matches the Input frequency. However, this correction will take place only at the following cycle (n + 1):

Io(n) = Ii(n − 1)   (A.21)

and, from equation A.3,

To + mo(n) = Ti + mi(n − 1).   (A.22)

By definition, To is the average ISI of the RCO, and as long as the loop is locked, it is equal to the average Input ISI, Ti,

To = Ti   (A.23)

and, therefore,

mo(n) = mi(n − 1).   (A.24)
Thus, the Input modulation is replicated by the RCO's modulation, with one cycle lag. The average periods, To and Ti, are defined de facto for every decoding period. Thus, the PLL does not "know" the exact values for these averages during the decoding, and a decomposition of its output signal into the different components will fit the above definitions only at the end of the decoding period. Note, however, that this non-causal process relates only to the observer's interpretation of the decoding process and does not relate to the process itself, since the decoding utilizes the actual timings [ηo(n) and ηi(n)] of the signals (equations A.9 and A.15).
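The one-cycle-lag replication of equation A.24 can be reproduced with the same ideal loop. In the following sketch (again with illustrative parameters and linear transfer functions, not the article's), the RCO's instantaneous ISIs equal Ti + mi(n − 1) after the first cycle:

```python
import numpy as np

# Tracking a temporally modulated input with the ideal loop (G = -1).
# Illustrative parameters; mi(n) are the Input's cycle modulations.
rng = np.random.default_rng(1)
Tc, Ti, kd, ko, N = 30.0, 25.0, 0.5, 2.0, 200

mi = np.zeros(N)
mi[1:] = rng.normal(0.0, 0.5, N - 1)
eta_i = np.cumsum(np.r_[0.0, Ti + mi[1:]])      # Ii(n) = Ti + mi(n), equation A.18

eta_o = np.zeros(N)
for n in range(N - 1):
    Rd = kd * (eta_o[n] - eta_i[n])             # phase detection, equation A.15
    eta_o[n + 1] = eta_o[n] + Tc - ko * Rd      # RCO update, equation A.9

Io = np.diff(eta_o)
# After the lock-in transient of the first cycle, Io(n) = Ti + mi(n-1):
# the Input modulation is replicated with a one-cycle lag (equation A.24).
print(np.allclose(Io[1:], Ti + mi[1:-1]))       # True
```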
A.4 Decoding. It can be shown that with ideal PLLs,

Rdc = gd(Δφ)   (A.25)

Rac(n + 1) = −gd(mi(n)) = go⁻¹(mi(n))   (A.26)

Δφ = Tc − Ti + φm.   (A.27)

Δφ is the average phase difference and, together with Rdc, determines the PLL's working point (see Figure 2b). φm is a constant delay that depends on the implementation (see section A.5.2). If gd or go is not linear, the AC component will depend on the DC component, that is, on the working point. However, since gd (and go⁻¹) is monotonic, Rd(n) is unique (within the resolution limits) for every input. When G ≠ −1, the above solutions are the steady-state solutions that are obtained after a variable number of cycles, depending on G. Using a minimal rate bin (= Ti), the output of the ideal PLL is (see equation A.6),

Rd(t) = gd(Tc − Ti + φm) − Σ_{n=0}^{N−1} [gd(mi(n − 1)) R(t − nTi)].   (A.28)
Readout mechanisms that employ longer rate bins should sample or integrate the PLL's output.

A.5 PD Implementations.

A.5.1 A single neuron PD. The working range of such a PD is determined by the effective widths of its inputs—the maximal delay from the onset of an EPSP within which, if an EPSP from the other input is added, the membrane voltage will cross the threshold. For similar inputs whose (EPSP amplitude)/(threshold distance) = A,

Tw = τ ln(A/(1 − A)).   (A.29)
Assuming τ = 10 ms is the decay time constant, if A = 0.8, the working range (Tw) is almost 14 ms, and if A = 0.9, it is about 22 ms.
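As a quick check, these two numeric examples can be reproduced directly from equation A.29 (a minimal sketch; τ and A take the values used in the text):

```python
import numpy as np

# Numeric check of equation A.29 with the values used in the text.
tau = 10.0                                  # EPSP decay time constant (ms)
for A in (0.8, 0.9):                        # (EPSP amplitude)/(threshold distance)
    Tw = tau * np.log(A / (1.0 - A))
    print(f"A = {A}: Tw = {Tw:.1f} ms")     # 13.9 ms and 22.0 ms
```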
A.5.2 Population PDs. Within the working range of the ePLL (Twe; see Figure 4d),

Rd(n + 1) = Rmax + kd[ηo(n) − ηi(n)],   kd > 0.   (A.30)
Since go is monotonic decreasing, G < 0 and the basic algorithm can be implemented straightforwardly by the circuit denoted by the dashed lines in Figure 3a. With the PD implementation of Figure 4d, the average delay is

Δφ = Tc − Ti − Rmax/kd.   (A.31)
Within the working range of the iPLL (Twi; see Figure 4d),

Rd(n + 1) = Rmax − kd[ηo(n) − ηi(n)],   kd > 0.   (A.32)
In this case, G < 0 because of the INH that are added to the loop (see Figure 3a, solid lines). The average delay for the Figure 4d iPLL implementation is

Δφ = Tc − Ti + Rmax/kd.   (A.33)
The PLL's output, in both the excitatory and inhibitory implementations, is a population output.

A.6 Tactile Signals. It is assumed, for simplicity, that for any given scanning direction, all ridges have negligible widths. Textures consisting of a collection of such ridges can be described, along any one-dimensional direction, in a discrete form by

ui(x) = Σ_{n=0}^{N} Ai(n) U[x − ξi(n)],   (A.34)
where U(x′) describes a single ridge at x′ = 0 with a unit height, Ai(n) is the height of the nth ridge, and ξi(n) describes the location of the nth ridge,

ξi(n) = ξi(0) + nXi + Pi(n) = ξi(0) + nXi + Σ_{j=0}^{n} pi(j);   Gi(n) = Xi + pi(n),   (A.35)
where Xi is the average spatial period, Pi(n) is the absolute modulation of this period for the nth ridge, pi(n) is the cycle modulation, and Gi(n) is the inter-ridge interval (see section A.1). If a mechanoreceptor response is 1:1, then for a constant finger velocity, V, the sensory transformation is simply

ηi(n) = ξi(n)/V,   ηi(0) = ξi(0) = 0,   (A.36)
and similar relationships hold for the signals' components:

Ti = Xi/V;   Ii(n) = Gi(n)/V;   mi(n) = pi(n)/V.   (A.37)
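The transformation in equations A.36 and A.37 is a pure rescaling of the spatial signal by the finger velocity. A minimal sketch, assuming an ideal 1:1 mechanoreceptor response and illustrative parameter values:

```python
import numpy as np

# Sketch of the sensory transformation of equations A.35-A.37: ridge
# locations scanned at a constant velocity map 1:1 onto spike timings.
# Parameter values are illustrative, not from the article.
rng = np.random.default_rng(2)

Xi, V, N = 0.5, 20.0, 30          # spatial period (mm), velocity (mm/s), ridges
p = np.zeros(N)
p[1:] = rng.normal(0.0, 0.02, N - 1)          # cycle modulations p_i(n)
xi = np.cumsum(np.r_[0.0, Xi + p[1:]])        # ridge locations, G_i(n) = Xi + p_i(n)

eta_i = xi / V                    # spike timings, eta_i(n) = xi(n)/V  (A.36)
Ti = Xi / V                       # temporal period, Ti = Xi/V  (A.37)
mi = np.diff(eta_i) - Ti          # temporal modulations
print(np.allclose(mi, p[1:] / V))             # mi(n) = pi(n)/V  (A.37): True
```

Under this mapping, any spatial modulation pi(n) appears as a temporal modulation mi(n) scaled by 1/V, which is what the AVC loop of section A.8 exploits by controlling V.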
A.7 Tactile Decoding. With ideal PLLs, minimal rate bins (= Ti), and constant finger velocity, the two output components (see equations A.28 and A.37) are

Rdc(t) = gd(Tc − Xi(t)/V(t) + φm)   (A.38)

Rac(t) = −Σ_{n=0}^{N−1} [gd(pi(n − 1)/V(t)) R(t − nTi)],   (A.39)
where φm is an implementation-specific delay.

A.8 Automatic Velocity Control. Suppose the desired working point is Tcw; then the requirement is that

Xi(t)/V(t) = Tcw,   dx = V(t) dt,   (A.40)

and

V(t) = Xi(t)/Tcw   (A.41)
is the finger velocity that the AVC should produce. The AVC is a servo-regulating loop (see Figure 10) whose controlled variable, Ti(t), should be kept constant. The loop equations are:

V(t) = gv(Rmax − Rdc(t)) + Vother   (A.42)

Rdc(t) = gpll(Ti(t) − Tc),   (A.43)
where gpll and gv are the PLL's and VC's transfer functions, respectively (see Figure 10b), Rmax is the maximal possible Rdc(t), Vother is the additive velocity component caused by the "other velocity control," and Tc is the PLL's intrinsic period (see appendix A.2).

Acknowledgments

I thank S. Serulnik and M. Zacksenhouse for extensive and illuminating discussions on phase-locked loops; A. Aertsen, M. Ahissar, D. Blake, T. Flash, K. O. Johnson, D. Shoham, A. Treves, S. Ullman, and several anonymous referees for their helpful comments on previous versions of this article; and B. Schick for reviewing the manuscript. This work was supported by the Alon Foundation, Israel; the Minna-James-Heineman Foundation, Germany; and grant 93-198 from the United States–Israel Binational Science Foundation, Jerusalem, Israel.
Note added in proof: Recently we found that the behavior of cortical oscillators in the barrel cortices of anesthetized rats and guinea pigs confirms predictions a2 and a5. Also, the behavior of multi-units at the thalamic recipient layers of these cortices is consistent with predictions a5 (under the assumption of AND-like PD operation) and a6 (Ahissar, E., Haidarliu, S., & Zacksenhouse, M. (1997) “Decoding temporally encoded sensory input by cortical oscillations and thalamic phase comparators.” Proc. Natl. Acad. Sci. USA. 94:11633–11638). Note also that the results of Nicolelis et al. (1995) confirm prediction a3. References Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92. Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629– 1638. Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica-D, 75, 103–128. Agmon, A., & Connors, B. W. (1992). Correlation between intrinsic firing patterns and thalamocortical synaptic responses of neurons in mouse barrel cortex. J. Neurosci., 12, 319–329. Ahissar, E. (1995). Conversion from temporal-coding to rate-coding by neuronal phaselocked loops (Tech. Rep. GC-EA/95-4). Rehovot, Israel: Weizmann Institute of Science. Ahissar, M., Ahissar, E., Bergman, H., & Vaadia, E. (1992). Encoding of sound source and movement: The activity of single neurons and interactions between adjacent neurons in the primary auditory cortex of monkeys. J. Neurophysiol., 67, 203–215. Ahissar, E., Alkon, G., Zacksenhouse, M., & Haidarliu, S. (1996). Cortical somatosensory oscillators and the decoding of vibrissal touch. Soc. Neurosci. Abstr., 22, 18. Ahissar, E., & Gamzu, E. (1995). Utilization of temporally-encoded versus spatially-encoded information during the performance of a tactile discrimination task. Soc. Neurosci. Abstr., 21, 1018. Ahissar, E., & Vaadia, E. (1990). Oscillatory activity of single units in a somatosensory cortex of an awake monkey and their possible role in texture analysis. Proc. Natl. Acad. Sci. USA, 87, 8935–8939. Alloway, K. D., Sinclair, R. J., & Burton, H. (1988). Responses of neurons in somatosensory cortical area II of cats to high-frequency vibratory stimuli during iontophoresis of a GABA antagonist and glutamate. Somatosens. Mot. Res. 6(2), 109–140. Alonso, A., & Llinas, R. R. (1989). Subthreshold Na+ -dependent theta-like rhythmicity in stellate cells of entorhinal cortex layer II. Nature, 342, 175–177. Amitai, Y. (1994). Membrane potential oscillations underlying firing patterns in
neocortical neurons. Neuroscience, 63, 151–161. Angel, A. (1983). The functional interrelations between the somatosensory cortex and the thalamic reticular nucleus: Their role in the control of information transfer across the specific somatosensory thalamic relay nucleus. In G. Macchi, A. Rustioni, and R. Spreafico (Eds.), Somatosensory integration in the thalamus (pp. 222–239). Amsterdam: Elsevier. Bankman, I. N., Hsiao, S. S., & Johnson, K. O. (1990). Neural image transformation in the somatosensory system of the monkey: Comparison of neurophysiological observations with responses in a neural network model. In Cold Spring Harbor Symposia on Quantitative Biology (Vol. 55, pp. 611–620). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press. Berkley, M. (1978). Vision: Geniculocortical system. In R. B. Masterton (Ed.), Handbook of behavioral neurobiology, Vol 1: Sensory integration (pp. 165–207). New York: Plenum Press. Boussaoud, D., & Wise, S. P. (1993). Primate frontal cortex: Effects of stimulus and movement. Exp. Brain Res., 95, 28–40. Buanomuno, D., & Merzenich, M. M. (1995). Temporal information transformation into a spatial code by a neural network with realistic properties. Science, 267, 1028–1030. Burton, H. (1986). Second somatosensory cortex and related areas. In E. G. Jones and A. Peters (Eds.), Cerebral cortex (Vol. 5, pp. 31–98). New York: Plenum Press. Burton, H., & Sinclair, R. J. (1991). Second somatosensory cortical area in Macaque monkeys: 2. Neuronal responses to punctate vibrotactile stimulation of glabrous skin on the hand. Brain Res., 538, 127–135. Burton, H., & Sinclair, R. J. (1994). Representation of tactile roughness in thalamus and somatosensory cortex. Can. J. Physiol. Pharmacol., 72, 546–557. Calvin, W. H. (1975). Generation of spike trains in CNS neurons. Brain Res., 84, 1–22. Carlson, M. (1990). The role of somatic sensory cortex in tactile discrimination in primates. In E. G. Jones and A. Peters (Eds.), Cerebral Cortex (Vol. 8B, pp. 451– 486). New York: Plenum Press. Carpenter, R. H. S. (1988). Movements of the eyes. 2nd ed. London: Pion. Carr, C. E. (1993). Processing of temporal information in the brain. Annu. Rev. Neurosci., 16, 223–243. Carvell, G. E., & Simons, D. J. (1995). Task- and subject-related differences in sensorimotor behavior during active touch. Somatosens. Mot. Res., 12, 1–9. Chagnac-Amitai, Y., & Connors, B. W. (1989). Synchronized excitation and inhibition driven by intrinsically bursting neurons in neocortex. J. Neurophysiol., 62, 1149–1162. Chapman, C. E. (1994). Active versus passive touch: Factors influencing the transmission of somatosensory signals to primary somatosensory cortex. Can. J. Physiol. Pharmacol., 72, 558–570. Connor, C. E., & Johnson, K. O. (1992). Neural coding of tactile texture: Comparison of spatial and temporal mechanisms for roughness perception. J. Neurosci., 12(9), 3414–3426.
Darian-Smith, I., & Kenins, P. (1980). Innervation density of mechanoreceptive fibres supplying glabrous skin of the monkey’s index finger. J. Physiol., 309, 147–155. Darian-Smith, I., & Oke, L. E. (1980). Peripheral neural representation of the spatial frequency of a grating moving at different velocities across the monkey’s finger pad. J. Physiol., 309, 117–133. Darian-Smith, I., Sugitani, M., Heywood, J., Karita, K., & Goodwin, A. (1982). Touching textured surfaces: Cells in somatosensory cortex respond both to finger movement and to surface features. Science, 218, 906–909. Deschenes, M., Paradis, M., Roy, J. P., & Steriade, M. (1984). Electrophysiology of neurons of lateral thalamic nuclei in cat: Resting properties and burst discharges. J. Neurophysiol., 51, 1196–1219. Doetsch, G. S., Standage, G. P., Johnston, K. W., & Lin, C. S. (1988). Thalamic connections of two functional subdivisions of the somatosensory forepaw cerebral cortex of the raccoon. J. Neurosci., 8(6), 1873–1886. Dykes, R. W. (1983). Parallel processing of somatosensory information: A theory. Brain Res. Rev., 6, 47–115. Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Prog. Brain Res., 102, 405–426. Eckhorn, R., Frien, A., Bauer, R., Woelbern, T., & Kehr, H. (1993). High frequency (60–90 Hz) oscillations in primary visual cortex of awake monkey. NeuroReport, 4, 243–246. Engel, A. K., Konig, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends. Neurosci., 15, 218–226. Ferrington, D. G., & Rowe, M. (1980). Differential contributions to coding of cutaneous vibratory information by cortical somatosensory areas I and II. J. Neurophysiol., 43, 310–331. Fetz, E. E. (1993). Cortical mechanisms controlling limb movement. Curr. Opin. Neurobiol., 3, 932–939. Freeman, A. W., & Johnson, K. O. (1982). Cutaneous mechanoreceptors in Macaque monkey: Temporal discharge patterns evoked by vibration, and a receptor model. J. Physiol., 323, 21–41. Freeman, W. J. (1975). Mass action in the nervous system. New York: Academic Press. Gamzu, E., Haidarliu, S., & Ahissar, E. (1994). Sensorimotor control: Dependence of finger velocity on the scanned spatial frequency during performance and learning of a tactile task. Isr. Soc. Neurosci., 3, 57. Gardner, F. M. (1979). Phaselock techniques. New York: Wiley. Gardner, E. P., & Palmer, C. I. (1989). Simulation of motion on the skin. I. Receptive fields and temporal frequency coding by cutaneous mechanoreceptors of OPTACON pulses delivered to the hand. J. Neurophysiol., 62, 1410–1435. Gardner, E. P., Palmer, C. I., Hamalainen, H. A., & Warren, S. (1992). Simulation of motion on the skin. V. Effect of stimulus temporal frequency on the representation of moving bar patterns in primary somatosensory cortex of monkeys. J. Neurophysiol., 67, 37–63.
Georgopoulos, A. P. (1986). On reaching. Annu. Rev. Neurosci., 9, 147–170. Ghose, G. M., & Freeman, R. D. (1992). Oscillatory discharge in the visual system: Does it have a functional role? J. Neurophysiol., 68, 1558–1574. Goldberg, J. M., & Brown, P. B. (1969). Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: Some physiological mechanisms of sound localization. J. Neurophysiol., 32, 613–636. Goodwin, A. W., John, K. T., Sathian, K., & Darian-Smith, I. (1989). Spatial and temporal factors determining afferent fibre responses to a grating moving sinusoidally over the monkey’s fingerpad. J. Neurosci., 9(4), 1280–1293. Goodwin, A. W., & Morley, J. W. (1987). Sinusoidal movement of a grating across the monkey’s fingerpad: Effect of contact angle and force of the grating on afferent fiber responses. J. Neurosci., 7(7), 2192–2202. Gottschaldt, K.-M., Vahle-Hinz, C., & Hicks, T. P. (1983). Electrophysiological and micropharmacological studies on mechanisms of input-output transformation in single neurones of the somatosensory thalamus. In G. Macchi, A. Rustioni, & R. Spreafico (Eds.), Somatosensory integration in the thalamus (pp. 199–216). Amsterdam: Elsevier. Gray, C. M., Engel, A. K., Konig, P., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex: Receptive field properties and feature dependence. Eur. J. Neurosci., 2, 607–619. Gray, C. M., & McCormick, D. A. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109–113. Hoogland, P. V., Welker, E., & Van der Loos, H. (1987). Organization of the projections from barrel cortex to thalamus in mice studied with Phaseolus vulgaris-leucoagglutinin and HRP. Exp. Brain Res., 68, 73–87. Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36. Hoppensteadt, F. C. (1986). An introduction to the mathematics of neurons. Cambridge: Cambridge University Press. Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psychol., 41, 35–39. Johansson, R. S., Landstrom, U., & Lundstrom, R. (1982). Responses of mechanoreceptive afferent units in the glabrous skin of the human hand to sinusoidal skin displacements. Brain Res., 244, 17–25. Johansson, R. S., & Vallbo, A. B. (1979). Tactile sensibility in the human hand: Relative and absolute densities of four types of mechanoreceptive units in glabrous skin. J. Physiol., 286, 283–300. Johansson, R. S., & Vallbo, A. B. (1980). Spatial properties of the population of mechanoreceptive units in the glabrous skin of the human hand. Brain Res., 184, 353–366. Johansson, R. S., & Vallbo, A. B. (1983). Tactile sensory coding in the glabrous skin of the human hand. Trends. Neurosci., 6, 27–32. Johnson, K. O., & Lamb, G. D. (1981). Neural mechanisms of spatial tactile discrimination: Neural patterns evoked by braille-like dot patterns in the monkey. J. Physiol., 310, 117–144. Johnson, K. O., & Phillips, J. R. (1981). Tactile spatial resolution. I. Two-point
discrimination, gap detection, grating resolution, and letter recognition. J. Neurophysiol., 46, 1177–1191. Johnson, K. O., & Phillips, J. R. (1984). Spatial and nonspatial neural mechanisms underlying tactile spatial discrimination. In C. von Euler, O. Franzen, V. Lindblom, and D. Ottoson (Eds.), Somatosensory mechanisms (pp. 237–248). London: Macmillan. Jones, E. G. (1986). Connectivity of the primate sensory-motor cortex. In E. G. Jones & A. Peters (Eds.), Cerebral cortex (Vol. 5, pp. 113–184). New York: Plenum Press. Lamb, G. D. (1983). Tactile discrimination of textured surfaces: Psychophysical performance measurements in humans. J. Physiol., 338, 551–565. Landry, P., & Dykes, R. W. (1985). Identification of two populations of corticothalamic neurons in cat primary somatosensory cortex. Exp. Brain Res., 60, 289–298. Lebedev, M. A., & Nelson, R. J. (1995). Rhythmically firing (20–50 Hz) neurons in monkey primary somatosensory cortex: Activity patterns during initiation of vibratory-cued hand movements. J. Comp. Neurosci., 2, 313–334. Lederman, S. J. (1981). The perception of surface roughness by active and passive touch. Bulletin of the Psychonomic Society, 18, 253–255. Llinas, R. R., Grace, A. A., & Yarom, Y. (1991). In vitro neurons in mammalian cortical layer 4 exhibit intrinsic oscillatory activity in the 10- to 50-Hz frequency range. Proc. Natl. Acad. Sci. USA, 88, 897–901. Middlebrooks, J. C., & Green, D. M. (1991). Sound localization by human listeners. Annu. Rev. Psychol., 42, 135–159. Morley, J. W., & Goodwin, A. W. (1987). Sinusoidal movement of a grating across the monkey’s fingerpad: Temporal patterns of afferent fiber responses. J. Neurosci., 7, 2181–2191. Mountcastle, V. B. (1993). Temporal-order determinants in a somatesthetic frequency discrimination—Sequential order coding. Ann. N.Y. Acad. Sci., 682, 150–170. Mountcastle, V. B., Talbot, W. H., Sakata, H., & Hyv¨arinen, J. (1969). Cortical neuronal mechanisms in flutter-vibration studied in unanesthetized monkeys: Neuronal periodicity and frequency discrimination. J. Neurophysiol., 38, 452–484. Murray, E. A., & Mishkin, M. (1984). Relative contributions of SII and area 5 to tactile discrimination in monkeys. Behav. Brain Res., 11, 67–83. Narici, L., Romani, G. L., Salustri, C., Pizzella, V., Modena, I., & Papanicolaou, A. C. (1987). Neuromagnetic evidence of synchronized spontaneous activity in the brain following repetitive sensory stimulation. Int. J. Neurosci., 32, 831– 836. Nelson, R. J. (1996). Interactions between motor commands and somatic perception in sensorimotor cortex. Curr. Opin. Neurobiol., 6, 801–810. Neuenschwander, S., & Singer, W. (1996). Long-range synchronization of oscillatory light responses in the cat retina and lateral geniculate nucleus. Nature, 379, 728–732. Nicolelis, M. A. L., Baccala, L. A., Lin, R. C. S., & Chapin, J. K. (1995). Sensori-
motor encoding by synchronous neural ensemble activity at multiple levels of the somatosensory system. Science, 268, 1353–1358. Perkel, D. H., & Bullock, T. H. (1968). Neural coding. Neurosci. Res. Prog. Bull., 6, 221–248. Perkel, D. H., Schulman, J. H., Bullock, T. H., Moore, G. P., & Segundo, J. P. (1964). Pacemaker neurons: Effects of regularly spaced synaptic input. Science, 145, 61–63. Phillips, J. R., & Johnson, K. O. (1981). Tactile spatial resolution. II. Neural representation of bars, edges, and gratings in monkey primary afferents. J. Neurophysiol., 46, 1192–1203. Phillips, J. R., Johansson, R. S., & Johnson, K. O. (1990). Representation of braille characters in human nerve fibers. Exp. Brain Res., 81, 589–592. Phillips, J. R., Johansson, R. S., & Johnson, K. O. (1992). Responses of human mechanoreceptive afferents to embossed dot arrays scanned across fingerpad skin. J. Neurosci., 12, 827–839. Phillips, J. R., Johnson, K. O., & Hsiao, S. S. (1988). Spatial pattern representation and transformation in monkey somatosensory cortex. Proc. Natl. Acad. Sci. USA, 85, 1317–1321. Richmond, B. J., & Optican, L. M. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. II. Quantification of response waveform. J. Neurophysiol., 57, 147–161. Rosner, B. S. (1961). Neural factors limiting cutaneous spatiotemporal discrimination. In W. A. Rosenblith (Ed.), Sensory communication (pp. 725–737). Cambridge, MA: MIT Press. Sejnowski, T. J. (1995). Pattern recognition: Time for a new neural code? Nature, 376, 21–22. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579. Silva, L. R., Amitai, Y., and Connors, B. W. (1991). Intrinsic oscillations of neocortex generated by layer 5 pyramidal neurons. Science, 251, 432–435. Simons, D. J. (1995). Neuronal integration in the somatosensory whisker/barrel cortex. In E. G. Jones and I. T. Diamond (Eds.), Cerebral cortex (Vol. 11, pp. 263– 297). New York: Plenum Press. Sinclair, R., & Burton, H. (1988). Responses from area 3b of somatosensory cortex to textured surfaces during active touch in primate. Somatosens. Res., 5, 283– 310. Sinclair, R. J., & Burton, H. (1991). Neuronal activity in the primary somatosensory cortex in monkeys (Macaca mulatta) during active touch of textured surface gratings: Responses to groove width, applied force, and velocity of motion. J. Neurophysiol., 66, 153–169. Sinclair, R. J., Sathian, K., & Burton, H. (1991). Neuronal responses in ventroposterolateral nucleus of thalamus in monkeys (Macaca mulatta) during active touch of gratings. Somatosens. Mot. Res., 8, 293–300. Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685. Swadlow, H. A. (1995). Influence of VPM afferents on putative inhibitory interneurons in S1 of the awake rabbit—Evidence from cross-correlation, mi-
crostimulation, and latencies to peripheral sensory stimulation. J. Neurophysiol., 73, 1584–1599. Talbot, W. H., Darian-Smith, I., Kornhuber, H. H., & Mountcastle, V. B. (1968). The sense of flutter-vibration: Comparison of the human capacity with response patterns of mechanoreceptive afferents from the monkey hand. J. Neurophysiol., 31, 301–334. Taylor, M. M., Lederman, S. J., & Gibson, R. H. (1973). Tactual perception of texture. In E. Carterette & M. Friedman (Eds.), Handbook of perception (Vol. 3, pp. 251–272). New York: Academic Press. Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comp. Neurosci., 2, 149–162. Tremblay, F., Ageranioti-Belanger, S. A., & Chapman, C. E. (1996). Cortical mechanisms underlying tactile discrimination in the monkey. I. Role of primary somatosensory cortex in passive texture discrimination. J. Neurophysiol., 76, 3382–3403. Vega-Bermudez, F., Johnson, K. O., & Hsiao, S. S. (1991). Human tactile pattern recognition: Active versus passive touch, velocity effects, and patterns of confusion. J. Neurophysiol., 65, 531–546. Viterbi, A. J. (1966). Principles of coherent communication. New York: McGraw-Hill. Wang, X., Merzenich, M. M., Beitel, R., & Schreiner, C. E. (1995). Representation of a species-specific vocalization in the primary auditory cortex of the common marmoset: Temporal and spatial characteristics. J. Neurophysiol., 74, 2685– 2706. Welker, W. I. (1964). Analysis of sniffing of the albino rat. Behaviour, 22, 223–244. White, E. L., & Keller, A. (1987). Intrinsic circuitry involving the local axon collaterals of corticothalamic projection cells in mouse SmI cortex. J. Comp. Neurol., 262, 13–26. Wilson, M. A., & Bower, J. M. (1989). The simulation of large-scale neural networks. In C. Koch and I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 291–333). Cambridge, MA: MIT Press. Wise, S. P. (1993). Monkey motor cortex: Movements, muscles, motoneurons and metrics. Trends. Neurosci., 16, 46–49. Yamamoto, T., Samejima, A., & Oka, H. (1988). Short latency activation of local circuit neurons in the cat somatosensory cortex. Brain Res., 461, 199–203. Received August 1, 1996; accepted June 3, 1997.
LETTER
Communicated by Christoph von der Malsburg
Deformation Theory of Dynamic Link Matching Toru Aonishi Department of Biophysical Engineering, Faculty of Engineering Science, Osaka University, Osaka 560, Japan
Koji Kurata Department of Systems and Human Science, Graduate School of Engineering Science, Osaka University, Osaka 560, Japan
Dynamic link matching is a self-organizing topographic mapping between a template image and a data image. The mapping tends to be continuous, linking two points sharing similar local features, which, as a result, can lead to its deformation to some degree. In analyzing such deformation mathematically, we reduced the model equation to a phase equation, which enabled us to clarify the principles of the deformation process and the relationship between high-dimensional models and lowdimensional ones. We also elucidated the characteristics of the model in the context of the standard regularization theory. 1 Introduction Pattern recognition invariant to deformation or transformation can be performed by dynamic link matching (DLM) (Konen & von der Malsburg, 1993). DLM specifies a flexible match between a template pattern and a data pattern, where local features in the data pattern have to be matched with their counterparts in the template pattern. DLM is based on the model of a self-organizing topographic map with a weak tendency to link two points sharing similar features. Since the self-organizing map has a kind of elasticity, the map tends to resemble closely the identity map, resisting disturbances. DLM shares this characteristic with the self-organizing map but will also generally link two points of similar local features, causing map deformation to some extent. DLM has been applied effectively to some engineering examples (Lades et al., 1993; Bienenstock & Doursat, 1994), but no mathematical analysis has yet been made. Therefore, we propose a mathematically tractable model based on a system of local excitation. We reduce a model equation to a phase equation (Ermentrout, 1981), which gives us a mathematical understanding of the principle of the flexible matching process. DLM can be classified into two types. One, a high-dimensional representation (Bienenstock & von der Malsburg, 1987; Konen & von der Malsburg, Neural Computation 10, 651–669 (1998)
© 1998 Massachusetts Institute of Technology
1993; Konen, Maurer, & von der Malsburg, 1994), is the map expressed by the synaptic weight distribution on the product space of the two image spaces. The other, a low-dimensional representation, corresponds directly to the graph matching (Lades et al., 1993; Bienenstock & Doursat, 1994). In this case, the map is expressed by reference vectors as in Kohonen's model (Kohonen, 1982). Matching two images is, in general, an ill-posed problem. Information is insufficient when matching local features in a data image with their counterparts in a template image. By introducing topography constraints, however, this deficiency can be addressed in the definition of the problem. The potential of the phase equation in our model is equivalent to the cost function in standard regularization theories (Poggio, Torre, & Koch, 1985), since it consists of a stabilizer (a topography constraint) and a data-fitting term (to match local features in the data image with their counterparts in the template image). Our theory bridges the gap between low-dimensional representations and high-dimensional representations, because we start from a high-dimensional model equation and derive a low-dimensional phase equation.

2 Model Description

Figure 1a shows a schematic diagram of our model. The problem is how to build a flexible match between a one-dimensional data image and a one-dimensional template image. The input patterns are feature vector functions defined on F1 and F2. Here, we simply assume F1 = F2 = R to avoid boundary effects. However, our analysis is useful in understanding models with boundary conditions, because at points far from the boundary the boundary effect is negligible and is not important for the deformation process of the mapping, which is the theme of this article. The feature vectors are defined as I1(r1), r1 ∈ F1 and I2(r2), r2 ∈ F2. I1 and I2 are expected to be equal up to a certain deformed topographic transformation. The system has to match local features in the data image with their counterparts in the template image, and thus produce a topographic transformation from F1 to F2. The map is expressed by the synaptic weight distribution w(r1, r2) on F1 × F2. The model equation is

∂w(r1, r2, t)/∂t = −w + K f(w) + ε² s(r1, r2),   (2.1)

K f(w) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} dr′1 dr′2 k(r′1, r′2) f(w(r1 − r′1, r2 − r′2)),

where |ε| ≪ 1 and s(r1, r2) is the local similarity between I1(r1) and I2(r2),

s(r1, r2) = v(I1(r1), I2(r2)).   (2.2)

This equation consists of two parts. The first part, −w + K f(w), is
Figure 1: (a) Schematic diagram of our model. I1 and I2 are equal up to a certain deformed topographic transformation. The system has to produce a topographic mapping from F1 to F2 . The mapping is expressed by the synaptic weight distribution w(r1 , r2 ) on F1 × F2 . (b) Schematic diagram of a topographic mapping (an equilibrium solution). This solution is a bank on a diagonal line.
the self-organizing part, which conserves the topography. The second part, ε² s(r1, r2), is a perturbation term. This part links two points assigned similar features in I1 and I2. Here we examine the case where the perturbation is very weak.

3 Self-Organizing Part

Let us begin with the self-organizing part,

∂w(r1, r2, t)/∂t = −w + K f(w).   (3.1)

The integration kernel k in the linear operator K is of a two-dimensional isotropic Mexican hat type, and f is a sigmoidal function with f′(x) > 0 (a monotonically increasing function). If all of the parameters are set appropriately, we can make the dynamics have a stable equilibrium solution of the following form:

w(r1, r2, t) = ψ((r2 − r1)/√2).
This solution is a bank on a diagonal line, as shown in Figure 1b. Figures 2a and 2b show an example of a bank and its one-dimensional cross-section obtained by computer simulation in another coordinate system (see equation 3.3). Some work (Takeuchi & Amari, 1979; Amari, 1980; Häussler & von der Malsburg, 1983) has analytically shown the stability of equilibrium solutions of topographic connections. In section 6, we discuss the necessary conditions for the stability of the solution when f(x) is a step function. In the general case it is not easy to analyze the stability mathematically; in this article, it is verified by computer simulation.

Figure 2: (a) Rotated equilibrium solution obtained by computer simulation. (b) One-dimensional cross-section of the equilibrium solution. (c) Schematic diagram of a solution deformed by the perturbation. We can assume that this solution does not vary along the x-axis within a small region.

If r_1 \in F_1 is fixed, w is a function on the domain F_2. In that case, we can define the topographic mapping p as

p: F_1 \to F_2, \quad r_1 \mapsto \arg\max_{r_2} w(r_1, r_2).

In our system, the basin of attraction for a single bank solution (see Figure 2a) is small because k is localized and isotropic. Thus, we set a single
bank pattern as the initial state; otherwise, a spotty pattern would be formed. This single bank solution is stable against small perturbations, as verified by computer simulation. This is because of the sigmoidal function f, which suppresses all perturbations below its threshold h and makes the system stable around w = 0.

In DLM, it is essential that the self-organizing dynamics have a bank-shaped stable equilibrium solution on a diagonal line (a topographic map). The model could be complicated to avoid the formation of stripe or hexagon patterns, for example by long-range synaptic competition. More complicated models, however, would lose the mathematical simplicity, and some could not be analyzed mathematically. Another merit of a difference-of-gaussians (DOG) kernel is that its rotational symmetry makes the mapping solution very flexible under magnification, contraction, and deformation. Many DLM models are not robust against magnification and contraction. We have compared commonly used anisotropic kernels with DOG kernels by computer simulation and found that a DOG kernel makes a model much more tolerant of deformation (deformation involving a local magnification and contraction). For a theory of deformation in DLM that is to be verified by computer simulation, the use of a DOG kernel is therefore very appropriate.

Equation 3.1 is popular as a system of local excitation in 2D neural fields. Furthermore, this form of dynamics can be regarded as a simplified model of Hebbian learning between two 1D neural fields. The lateral connections of each 1D neural field consist of short-range excitatory and long-range inhibitory connections. The interlayer synaptic weight w increases between excited cells, as in Hebbian learning. When some synapse is reinforced, other synaptic connections nearby decrease due to synaptic competition, since both neurons absorb the limited nutrition available from nearby cells. f represents the nonlinearity of the synapse.

Through simple mathematical manipulation, equation 3.1 can be expressed in another coordinate system,

\frac{\partial}{\partial t} w(x, y, t) = -w + K f(w),
\qquad K f(w) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, f\big( w(x - x', y - y') \big),   (3.2)

\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} r_1 \\ r_2 \end{pmatrix},   (3.3)

where the kernel k is unchanged by the coordinate transformation because of its isotropy. The equilibrium solution is rewritten as

w(x, y, t) = \psi(y).   (3.4)
Figure 2a shows an example of an equilibrium solution obtained by numerical calculation with equation 3.2. Equation 3.2 is invariant to spatial shifts along the y-axis. Therefore, \psi(y - \theta) is also an equilibrium solution for any constant \theta; \theta stands for the phase of the solution. If we define \psi(0) as the maximum value of \psi, then \theta represents the center of the receptive field, which can be considered as the reference vector, for example, as in Kohonen's (1982) model.

Here, we examine the stability against fluctuations that are constant along the x-axis. Substituting w(x, y, t) = \psi(y - \theta) + \epsilon u(y, t) into equation 3.2 and linearizing around \epsilon = 0, we obtain

\frac{\partial u}{\partial t} = L_\theta u, \qquad L_\theta = -1 + K_y\big( f'(\psi(y - \theta)) \cdot \big).   (3.5)

Here, L_\theta denotes a linear operator, and K_y is the following 1D convolution operator in the direction y (we reuse the symbol k for the reduced one-dimensional kernel):

K_y v(x, y) = \int_{-\infty}^{\infty} dy' \, k(y') \, v(x, y - y'), \qquad k(y) = \int_{-\infty}^{\infty} dx \, k(x, y).
All of the eigenvalues of L_\theta are nonpositive, since the equilibrium solution \psi(y - \theta) is stable. However, there is an eigenvalue 0 with eigenfunction \psi'(y - \theta), because -\psi + K f(\psi) = 0. This eigenfunction corresponds to a small spatial shift of \psi in the direction of y, because \psi(y - \theta + \epsilon) \approx \psi(y - \theta) + \epsilon \psi'(y - \theta). We assume that no other eigenfunctions exist for the eigenvalue 0, that is,

\ker L_\theta = \mathrm{span}\{ \psi'(y - \theta) \}.   (3.6)

This assumption is equivalent to the waveform stability of \psi(y - \theta): although the equilibrium solution offers no resistance to a spatial shift, it maintains its waveform. The waveform stability of our solution is unproved, but it is easily verified by computer simulation. The stability of modes with eigenvalue 0 is dramatically changed by a small perturbation; if other modes with eigenvalue 0 existed, the waveform would be broken. Even with a boundary, the solution maintains approximate waveform stability as long as it decreases toward zero in both directions and is sufficiently small at the boundary points.
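The kernel of L_\theta can be inspected numerically once an equilibrium profile is available. The sketch below (our illustration) discretizes L_\theta = -1 + K_y(f'(\psi) \cdot) as a matrix on a periodic y-grid; the gaussian profile used for \psi is only a stand-in, so the zero eigenvalue is approximate, but with an equilibrium profile taken from simulation the eigenvalue closest to zero should vanish, with eigenvector proportional to \psi'.

import numpy as np

n, Ly = 256, 6.0
dy = Ly / n
y = (np.arange(n) - n // 2) * dy
sigma1, sigma2, beta, h = 0.25, 0.33, 10.0, 0.36  # section 5 values

# Reduced 1D kernel k(y) = int dx k(x, y), computed in closed form.
kbar = np.sqrt(2 * np.pi) * (np.exp(-y**2 / (2 * sigma1**2)) / sigma1
                             - np.exp(-y**2 / (2 * sigma2**2)) / sigma2)

psi = 0.7 * np.exp(-y**2 / (2 * 0.3**2))   # stand-in for the bank profile
e = np.exp(-beta * (psi - h))
fp = beta * e / (1 + e)**2                 # f'(psi)

# Circulant matrix of K_y followed by multiplication with f'(psi).
idx = (np.arange(n)[:, None] - np.arange(n)[None, :] + n // 2) % n
Ky = kbar[idx] * dy
Lmat = -np.eye(n) + Ky * fp[np.newaxis, :]  # L_theta = -1 + K_y(f'(psi) . )

evals, evecs = np.linalg.eig(Lmat)
i0 = np.argmin(np.abs(evals))
psi_prime = np.gradient(psi, dy)
overlap = (np.abs(np.vdot(evecs[:, i0], psi_prime))
           / (np.linalg.norm(evecs[:, i0]) * np.linalg.norm(psi_prime)))
print("eigenvalue nearest 0:", evals[i0], "overlap with psi':", overlap)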
4 Perturbation

Here, we consider the perturbed model,

\frac{\partial}{\partial t} w(x, y, t) = -w + K f(w) + \epsilon^2 s(x, y),   (4.1)

where the local similarity in equation 2.2 is transformed into s(x, y) using equation 3.3. Since \epsilon is very small, the self-organizing part is only weakly affected by the image data, and the equilibrium solution 3.4 is gently deformed by the perturbation. We assume that the deformed solution can be expressed as

w(x, y, t) = \psi(y - \theta(\xi, \tau)) + \epsilon u_1(\xi, y, \tau) + \epsilon^2 u_2(\xi, y, \tau),   (4.2)

where \xi = \epsilon x is a slow space variable and \tau = \epsilon^2 t is a slow time variable. This solution can be regarded as constant with respect to x within a small region along the x-axis, as shown in Figure 2c. \epsilon u_1 and \epsilon^2 u_2 are fluctuations caused by the perturbation. Since the deformation is caused by \epsilon^2 s in the neighborhood of an equilibrium solution of the unperturbed system, the order of the deformation process equals the order of the perturbation, that is, O(\epsilon^2). To capture this small effect in the theory, it is necessary to rescale time t and space x. With \xi = \epsilon x and \tau = \epsilon^2 t, the derivatives become

\frac{\partial}{\partial t} = \epsilon^2 \frac{\partial}{\partial \tau} \quad \text{and} \quad \frac{\partial^2}{\partial x^2} = \epsilon^2 \frac{\partial^2}{\partial \xi^2},

so a diffusion term describing the resistance against deformation survives at O(\epsilon^2). If x and t were rescaled with the same order, no spatial derivative term would survive (the \partial/\partial x term is erased in our system; see appendix A); the deformation process would then be neglected in the theory, which is equivalent to a previous analysis (Aonishi, Kurata, & Mito, 1997). From this assumption, we derive the following equations (see appendix A):

0 = -\psi + K_y f(\psi) + \epsilon L_\theta u_1 + \epsilon^2 (L_\theta u_2 - m_2),   (4.3)

m_2 = -\psi'(y - \theta) \frac{\partial \theta}{\partial \tau} + \frac{1}{2} G_2(y - \theta) \frac{\partial^2 \theta}{\partial \xi^2} - \frac{1}{2} G_1(y - \theta) \Big( \frac{\partial \theta}{\partial \xi} \Big)^2 - \frac{1}{2} K_y f''(\psi(y - \theta)) u_1^2 - s(x, y),
with

G_1(y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x'^2 \big( f'(\psi(y - y')) \psi''(y - y') + f''(\psi(y - y')) \psi'(y - y')^2 \big),

G_2(y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x'^2 f'(\psi(y - y')) \psi'(y - y').
Here, we cannot neglect the O(\epsilon^2) term, since the order of the perturbation is O(\epsilon^2). The coefficient of \epsilon^0 vanishes by the equilibrium condition. Therefore, equation 4.3 reduces to

0 = L_\theta u_1 + \epsilon (L_\theta u_2 - m_2).   (4.4)

This equation shows that L_\theta u_1 = O(\epsilon). Therefore, from equation 3.6, u_1 has the form

u_1 = \alpha \psi'(y - \theta) + \epsilon \bar{u}_1,   (4.5)

where \epsilon \bar{u}_1 is a higher-order fluctuation in u_1 and \alpha is an arbitrary constant, since L_\theta u_1 = \alpha L_\theta \psi'(y - \theta) + \epsilon L_\theta \bar{u}_1 = \epsilon L_\theta \bar{u}_1 = O(\epsilon). Substituting this into equation 4.4 and neglecting higher-order terms, we obtain

L_\theta(\bar{u}_1 + u_2) - m_2 = 0,   (4.6)

m_2 = -\psi'(y - \theta) \frac{\partial \theta}{\partial \tau} + \frac{1}{2} G_2(y - \theta) \frac{\partial^2 \theta}{\partial \xi^2} - \frac{1}{2} G_1(y - \theta) \Big( \frac{\partial \theta}{\partial \xi} \Big)^2 - \frac{1}{2} \alpha^2 K_y f''(\psi(y - \theta)) \psi'(y - \theta)^2 - s(x, y).
Here, we can erase the fluctuation term L_\theta(\bar{u}_1 + u_2) in equation 4.6 by the following averaging manipulation. We average equation 4.6 with the weight f'(\psi(y - \theta)) \psi'(y - \theta):

\int_{-\infty}^{\infty} dy \, f'(\psi(y - \theta)) \psi'(y - \theta) \, m_2
= \int_{-\infty}^{\infty} dy \, f'(\psi(y - \theta)) \psi'(y - \theta) \, L_\theta(\bar{u}_1 + u_2)
= \int_{-\infty}^{\infty} dy \, f'(\psi(y - \theta)) \big( -\psi'(y - \theta) + K_y \psi'(y - \theta) \big) (\bar{u}_1 + u_2)
= \int_{-\infty}^{\infty} dy \, f'(\psi(y - \theta)) \, L_\theta \psi'(y - \theta) \, (\bar{u}_1 + u_2).
From equation 3.6, L_\theta \psi'(y - \theta) = 0. Thus, we obtain

\int_{-\infty}^{\infty} dy \, f'(\psi(y - \theta)) \psi'(y - \theta) \, m_2 = 0.   (4.7)
Therefore, we can erase the fluctuation term. Since k(x, y) is an even function, the model is symmetric about the y-axis, so we can expect the stationary solution \psi(y) to be an even function. If a symmetric model had an uneven stationary solution, almost any small symmetric perturbation of the model would transform it into a traveling wave solution (see appendix B), for there is no generic reason for it to stand still; in other words, uneven stationary solutions in symmetric models are structurally unstable. By computer simulation we verified that \psi(y) is an even stationary solution, as shown in Figure 2b. Thus \psi'(y) and f'(\psi(y))\psi'(y) are odd functions, G_1(y) is an even function, and G_2(y) is an odd function. Eliminating the corresponding terms by averaging with the weight f'(\psi(y - \theta)) \psi'(y - \theta), we obtain the following phase equation:

c \frac{\partial \theta(\xi, \tau)}{\partial \tau} = \frac{d}{2} \frac{\partial^2 \theta(\xi, \tau)}{\partial \xi^2} + \frac{\partial}{\partial \theta} \int_{-\infty}^{\infty} dy \, f(\psi(y - \theta)) \, s(\xi/\epsilon, y),   (4.8)

where

c = \int_{-\infty}^{\infty} dy \, f'(\psi(y)) \psi'(y)^2, \qquad d = \int_{-\infty}^{\infty} dy \, f'(\psi(y)) \psi'(y) G_2(y).
From variational principles, phase equation 4.8 has the following potential V:

V = \frac{d}{4c} \int_{-\infty}^{\infty} d\xi \, \Big( \frac{\partial \theta(\xi, \tau)}{\partial \xi} \Big)^2 - \frac{1}{c} \int_{-\infty}^{\infty} d\xi \int_{-\infty}^{\infty} dy \, f(\psi(y - \theta)) \, s(\xi/\epsilon, y).   (4.9)

This potential consists of two parts: one for smoothing and the other for data fitting. The analogy with standard regularization theory is therefore evident. The data-fitting term drives the mapping so that f(\psi(y - \theta)) and s(x, y) have maximal covariance, giving the mapping the tendency to connect pairs of points sharing similar local features. It should also be noted that \theta is a low-dimensional expression of the weak deformation. The low-dimensional representation of the mapping p is obtained from \theta as the solution of the following self-consistent equation:

p: F_1 \to F_2, \quad r_1 \mapsto r_2 \;\Big|\; \frac{1}{\sqrt{2}} (r_2 - r_1) = \theta\Big( \frac{\epsilon}{\sqrt{2}} (r_1 + r_2) \Big).
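Phase equation 4.8 is one-dimensional and cheap to integrate directly. The sketch below (ours) advances \theta(\xi, \tau) by explicit finite differences on a periodic grid, with c and d set to the values reported in section 7; the forcing term stands in for \partial/\partial\theta \int dy f(\psi(y - \theta)) s(\xi/\epsilon, y) and is an assumed analytic example (a similarity ridge at \theta_0(\xi) attracting the phase).

import numpy as np

c, d = 5.034132, 0.019453        # values reported in section 7
n, Lxi = 128, 2.0
dxi = Lxi / n
xi = np.arange(n) * dxi

def forcing(theta):
    # Assumed data-fitting force: pulls theta toward a similarity ridge.
    theta0 = 0.3 * np.sin(2 * np.pi * xi / Lxi)
    return -(theta - theta0)

theta = np.zeros(n)
dtau = 0.4 * c * dxi**2 / d      # stability-limited explicit time step
for _ in range(20000):
    lap = (np.roll(theta, -1) - 2 * theta + np.roll(theta, 1)) / dxi**2
    theta += dtau * ((d / 2) * lap + forcing(theta)) / c   # equation 4.8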
5 Parameters

We used the following convolution kernel,

k(x, y) = \frac{1}{\sigma_1^2} \exp\Big( -\frac{x^2 + y^2}{2\sigma_1^2} \Big) - \frac{1}{\sigma_2^2} \exp\Big( -\frac{x^2 + y^2}{2\sigma_2^2} \Big),

and the sigmoidal function

f(x) = \frac{1}{1 + \exp(-\beta(x - h))},

where \sigma_1 = 0.25, \sigma_2 = 0.33, \beta = 10, and h = 0.36. In the following simulations, we set \epsilon = 0.2.

6 Stability of the Solution

In the limit \beta \to \infty, f(x) tends to a step function:

f(x) = \begin{cases} 1 & x > h \\ 0 & x \le h \end{cases}.

In this case, we can derive two simple conditions for the stability of \psi(y). One is a one-dimensional stability condition, equivalent to the stability condition for local excitation in one-dimensional neural fields (Amari, 1977). The other is a two-dimensional stability condition, needed to keep the solution straight, as in Figure 2a. A stable solution must satisfy both conditions.

One-dimensional stability concerns perturbations that are constant with respect to x, that is, w(x, y, t) = \psi(y - \theta) + \epsilon u(y, t). Let \psi(y) > h, that is, f(\psi(y)) = 1, in the region [-\infty, \infty] \times [-\frac{a}{2}, \frac{a}{2}]. Such a solution \psi(y) exists if a satisfies

K(a) = h,   (6.1)

K(a) = \int_{-\infty}^{\infty} dx \int_0^a dy \, k(x, y) = \sqrt{2\pi} \int_0^a dy \, \Big( \frac{1}{\sigma_1} \exp\Big( -\frac{y^2}{2\sigma_1^2} \Big) - \frac{1}{\sigma_2} \exp\Big( -\frac{y^2}{2\sigma_2^2} \Big) \Big).

One-dimensional stability in this case,

K'(a) < 0,   (6.2)

is a necessary condition for the stability of the solution \psi(y). The function K(a) is depicted in Figure 3a. If 0 < h < K(a_{max}), condition 6.1 allows two solutions, a and a', in Figure 3a; condition 6.2, however, permits only a as a stable solution, while a' becomes an unstable fixed point.
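The two branches of condition 6.1 are easy to locate numerically, since the gaussian integrals in K(a) have a closed form in terms of the error function, K(a) = \pi(\mathrm{erf}(a/\sqrt{2}\sigma_1) - \mathrm{erf}(a/\sqrt{2}\sigma_2)). The sketch below (ours) finds a_{max}, checks the existence condition 0 < h < K(a_{max}), and solves K(a) = h on both branches; only the branch with K'(a) < 0 satisfies condition 6.2.

import numpy as np
from scipy.special import erf
from scipy.optimize import brentq

sigma1, sigma2, h = 0.25, 0.33, 0.36     # section 5 values

def K(a):
    # K(a) of equation 6.1 in closed form.
    return np.pi * (erf(a / (np.sqrt(2) * sigma1))
                    - erf(a / (np.sqrt(2) * sigma2)))

def Kprime(a):
    return np.sqrt(2 * np.pi) * (np.exp(-a**2 / (2 * sigma1**2)) / sigma1
                                 - np.exp(-a**2 / (2 * sigma2**2)) / sigma2)

a_max = brentq(Kprime, 0.01, 2.0)                     # peak of K(a)
print("existence:", 0 < h < K(a_max))                 # condition 6.1 solvable
a_unstable = brentq(lambda a: K(a) - h, 1e-6, a_max)  # branch with K'(a) > 0
a_stable = brentq(lambda a: K(a) - h, a_max, 2.0)     # branch with K'(a) < 0
print("stable a =", a_stable, " unstable a' =", a_unstable)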
Figure 3: (a) Schematic diagram of K(a). Only a is a stable solution; a' is an unstable fixed point. (b) Diffusion coefficient d in the \beta \to \infty limit. The vertical axis corresponds to d/\psi'^2, and the horizontal axis corresponds to the range a of solutions. A bifurcation point of the stability, parameterized by a, can be found. All parameters are set as in section 5.
Obviously, d > 0 in equation 4.8 is a necessary condition for the stability of the solution. This is the two-dimensional stability condition; in the unstable case, our phase equation is no longer valid. For the step function, we can derive the diffusion coefficient explicitly:

d = 2\sqrt{2\pi} \, \psi'(a/2)^2 \Big( \sigma_1 \Big( 1 - \exp\Big( -\frac{a^2}{2\sigma_1^2} \Big) \Big) - \sigma_2 \Big( 1 - \exp\Big( -\frac{a^2}{2\sigma_2^2} \Big) \Big) \Big),   (6.3)

where a is the size of the connected region in which the condition f(\psi(y)) = 1 holds, as discussed above for one-dimensional stability. Figure 3b shows the diffusion coefficient d/\psi'(a/2)^2 versus the range a. There is a critical point where the sign of d changes, that is, a bifurcation point of the stability parameterized by a. Figures 3a and 3b show that a region of a exists in which both conditions are satisfied. Here, all parameters are set as in section 5, except for \beta.
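The bifurcation point in Figure 3b can also be computed directly from equation 6.3, since the sign of d is fixed by the bracketed factor alone. A short sketch (ours; the value 0.40 for the stable width is our numerical estimate from the previous sketch):

import numpy as np
from scipy.optimize import brentq

sigma1, sigma2 = 0.25, 0.33

def d_scaled(a):
    # d / (2 sqrt(2 pi) psi'(a/2)^2) from equation 6.3; only the sign matters.
    return (sigma1 * (1 - np.exp(-a**2 / (2 * sigma1**2)))
            - sigma2 * (1 - np.exp(-a**2 / (2 * sigma2**2))))

a_crit = brentq(d_scaled, 0.1, 3.0)
print("d changes sign at a =", a_crit)        # bifurcation point of Figure 3b
print("d > 0 at the stable width a = 0.40:", d_scaled(0.40) > 0)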
We can expect the stability conditions obtained in the \beta \to \infty limit to apply in establishing stability at large finite values of \beta. For h = 0.36, the solution satisfies both stability conditions.

7 Computer Simulation

To verify our theoretical results, computer simulations of equation 4.1 were performed with periodic boundary conditions. The artificial local similarity s(x, y) was set as shown in Figure 4a, and the initial state was w(x, y, 0) = \psi(y) from Figure 2a. In the following simulations, d = 0.019453 and c = 5.034132. Figure 4b shows the equilibrium solution deformed by the perturbation with the local similarity of Figure 4a. In Figure 4c we show \theta(x), the ridge of the deformed equilibrium solution in Figure 4b, together with the theoretical curve from equation 4.8.

We also investigated the spatial frequency response of the system. The similarity was taken to be a function with a sinusoidal ridge of fixed amplitude 0.4. Figure 4d shows the amplitude of the solution wave versus the frequency of the similarity wave; the dotted curve indicates the theory from phase equation 4.8. According to Figure 4d, the theory conforms closely to the simulation results for f < 0.3, but for f > 0.3 it breaks down. In the range 0.3 < f < 0.4 the solutions, which initially have a continuous ridge, break into blobs, so our theory ceases to be valid. For f > 0.4 the solutions keep a continuous ridge, but the assumption that the solution varies gently along the x-axis is no longer satisfied. Figures 4e and 4f show the time-dependent behavior of \theta; the theory provides a good description of the simulation data on this time scale.

8 Application to Real Data

Our design is kept very simple so as to be mathematically tractable, which limits its applications. Even so, we provide one example of an application to real data. From real speech data, we extracted power spectra at each time with wavelet filters for 20 different frequencies, corresponding to Fourier transforms limited by a gaussian window in the time domain. The 20-dimensional power spectrum at each time was assigned to a feature vector I_i(r_i), where r_i denotes time. In this simulation, I_1 and I_2 were extracted from two different utterances of the word "neuron" by the same Japanese speaker. All parameters were set as in the former simulations, except \epsilon^2 = 0.045. We used the following local similarity function:

v(I_1(r_1), I_2(r_2)) = 1.0 - |I_1(r_1) - I_2(r_2)|.   (8.1)
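The preprocessing for this experiment can be sketched as follows (our illustration: the sampling rate, channel placement, window width, and normalization are all assumptions; the article specifies only 20 gaussian-windowed frequency channels). The local similarity of equation 8.1 is applied with |.| read as the mean absolute difference of the feature vectors.

import numpy as np

def features(signal, fs, n_channels=20, width=0.01):
    # Gaussian-windowed power spectra: one n_channels-vector per frame.
    hop, half = int(0.005 * fs), int(3 * width * fs)
    t = (np.arange(2 * half + 1) - half) / fs
    window = np.exp(-t**2 / (2 * width**2))
    freqs = np.linspace(200.0, 4000.0, n_channels)   # assumed channel centers
    centers = np.arange(half, len(signal) - half, hop)
    feats = np.empty((len(centers), n_channels))
    for i, cen in enumerate(centers):
        seg = signal[cen - half:cen + half + 1] * window
        for j, f0 in enumerate(freqs):
            osc = np.exp(-2j * np.pi * f0 * t)       # windowed Fourier power
            feats[i, j] = np.abs(np.sum(seg * osc))**2
    return feats / np.abs(feats).max()               # assumed normalization

def local_similarity(I1, I2):
    # Equation 8.1: s(r1, r2) = 1.0 - |I1(r1) - I2(r2)|.
    return 1.0 - np.abs(I1[:, None, :] - I2[None, :, :]).mean(axis=2)

fs = 16000                                           # assumed sampling rate
rng = np.random.default_rng(0)
x1, x2 = rng.standard_normal(8000), rng.standard_normal(8000)  # stand-ins
s = local_similarity(features(x1, fs), features(x2, fs))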
Figure 4: (a) Example of s(x, y). (b) Equilibrium solution deformed by the perturbation with the local similarity function shown in (a). (c) Ridge of s(x, y) and of the deformed equilibrium solution in (b), together with the theoretical curve from equation 4.8. (d) Frequency response to a sinusoidal ridge of s(x, y) (amplitude 0.4). (e, f) Time-dependent behavior of \theta; the time interval between two successive curves is 5.6 in \tau. d = 0.019453, c = 5.034132.
Figure 5: Two pieces of real data and the local similarity s(r_1, r_2) between the data, together with a simulation result and a theoretical curve. We also show the syllables of the speech data.
Figure 5 shows two pieces of real data and the local similarity s(r_1, r_2) between them. We performed computer simulations by numerical calculation of equation 4.1 with periodic boundary conditions, as in the former simulations. On the local similarity s(r_1, r_2) in Figure 5, we superimpose the simulation result (the ridge of the deformed equilibrium solution) and the theoretical curve, together with the starting points of the syllables of the speech data. Our system matched the points in I_1 with their counterparts in I_2, and the theory correlated strongly with the behavior of the system.
9 Conclusion

In this article, we proposed a mathematically tractable DLM model in which a topographic mapping is formed between a template image and a data image. The mapping is continuous but tends to link pairs of points sharing similar local features, which can result in some degree of deformation. To analyze such deformation mathematically, we derived a phase equation from the model equation and demonstrated by computer simulation that the theory agrees with the behavior of the system.

DLM models can be classified into two types. One is a high-dimensional representation; equation 2.1 corresponds to this type, since the map is expressed by a synaptic weight distribution on the product space of two image spaces. The other is a low-dimensional representation; the potential (see equation 4.9) is of the same form as the cost function of standard regularization theories, since it consists of a stabilizer (a topography constraint) and a data-fitting term (to match local features in the data image with their counterparts in the template image). Our theory bridges the gap between these two representations and gives a mathematical understanding of the principle behind the flexible matching process.

Low-dimensional versions of DLM require much less computation time and memory than high-dimensional versions. We used an example to demonstrate that the two versions are equivalent in the neighborhood of the identity map. However, some high-dimensional models (Konen & von der Malsburg, 1993; Konen et al., 1994) have a very wide basin of attraction for the successful matchings they produce. With such models, we can start from uniformly distributed connections, whereas low-dimensional models have to be initialized with some mapping. If this "prejudice" is too far from the correct matching, it cannot be reached.

Appendix A

Substituting equation 4.2 into equation 4.1, we obtain

-\epsilon^2 \psi'(y - \theta(\xi, \tau)) \frac{\partial \theta}{\partial \tau} = -\psi(y - \theta) - \epsilon u_1 - \epsilon^2 u_2 + K f(\psi(y - \theta(\xi, \tau))) + \epsilon K\big( f'(\psi(y - \theta(\xi, \tau))) u_1(\xi, y, \tau) \big) + \epsilon^2 K\big( f'(\psi(y - \theta(\xi, \tau))) u_2(\xi, y, \tau) \big) + \frac{\epsilon^2}{2} K\big( f''(\psi(y - \theta(\xi, \tau))) u_1(\xi, y, \tau)^2 \big) + \epsilon^2 s(x, y) + O(\epsilon^3).   (A.1)
Here, we can represent the K(\cdots) terms in equation A.1 as follows:

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, u(\epsilon(x - x'), y - y') = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, \frac{1}{\epsilon} k\Big( \frac{x'}{\epsilon}, y' \Big) u(\epsilon x - x', y - y').   (A.2)

As \epsilon \to 0, \frac{1}{\epsilon} k(\frac{x}{\epsilon}, y) tends to the following function:

\frac{1}{\epsilon} k\Big( \frac{x}{\epsilon}, y \Big) = \sqrt{2\pi} \, \delta(x) \Big( \frac{1}{\sigma_1} \exp\Big( -\frac{y^2}{2\sigma_1^2} \Big) - \frac{1}{\sigma_2} \exp\Big( -\frac{y^2}{2\sigma_2^2} \Big) \Big).   (A.3)
Thus, we can expand u(\epsilon x - x', y - y') in a polynomial around x' = 0 as follows:

\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, \frac{1}{\epsilon} k\Big( \frac{x'}{\epsilon}, y' \Big) u(\epsilon x - x', y - y')
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, \frac{1}{\epsilon} k\Big( \frac{x'}{\epsilon}, y' \Big) \Big[ u(\epsilon x, y - y') - u_x(\epsilon x, y - y') x' + u_{xx}(\epsilon x, y - y') \frac{x'^2}{2} + \cdots \Big]
= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \Big[ u(\epsilon x, y - y') - u_x(\epsilon x, y - y') \epsilon x' + u_{xx}(\epsilon x, y - y') \frac{\epsilon^2 x'^2}{2} + \cdots \Big]
= \int_{-\infty}^{\infty} dy' \Big[ M_0(y') u(\epsilon x, y - y') - \epsilon M_1(y') u_x(\epsilon x, y - y') + \frac{\epsilon^2}{2} M_2(y') u_{xx}(\epsilon x, y - y') + \cdots \Big],   (A.4)

where M_n(y) is the nth moment of k(x, y),

M_n(y) = \int_{-\infty}^{\infty} dx' \, k(x', y) \, x'^n.

If n is odd, M_n(y) = 0, since k is even with respect to x. Thus, the K(\cdots) terms in equation A.1 can be expanded as follows:
K f(\psi(y - \theta(\xi, \tau))) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, f(\psi(y - y' - \theta(\epsilon x, \tau)))   (A.5)
\quad + \epsilon \frac{\partial \theta(\epsilon x, \tau)}{\partial \epsilon x} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x' \, f'(\psi(y - y' - \theta(\epsilon x, \tau))) \psi'(y - y' - \theta(\epsilon x, \tau))
\quad + \frac{\epsilon^2}{2} \Big( \frac{\partial \theta(\epsilon x, \tau)}{\partial \epsilon x} \Big)^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x'^2 \big( f'(\psi(y - y' - \theta(\epsilon x, \tau))) \psi''(y - y' - \theta(\epsilon x, \tau)) + f''(\psi(y - y' - \theta(\epsilon x, \tau))) \psi'(y - y' - \theta(\epsilon x, \tau))^2 \big)
\quad - \frac{\epsilon^2}{2} \frac{\partial^2 \theta(\epsilon x, \tau)}{\partial (\epsilon x)^2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x'^2 f'(\psi(y - y' - \theta(\epsilon x, \tau))) \psi'(y - y' - \theta(\epsilon x, \tau)) + O(\epsilon^3),   (A.6)

\epsilon K\big( f'(\psi(y - \theta(\xi, \tau))) u_1(\xi, y, \tau) \big) = \epsilon \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, f'(\psi(y - y' - \theta(\epsilon x, \tau))) u_1(\epsilon x, y - y', \tau)
\quad - \epsilon^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, x' \Big( f'(\psi(y - y' - \theta(\epsilon x, \tau))) \frac{\partial}{\partial \epsilon x} u_1(\epsilon x, y - y', \tau) - f''(\psi(y - y' - \theta(\epsilon x, \tau))) \psi'(y - y' - \theta(\epsilon x, \tau)) \frac{\partial \theta(\epsilon x, \tau)}{\partial \epsilon x} u_1(\epsilon x, y - y', \tau) \Big) + O(\epsilon^3),   (A.7)

\epsilon^2 K\big( f'(\psi(y - \theta(\xi, \tau))) u_2(\xi, y, \tau) \big) = \epsilon^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, f'(\psi(y - y' - \theta(\epsilon x, \tau))) u_2(\epsilon x, y - y', \tau) + O(\epsilon^3),   (A.8)

\frac{\epsilon^2}{2} K\big( f''(\psi(y - \theta(\xi, \tau))) u_1(\xi, y, \tau)^2 \big) = \frac{\epsilon^2}{2} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} dx' \, dy' \, k(x', y') \, f''(\psi(y - y' - \theta(\epsilon x, \tau))) u_1(\epsilon x, y - y', \tau)^2 + O(\epsilon^3).   (A.9)

Erasing odd moments and neglecting higher-order terms, we derive equation 4.3.
Appendix B

We show that an uneven stationary solution of the symmetric model (see equation 3.2) is transformed into a traveling wave solution by a small symmetric perturbation. We add a small constant perturbation \epsilon s to equation 3.2:

\frac{\partial}{\partial t} w(x, y, t) = -w + K f(w) + \epsilon s.   (B.1)

Solutions in the neighborhood of an equilibrium solution of the unperturbed system can be represented as

w(x, y, t) = \psi(y - \theta(\tau)) + \epsilon u(y, \tau),   (B.2)

where \tau = \epsilon t. Substituting equation B.2 into B.1, expanding around \epsilon = 0, and neglecting higher-order terms, we obtain

\psi'(y - \theta(\tau)) \frac{d\theta}{d\tau} = L_\theta u + s.   (B.3)

Averaging equation B.3 with the weight f'(\psi(y - \theta)) \psi'(y - \theta), as in equation 4.7, we obtain the following phase equation:

c \frac{d\theta}{d\tau} = d,   (B.4)

c = \int_{-\infty}^{\infty} dy \, f'(\psi(y)) \psi'(y)^2, \qquad d = s \int_{-\infty}^{\infty} dy \, f'(\psi(y)) \psi'(y).   (B.5)
If \psi(y) is uneven, then d \neq 0. Therefore, an uneven stationary solution is transformed into a traveling wave solution by a constant perturbation.

Acknowledgments

This work was partially supported by Grants-in-Aid for Scientific Research in Priority Area (2) No. 07252219, Grants-in-Aid for the Encouragement of Young Scientists No. 2871, and JSPS Research Fellowships for Young Scientists.

References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87.
Amari, S. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364.
Aonishi, T., Kurata, K., & Mito, T. (1997). A phase locking theory of matching between rotated images by a dynamic link matching. Unpublished manuscript.
Bienenstock, E., & Doursat, R. (1994). A shape-recognition model using dynamical links. Network, 5, 241–258.
Bienenstock, E., & von der Malsburg, C. (1987). A neural network for invariant pattern recognition. Europhysics Letters, 4, 121–126.
Ermentrout, G. B. (1981). n : m phase-locking of weakly coupled oscillators. Journal of Mathematical Biology, 6, 327–342.
Häussler, A. F., & von der Malsburg, C. (1983). Development of retinotopic projections—An analytical treatment. J. Theor. Neurobio., 2, 47–73.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Konen, W. K., Maurer, T., & von der Malsburg, C. (1994). A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 7(6/7), 1019–1030.
Konen, W. K., & von der Malsburg, C. (1993). Learning to generalize from single examples in the dynamic link architecture. Neural Computation, 5, 719–735.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3), 300–311.
Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Takeuchi, A., & Amari, S. (1979). Formation of topographic maps and columnar microstructures in nerve fields. Biol. Cybern., 35, 63–72.

Received August 26, 1996; accepted June 7, 1997.
LETTER
Communicated by Klaus Obermayer
Constrained Optimization for Neural Map Formation: A Unifying Framework for Weight Growth and Normalization Laurenz Wiskott Computational Neurobiology Laboratory, Salk Institute for Biological Studies, San Diego, CA 92186-5800, U.S.A. http://www.cnl.salk.edu/CNL/
Terrence Sejnowski Computational Neurobiology Laboratory, Howard Hughes Medical Institute, Salk Institute for Biological Studies, San Diego, CA 92186-5800, U.S.A. Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Computational models of neural map formation can be considered on at least three different levels of abstraction: detailed models including neural activity dynamics, weight dynamics that abstract from the neural activity dynamics by an adiabatic approximation, and constrained optimization from which equations governing weight dynamics can be derived. Constrained optimization uses an objective function, from which a weight growth rule can be derived as a gradient flow, and some constraints, from which normalization rules are derived. In this article, we present an example of how an optimization problem can be derived from detailed nonlinear neural dynamics. A systematic investigation reveals how different weight dynamics introduced previously can be derived from two types of objective function terms and two types of constraints. This includes dynamic link matching as a special case of neural map formation. We focus in particular on the role of coordinate transformations to derive different weight dynamics from the same optimization problem. Several examples illustrate how the constrained optimization framework can help in understanding, generating, and comparing different models of neural map formation. The techniques used in this analysis may also be useful in investigating other types of neural dynamics.

1 Introduction

Neural maps are an important motif in the structural organization of the brain. The best-studied maps are those in the early visual system. For example, the retinotectal map connects a two-dimensional array of ganglion cells in the retina to a corresponding map of the visual field in the optic tectum of vertebrates in a neighborhood-preserving fashion. These are called topographic maps.
Figure 1: Goal of neural map formation: The initially random all-to-all connectivity self-organizes into an orderly connectivity that appropriately reflects the correlations within the input stimuli and the induced correlations within the output layer. The output correlations also depend on the connectivity within the output layer.
The map from the lateral geniculate nucleus (LGN) to the primary visual cortex (V1) is more complex because the inputs coming from the LGN include signals from both eyes and are unoriented, whereas most cells in V1 are tuned for orientation, an emergent property. Neurons with preferred orientation and ocular dominance in area V1 form a columnar structure, where neurons responding to the same eye or the same orientation tend to be neighbors. Other neural maps are formed in the somatosensory, auditory, and motor systems.

All neural maps connect an input layer, possibly divided into different parts (e.g., left and right eye), to an output layer. Each neuron in the output layer can potentially receive input from all neurons in the input layer (here we ignore the limits imposed by restricted axonal arborization and dendritic extension). However, particular receptive fields develop due to a combination of genetically determined and activity-driven mechanisms for self-organization. Although cortical maps have many feedback projections (for example, from area V1 back to the LGN), these are disregarded in most models of map formation and will not be considered here. The goal of neural map formation is to self-organize, from an initially random all-to-all connectivity, a regular pattern of connectivity, as in Figure 1, for the purpose of producing a representation of the input on the output layer that is of further use to the system. The developing structure depends on the architecture, the lateral connectivity, the initial conditions, and the weight dynamics, including the growth rule and normalization rules.

The first model of map formation, introduced by von der Malsburg (1973), was for a small patch of retina stimulated with bars of different orientation. The model self-organized orientation columns, with neighboring neurons having receptive fields tuned to similar orientations. This model already included all the ingredients crucial for map formation: (1) characteristic correlations within the stimulus patterns, (2) lateral interactions within the output layer, inducing characteristic correlations there
as well, (3) Hebbian weight modification, and (4) competition between synapses by weight normalization. Many similar models have been proposed since then for different types of map formation (see Erwin, Obermayer, & Schulten, 1995; Swindale, 1996; and Table 2 for examples). We do not consider models that are based on chemical markers (e.g., von der Malsburg & Willshaw, 1977). Although they may be conceptually similar to those based on neural activities, they can differ significantly in the detailed mathematical formulation. Nor do we consider in detail models that treat the input layer as a low-dimensional space, say two-dimensional for the retina, from which input vectors are drawn (e.g., Kohonen, 1982, but see section 6.8); the output neurons then receive only two synapses per neuron, one for each input dimension.

The dynamic link matching model (e.g., Bienenstock & von der Malsburg, 1987; Konen, Maurer, & von der Malsburg, 1994) is a form of neural map formation that has been developed for pattern recognition. It is mathematically similar to the self-organization of retinotectal projections; in addition, each neuron has a visual feature attached, so that a neural layer can be considered as a labeled graph representing a visual pattern. Each synapse has associated with it an individual value, which affects the dynamics and expresses the similarity between the features of the connected neurons. The self-organization process then not only tends to generate a neighborhood-preserving map; it also tends to connect neurons having similar features. If the two layers represent similar patterns, the map formation dynamics finds the correct feature correspondences and connects the corresponding neurons.

Models of map formation have been investigated by analysis (e.g., Amari, 1980; Häussler & von der Malsburg, 1983) and by computer simulations. An important tool for both methods is the objective function (or energy function) from which the dynamics can be generated as a gradient flow. The objective value (or energy) can be used to estimate which weight configurations are more likely to arise from the dynamics (e.g., MacKay & Miller, 1990). In computer simulations, the objective function is maximized (or the energy function is minimized) numerically in order to find stable solutions of the dynamics (e.g., Linsker, 1986; Bienenstock & von der Malsburg, 1987). Objective functions, which can also serve as Lyapunov functions, have many advantages. First, the existence of an objective function guarantees that the dynamics has no limit cycles or chaotic attractors as solutions. Second, an objective function often provides more direct and intuitive insight into the behavior of a dynamics, and the effects of each term can be understood more easily. Third, an objective function allows additional mathematical tools to be applied to the system, such as methods from statistical physics. Finally, an objective function provides connections to more abstract models, such as spin systems, which have been studied in depth.
Although objective functions have been used before in the context of neural map formation, they have not yet been investigated systematically. The goal of this article is to derive objective functions for a wide variety of models. Although growth rules can be derived from objective functions as gradient flows, normalization rules are derived from constraints by various methods. Thus, objective functions and constraints have to be considered in conjunction and form a constrained optimization problem. We show that although two models may differ in the formulation of their dynamics, they may be derived from the same constrained optimization problem, which thus provides a unifying framework for the two models. The equivalence between different dynamics is revealed by coordinate transformations. A major focus of this article is therefore on the effects of coordinate transformations on weight growth rules and normalization rules.

1.1 Model Architecture. The general architecture considered here consists of two layers of neurons, an input and an output layer, as in Figure 2. (We use the term layer for a population of neurons without assuming a particular geometry.) Input neurons are indicated by \rho (retina) and output neurons by \tau (tectum); the index \nu can indicate a neuron in either layer. Neural activities are indicated by a. Input neurons are connected all-to-all to output neurons, but there are no connections back to the input layer. Thus, the dynamics in the input layer is completely independent of the output layer and can be described by mean activities \langle a_\rho \rangle and correlations \langle a_\rho, a_{\rho'} \rangle. Effective lateral connections within a layer are denoted by D_{\rho\rho'} and D_{\tau\tau'}; connections projecting from the input to the output layer are denoted by w_{\tau\rho}. The second index always indicates the presynaptic neuron and the first index the postsynaptic neuron. The lateral connections defined here are called effective because they need not correspond to physical connections. For example, in the input layer, the effective lateral connections represent the correlations between input neurons regardless of what induced the correlations, D_{\rho\rho'} = \langle a_\rho, a_{\rho'} \rangle. In the example below, the output layer has short-range excitatory and long-range inhibitory connections; the effective lateral connections, however, are only excitatory. The effective lateral connections thus represent functional properties of the lateral interactions and not the anatomical connectivity itself. To simplify the notation, we use the definitions i = \{\rho, \tau\}, j = \{\rho', \tau'\}, A_{ij} = D_{\tau\tau'} A_{\rho'} = D_{\tau\tau'} \langle a_{\rho'} \rangle, and D_{ij} = D_{\tau\tau'} D_{\rho\rho'} = D_{\tau\tau'} \langle a_\rho, a_{\rho'} \rangle in section 3 and later. We assume symmetric matrices A_{ij} = A_{ji} and D_{ij} = D_{ji}, which requires some homogeneity of the architecture, that is, \langle a_\rho \rangle = \langle a_{\rho'} \rangle, \langle a_\rho, a_{\rho'} \rangle = \langle a_{\rho'}, a_\rho \rangle, and D_{\tau\tau'} = D_{\tau'\tau}.

In the next section, a simple model is used to demonstrate the basic procedure for deriving a constrained optimization problem from detailed neural dynamics. This procedure has three steps. First, the neural dynamics is transformed into a weight dynamics, in which the induced correlations are expressed directly in terms of the synaptic weights,
Figure 2: General architecture: Neurons in the input layer are connected all-to-all to neurons in the output layer. Each layer has effective lateral connections D representing functional aspects of the lateral connectivity (e.g., characteristic correlations). As an example, a path through which activity can propagate from neuron \rho to neuron \tau is shown by solid arrows. Other connections are shown as dashed arrows.
thus eliminating neural activities from the dynamics by an adiabatic approximation. Second, an objective function is constructed that can generate the dynamics of the growth rule as a gradient flow. Third, the normalization rules need to be considered and, if possible, derived from constraint functions. The last two steps depend on each other insofar as the growth rule as well as the normalization rules must be inferred under the same coordinate transformation. The three important aspects of this example (deriving correlations, constructing objective functions, and considering the constraints) are then discussed in greater detail in the following three sections, respectively. Readers may skip section 2 and continue directly with the more abstract considerations beginning in section 3. In section 6, several examples are given of how the constrained optimization framework can be used to understand, generate, and compare models of neural map formation.

2 Prototypical System

As a concrete example, consider a slightly modified version of the dynamics proposed by Willshaw and von der Malsburg (1976) for the self-organization
of a retinotectal map, where the input and output layers correspond to retina and tectum, respectively. The dynamics is qualitatively described by the following set of differential equations:

Neural activity dynamics:

\dot{m}_\rho = -m_\rho + (k * a_{\rho'})_\rho,   (2.1)
\dot{m}_\tau = -m_\tau + (k * a_{\tau'})_\tau + \sum_{\rho'} w_{\tau\rho'} a_{\rho'}.   (2.2)

Weight growth rule:

\dot{w}_{\tau\rho} = a_\tau a_\rho.   (2.3)

Weight normalization rules:

\text{if } w_{\tau\rho} < 0: \quad w_{\tau\rho} = 0,   (2.4)
\text{if } \sum_{\rho'} w_{\tau\rho'} > 1: \quad w_{\tau\rho} = \tilde{w}_{\tau\rho} + \frac{1}{M_\tau} \Big( 1 - \sum_{\rho'} \tilde{w}_{\tau\rho'} \Big) \quad \text{for all } \rho,   (2.5)
\text{if } \sum_{\tau'} w_{\tau'\rho} > 1: \quad w_{\tau\rho} = \tilde{w}_{\tau\rho} + \frac{1}{M_\rho} \Big( 1 - \sum_{\tau'} \tilde{w}_{\tau'\rho} \Big) \quad \text{for all } \tau,   (2.6)
where m denotes the membrane potential, a_\nu = \sigma(m_\nu) is the mean firing rate determined by a nonlinear input-output function \sigma, (k * a_{\nu'}) indicates a convolution of the neural activities with a kernel k representing lateral connections with local excitation and global inhibition, \tilde{w}_{\tau\rho} indicates the weights as obtained by integrating the differential equations for one time step, that is, \tilde{w}_{\tau\rho}(t + \Delta t) = w_{\tau\rho}(t) + \Delta t \, \dot{w}_{\tau\rho}(t), M_\tau is the number of links terminating on output neuron \tau, and M_\rho is the number of links originating from input neuron \rho. Equations 2.1 and 2.2 govern the neural activity dynamics in the two layers, equation 2.3 is the growth rule for the synaptic weights, and equations 2.4–2.6 are the normalization rules that keep the sums over synaptic weights originating from an input neuron or terminating on an output neuron equal to 1 and prevent the weights from becoming negative. Since the discussion is qualitative, we have included only the basic terms and discarded some parameters required to make the system work properly. One difference from the original model is that subtractive instead of multiplicative normalization rules are used.
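One full update cycle of this prototypical system can be transcribed directly. The sketch below (ours; layer sizes, blob widths, and learning rate are assumptions, and we let the output blob arise with probability proportional to the input, anticipating equation 2.8) applies growth rule 2.3 after each blob presentation and then enforces the normalization rules 2.4 through 2.6.

import numpy as np

R, T = 32, 32                      # input (retina) and output (tectum) sizes
rng = np.random.default_rng(1)
w = rng.uniform(0.0, 2.0 / R, size=(T, R))   # w[tau, rho]

def blob(center, size, width=2.0):
    # Normalized activity blob centered on a neuron (cyclic layer; assumed shape).
    d = np.abs(np.arange(size) - center)
    d = np.minimum(d, size - d)
    b = np.exp(-d**2 / (2 * width**2))
    return b / b.sum()

def normalize(w):
    # Rules 2.4-2.6: clip negative weights, then subtract the excess equally
    # over the M links (here all-to-all, so M_tau = R and M_rho = T).
    w = np.maximum(w, 0.0)                    # rule 2.4
    exc = w.sum(axis=1, keepdims=True) - 1.0  # sums over rho
    w = np.where(exc > 0, w - exc / R, w)     # rule 2.5
    exc = w.sum(axis=0, keepdims=True) - 1.0  # sums over tau
    w = np.where(exc > 0, w - exc / T, w)     # rule 2.6
    return w

dt = 0.05
for _ in range(5000):
    rho0 = rng.integers(R)                 # blob arises anywhere in the input
    a_rho = blob(rho0, R)
    i_out = w @ a_rho                      # input to the output layer
    tau0 = rng.choice(T, p=i_out / i_out.sum())  # probabilistic blob location
    a_tau = blob(tau0, T)
    w = normalize(w + dt * np.outer(a_tau, a_rho))   # growth rule 2.3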
2.1 Correlations. The dynamics within the neural layers is well understood (Amari, 1977; Konen et al., 1994). Local excitation and global inhibition lead to the development of a local patch of activity, called a blob. The shape and size of the blob depend on the kernel k and other parameters of the system and can be described by B_{\rho'\rho_0} if centered on input neuron \rho_0 and B_{\tau'\tau_0} if centered on output neuron \tau_0. The location of the blob depends on the input, which is assumed to be weak enough that it does not change the shape of the blob. Assume the input layer receives noise such that the blob arises with equal probability p(\rho_0) = 1/R centered on any of the input neurons, where R is the number of input neurons. For simplicity we assume cyclic boundary conditions to avoid boundary effects. The location of the blob in the output layer, on the other hand, is affected by the input

i_{\tau'}(\rho_0) = \sum_{\rho'} w_{\tau'\rho'} B_{\rho'\rho_0},   (2.7)

received from the input layer and therefore depends on the position \rho_0 of the blob in the input layer. Only one blob can occur in each layer, and the two layers need to be reset before new blobs can arise. A sequence of blobs is required to induce the appropriate correlations.

Konen et al. (1994) have shown that without noise, blobs in the output layer will arise at the location \tau_0 with the largest overlap between the input i_{\tau'}(\rho_0) and the final blob profile B_{\tau'\tau_0}, that is, the location for which \sum_{\tau'} B_{\tau'\tau_0} i_{\tau'}(\rho_0) is maximal. This winner-take-all behavior makes it difficult to analyze the system. We therefore assume that, in contrast to this deterministic dynamics, the blob arises at location \tau_0 with a probability equal to the overlap between the input and blob activity,

p(\tau_0|\rho_0) = \sum_{\tau'} B_{\tau'\tau_0} i_{\tau'}(\rho_0) = \sum_{\tau'\rho'} B_{\tau'\tau_0} w_{\tau'\rho'} B_{\rho'\rho_0}.   (2.8)

Assume the blobs are normalized such that \sum_{\rho'} B_{\rho'\rho_0} = 1 and \sum_{\tau_0} B_{\tau'\tau_0} = 1, and that the connectivity is normalized such that \sum_{\tau'} w_{\tau'\rho'} = 1, which is the case for the system above if the input layer does not have more neurons than the output layer. This implies \sum_{\tau'} i_{\tau'}(\rho_0) = 1 and \sum_{\tau_0} p(\tau_0|\rho_0) = 1 and justifies the interpretation of p(\tau_0|\rho_0) as a probability. Although it is plausible that such a probabilistic blob location could be approximated by noise in the output layer, it is difficult to develop a concrete model. For a similar but more algorithmic activity model (Obermayer, Ritter, & Schulten, 1990), an exact noise model for the probabilistic blob location can be formulated (see the appendix). With equation 2.8, the probability for a particular combination of blob locations is

p(\tau_0, \rho_0) = p(\tau_0|\rho_0) p(\rho_0) = \sum_{\tau'\rho'} B_{\tau'\tau_0} w_{\tau'\rho'} B_{\rho'\rho_0} \frac{1}{R},   (2.9)
678
Laurenz Wiskott and Terrence Sejnowski
XX
1 Bτ τ0 Bρρ0 R τ 0 ρ0 τ 0 ρ 0 Ã ! Ã ! X 1X X = Bτ 0 τ0 Bτ τ0 wτ 0 ρ 0 Bρ 0 ρ0 Bρρ0 R τ 0 ρ 0 τ0 ρ0 =
=
Bτ 0 τ0 wτ 0 ρ 0 Bρ 0 ρ0
1X¯ Bτ τ 0 wτ 0 ρ 0 B¯ ρ 0 ρ , R τ 0ρ0
with B¯ ν 0 ν =
X ν0
(2.11)
(2.12) Bν 0 ν0 Bνν0 ,
(2.13)
where the brackets h·i indicate the ensemble average over a large number of blob presentations. R1 B¯ ρ 0 ρ and B¯ τ τ 0 are the effective lateral connectivities of the input and the output layer, respectively, and are symmetrical even if the individual blobs Bρρ0 and Bτ τ0 are not, that is, Dρ 0 ρ = R1 B¯ ρ 0 ρ , Dτ τ 0 = B¯ τ τ 0 , and Dij = Dji = Dτ τ 0 Dρ 0 ρ = R1 B¯ τ τ 0 B¯ ρ 0 ρ . Notice the linear relation between the weights wτ 0 ρ 0 and the correlations haτ aρ i in the probabilistic blob model (see equation 2.13). Substituting the correlation into equation 2.3 for the weight dynamics leads to: hw˙ τρ i = haτ aρ i =
1X¯ Bτ τ 0 wτ 0 ρ 0 B¯ ρ 0 ρ . R τ 0ρ0
(2.14)
The same normalization rules given above (equations 2.4–2.6) apply to this dynamics. Since there is little danger of confusion, we omit the averaging brackets on \langle \dot{w}_{\tau\rho} \rangle in subsequent equations and simply write \dot{w}_{\tau\rho} = \langle a_\tau, a_\rho \rangle. Although we did not give a mathematical model of the mechanism by which the probabilistic blob location of equation 2.8 could be implemented, it is interesting to note that the probabilistic approach generalizes to other activity patterns, such as stripe patterns or hexagons, which can be generated by Mexican hat interaction functions (local excitation, finite-range inhibition) (von der Malsburg, 1973; Ermentrout & Cowan, 1979). If the probability of a stripe pattern's arising in the output layer is linear in its overlap with the input, the same derivation follows, though the indices \rho_0 and \tau_0 will then refer to the phase and orientation of the patterns rather than the location of the blobs.

Using the probabilistic blob location in the output layer instead of the deterministic one is analogous to the soft competitive learning proposed by Nowlan (1990) as an alternative to hard (or winner-take-all) competitive learning. Nowlan demonstrated superior performance of soft competition over hard competition for a radial basis function network tested on recognition of handwritten characters and spoken vowels, and there might be a similar advantage for neural map formation: the probabilistic blob location induced by noise might help improve neural map formation by avoiding local optima.
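In matrix notation, the averaged dynamics of equation 2.14 is a pair of matrix products, which makes the linearity explicit. The sketch below (our notation and blob shapes) builds the effective lateral connectivities \bar{B} of equation 2.13 from blob profiles and iterates the averaged growth rule.

import numpy as np

R, T = 32, 32

def blob_matrix(size, width=2.0):
    # B[nu', nu0]: blob profile centered on nu0 (cyclic, column-normalized).
    i = np.arange(size)
    d = np.abs(i[:, None] - i[None, :])
    d = np.minimum(d, size - d)
    B = np.exp(-d**2 / (2 * width**2))
    return B / B.sum(axis=0, keepdims=True)

B_rho, B_tau = blob_matrix(R), blob_matrix(T)
Bbar_rho = B_rho @ B_rho.T      # equation 2.13: sum over blob centers rho0
Bbar_tau = B_tau @ B_tau.T

rng = np.random.default_rng(2)
w = rng.uniform(0.0, 1.0 / R, size=(T, R))
dt = 0.1
for _ in range(100):
    w = w + dt * (Bbar_tau @ w @ Bbar_rho) / R   # averaged growth rule 2.14
    # (the normalization rules 2.4-2.6 would still be applied in practice)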
2.2 Objective Function. The next step is to find an objective function that generates the dynamics as a gradient flow. For the above example, a suitable objective function is

H(w) = \frac{1}{2R} \sum_{\tau\rho\tau'\rho'} w_{\tau\rho} \bar{B}_{\rho\rho'} \bar{B}_{\tau\tau'} w_{\tau'\rho'},   (2.15)

since it yields equation 2.14 from \dot{w}_{\tau\rho} = \frac{\partial H(w)}{\partial w_{\tau\rho}}, taking into account that \bar{B}_{\nu\nu'} = \bar{B}_{\nu'\nu}.

2.3 Constraints. The normalization rules given above ensure that synaptic weights do not become negative and that the sums over synaptic weights originating from an input neuron or terminating on an output neuron do not become larger than 1. This can be written in the form of inequalities for constraint functions g:

g_{\tau\rho}(w) = w_{\tau\rho} \ge 0,   (2.16)
g_\tau(w) = 1 - \sum_{\rho'} w_{\tau\rho'} \ge 0,   (2.17)
g_\rho(w) = 1 - \sum_{\tau'} w_{\tau'\rho} \ge 0.   (2.18)

These constraints define a region within which the objective function is to be maximized by steepest ascent. While the constraints follow uniquely from the normalization rules, the converse is not true. In general, there are various normalization rules that would enforce or at least approximate the constraints, but only some of them are compatible with the constrained optimization framework. As shown in section 5.2.1, compatible normalization rules can be obtained by the method of Lagrangian multipliers. If a constraint g_x, x \in \{\tau\rho, \tau, \rho\}, is violated, a normalization rule of the form

\text{if } g_x(\tilde{w}) < 0: \quad w_{\tau\rho} = \tilde{w}_{\tau\rho} + \lambda_x \frac{\partial g_x}{\partial \tilde{w}_{\tau\rho}} \quad \text{for all } \tau\rho,   (2.19)

has to be applied, where \lambda_x is a Lagrangian multiplier determined such that g_x(w) = 0. This method leads precisely to equations 2.4–2.6, which are therefore a compatible set of normalization rules for the constraints above. This is necessary to make the formulation as a constrained optimization problem (see equations 2.15–2.18) an appropriate description of the original dynamics (see equations 2.3–2.6).

This example illustrates the general scheme by which a detailed model dynamics for neural map formation can be transformed into a constrained optimization problem. The correlations, objective functions, and constraints are discussed in greater detail and for a wide variety of models below.
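For the linear constraints 2.16 through 2.18, the multiplier in equation 2.19 can be computed in closed form, which is how the subtractive rules 2.4 through 2.6 arise. A generic sketch (ours) for any linear constraint g(w) \ge 0:

import numpy as np

def enforce(w_tilde, grad_g, g_value):
    # Equation 2.19: w = w_tilde + lambda * (dg/dw), with the Lagrangian
    # multiplier chosen so that g(w) = 0 exactly (g linear in w).
    lam = -g_value / np.dot(grad_g, grad_g)
    return w_tilde + lam * grad_g

# Example: row-sum constraint g_tau(w) = 1 - sum_rho' w[tau, rho'] (eq. 2.17).
rng = np.random.default_rng(3)
w = rng.uniform(0.1, 0.3, size=(8, 8))
tau = 0
g_val = 1.0 - w[tau].sum()
if g_val < 0:                               # constraint violated
    grad = -np.ones_like(w[tau])            # dg_tau / dw[tau, rho] = -1
    w[tau] = enforce(w[tau], grad, g_val)   # reproduces subtractive rule 2.5
    assert abs(w[tau].sum() - 1.0) < 1e-12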
3 Correlations

In the above example, correlations in a highly nonlinear dynamics led to a linear relationship between the synaptic weights and the induced correlations, and we derived effective lateral connections in the input as well as the output layer mediating these correlations. Corresponding equations for the correlations have been derived for other, mostly linear activity models (e.g., Linsker, 1986; Miller, 1990; von der Malsburg, 1995), as summarized here.

Assume the dynamics in the input layer is described by neural activities a_\rho(t) \in \mathbb{R}, which yield mean activities \langle a_\rho \rangle and correlations \langle a_\rho, a_{\rho'} \rangle. The input received by the output layer is assumed to be a linear superposition of the activities of the input neurons:

i_{\tau'} = \sum_{\rho'} w_{\tau'\rho'} a_{\rho'}.   (3.1)

This input then produces activity in the output layer through effective lateral connections in a linear fashion:

a_\tau = \sum_{\tau'} D_{\tau\tau'} i_{\tau'} = \sum_{\tau'\rho'} D_{\tau\tau'} w_{\tau'\rho'} a_{\rho'}.   (3.2)

As seen in the above example, this linear behavior could be generated by a nonlinear model. Thus, the neurons need not be linear, only the effective behavior of the correlations (cf. Sejnowski, 1976; Ginzburg & Sompolinsky, 1994). The mean activity of output neurons is

\langle a_\tau \rangle = \sum_{\tau'\rho'} D_{\tau\tau'} w_{\tau'\rho'} \langle a_{\rho'} \rangle = \sum_j A_{ij} w_j.   (3.3)

Assuming a linear correlation function (\langle a_\rho, \alpha(a_{\rho'} + a_{\rho''}) \rangle = \alpha \langle a_\rho, a_{\rho'} \rangle + \alpha \langle a_\rho, a_{\rho''} \rangle with a real constant \alpha), such as the average product or the covariance (Sejnowski, 1977), the correlation between input and output neurons is

\langle a_\tau, a_\rho \rangle = \sum_{\tau'\rho'} D_{\tau\tau'} w_{\tau'\rho'} \langle a_{\rho'}, a_\rho \rangle = \sum_j D_{ij} w_j.   (3.4)

Note that i = \{\rho, \tau\}, j = \{\rho', \tau'\}, A_{ij} = A_{ji} = D_{\tau\tau'} A_{\rho'} = D_{\tau\tau'} \langle a_{\rho'} \rangle, and D_{ij} = D_{ji} = D_{\tau\tau'} D_{\rho'\rho} = D_{\tau\tau'} \langle a_{\rho'}, a_\rho \rangle. Since the right-hand sides of equations 3.3 and 3.4 are formally equivalent, we will consider only the latter in the further analysis, bearing in mind that equation 3.3 is included as a special case.
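In the compound-index notation, equation 3.4 is a single linear map acting on the weight vector. The sketch below (ours, with arbitrary symmetric stand-ins for the lateral connectivities and input correlations) evaluates the correlations both as a tensor contraction and through the flattened matrix D_ij, confirming that the two forms agree.

import numpy as np

R, T = 16, 16
rng = np.random.default_rng(4)

def sym(n):
    # Arbitrary symmetric stand-in (e.g., for D_tau,tau' or <a_rho', a_rho>).
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2

D_tau, C_rho = sym(T), sym(R)
w = rng.standard_normal((T, R))            # w[tau', rho']

# <a_tau, a_rho> = sum_{tau', rho'} D[tau, tau'] w[tau', rho'] C[rho', rho]
corr = np.einsum("ts,sr,rq->tq", D_tau, w, C_rho)

# Compound indices i = (tau, rho), j = (tau', rho'): D_ij = D_tau * C_rho.
D = np.einsum("ts,rq->tqsr", D_tau, C_rho).reshape(T * R, T * R)
assert np.allclose(D @ w.reshape(-1), corr.reshape(-1))   # equation 3.4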
In this linear correlation model, all variables may assume negative values, which may not be plausible for the neural activities a_\rho and a_\tau. However, equation 3.4 can also be derived for nonnegative activities, and an equation similar to equation 3.3 can be derived if the mean activities \langle a_\rho \rangle are positive. The difference in the latter case would be an additional constant, which can always be compensated for in the growth rule. The correlation model in Linsker (1986) differs from the linear one introduced here in two respects: the input (see equation 3.1) has an additional constant term, and correlations are defined by subtracting positive constants from the activities. However, it can be shown that correlations in the model in Linsker (1986) are a linear combination of a constant and the terms of equations 3.3 and 3.4.

4 Objective Functions

In general, there is no systematic way of finding an objective function for a particular dynamical system, but it is possible to determine whether an objective function exists. The necessary and sufficient condition is that the flow field of the dynamics be curl free. If there exists an objective function H(w) with continuous partial derivatives of order two that generates the dynamics \dot{w}_i = \partial H(w)/\partial w_i, then

\frac{\partial \dot{w}_i}{\partial w_j} = \frac{\partial^2 H(w)}{\partial w_j \partial w_i} = \frac{\partial^2 H(w)}{\partial w_i \partial w_j} = \frac{\partial \dot{w}_j}{\partial w_i}.   (4.1)

The existence of an objective function is thus equivalent to \partial \dot{w}_i / \partial w_j = \partial \dot{w}_j / \partial w_i, which can be checked easily. For the dynamics given by

\dot{w}_i = \sum_j D_{ij} w_j   (4.2)

(cf. equation 2.14), for example, \partial \dot{w}_i / \partial w_j = D_{ij} = \partial \dot{w}_j / \partial w_i, which shows that it can be generated as a gradient flow. A suitable objective function is

H(w) = \frac{1}{2} \sum_{ij} w_i D_{ij} w_j   (4.3)

(cf. equation 2.15), since it yields \dot{w}_i = \partial H(w)/\partial w_i. A dynamics that cannot be generated by an objective function directly is

\dot{w}_i = w_i \sum_j D_{ij} w_j,   (4.4)

as used in Häussler and von der Malsburg (1983), since for i \ne j we obtain \partial \dot{w}_i / \partial w_j = w_i D_{ij} \ne w_j D_{ji} = \partial \dot{w}_j / \partial w_i, and \dot{w}_i is not curl free. However, it is
sometimes possible to convert a dynamics with curl into a curl-free dynamics by a coordinate transformation. Applying the transformation w_i = \frac{1}{4} v_i^2 (C^w) to equation 4.4 yields

\dot{v}_i = \frac{dv_i}{dw_i} \dot{w}_i = \frac{1}{\sqrt{w_i}} \, w_i \sum_j D_{ij} w_j = \frac{1}{2} v_i \sum_j D_{ij} \frac{1}{4} v_j^2,   (4.5)

which is curl free, since \partial \dot{v}_i / \partial v_j = \frac{1}{2} v_i D_{ij} \frac{1}{2} v_j = \partial \dot{v}_j / \partial v_i. Thus, the dynamics of \dot{v}_i in the new coordinate system V^w can be generated as a gradient flow. A suitable objective function is

H(v) = \frac{1}{2} \sum_{ij} \frac{1}{4} v_i^2 \, D_{ij} \, \frac{1}{4} v_j^2,   (4.6)

since it yields \dot{v}_i = \partial H(v)/\partial v_i. Transforming the dynamics of v back into the original coordinate system W, of course, yields the original dynamics of equation 4.4:

\dot{w}_i = \frac{dw_i}{dv_i} \dot{v}_i = \frac{1}{2} v_i \cdot \frac{1}{2} v_i \sum_j D_{ij} \frac{1}{4} v_j^2 = w_i \sum_j D_{ij} w_j.   (4.7)

Coordinate transformations thus can provide objective functions for dynamics that are not curl free. Notice that H(v) is the same objective function as H(w) (see equation 4.3) evaluated in V^w instead of W. Thus H(v) = H(w(v)), and H is a Lyapunov function for both dynamics. More generally, for an objective function H and a coordinate transformation w_i = w_i(v_i),

\dot{w}_i = \frac{d}{dt}[w_i(v_i)] = \frac{dw_i}{dv_i} \dot{v}_i = \frac{dw_i}{dv_i} \frac{\partial H}{\partial v_i} = \Big( \frac{dw_i}{dv_i} \Big)^2 \frac{\partial H}{\partial w_i},   (4.8)

which implies that the coordinate transformation simply adds a factor (dw_i/dv_i)^2 to the original growth term obtained in the original coordinate system W. For the dynamics in equation 4.4, derived under the coordinate transformation w_i = \frac{1}{4} v_i^2 (C^w) relative to the dynamics of equation 4.2, we verify that (dw_i/dv_i)^2 = w_i. Equation 4.8 also shows that fixed points are preserved under the coordinate transformation in the region where dw_i/dv_i is defined and finite, but that additional fixed points may be introduced where dw_i/dv_i = 0.

This effect of coordinate transformations is known from the general theory of relativity and tensor analysis (e.g., Dirac, 1996).
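The Lyapunov property is easy to confirm numerically: along trajectories of equation 4.4 with positive weights, H(w) from equation 4.3 never decreases, even though equation 4.4 itself is not a gradient flow in W. A sketch (ours; matrix, initial condition, and step size are arbitrary small values chosen to keep the explicit integration well behaved):

import numpy as np

rng = np.random.default_rng(5)
n = 10
A = rng.standard_normal((n, n))
D = 0.1 * (A + A.T)                    # symmetric D_ij, scaled down

def H(w):
    return 0.5 * w @ D @ w             # objective function, equation 4.3

w = rng.uniform(0.1, 1.0, size=n)      # positive weights, w_i = v_i^2 / 4
dt = 1e-3
values = [H(w)]
for _ in range(500):
    w = w + dt * w * (D @ w)           # equation 4.4 (has curl in W)
    values.append(H(w))

# dH/dt = sum_i w_i (D w)_i^2 >= 0 for w >= 0, as the transformation
# w_i = v_i^2 / 4 to the gradient system in V predicts.
assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))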
Figure 3: The effect of coordinate transformations on the induced dynamics. The figure shows a simple objective function H in the original coordinate system W (left) and the new coordinate system V (right), with w_1 = v_1/2 and w_2 = v_2. The gradient induced in W (dashed arrow) and the gradient induced in V and then backtransformed into W (solid arrows) have the same component in the w_2 direction but differ by a factor of four in the w_1 direction (cf. equation 4.8). Notice that the two dynamics differ in amplitude and direction, but that H is a Lyapunov function for both.
The gradient of a potential (or objective function) is a covariant vector, which acquires a factor dw_i/dv_i through the transformation from W to V. Since \dot{v}, as a kinematic description of the trajectory, is a contravariant vector, the transformation back from V to W adds another factor dw_i/dv_i. If both vectors were either covariant or contravariant, the back-and-forth transformation between the different coordinate systems would have no effect. The same argument holds for the constraints in section 5.2. In some cases, it may also be useful to consider more general coordinate transformations w_i = w_i(v), where each weight w_i may depend on all variables v_j, as is common in the general theory of relativity and tensor analysis; equation 4.8 would have to be modified correspondingly. In Figure 3, the effect of coordinate transformations is illustrated by a simple example.

Table 1 shows two objective functions and the corresponding dynamics terms they induce under different coordinate transformations. The first objective function, L, is linear in the weights and induces constant weight growth (or decay) under coordinate transformation C^1. The growth of one weight does not depend on other weights. This term can be useful for dynamic link matching to introduce a bias for each weight depending on the similarity of the connected neurons. The second objective function, Q, is a quadratic form. The induced growth rule for one weight involves other weights and is usually based on the correlations between input and output neurons, \langle a_\tau a_\rho \rangle = \sum_j D_{ij} w_j, and possibly also the mean activities of output neurons, \langle a_\tau \rangle = \sum_j A_{ij} w_j. This term is, for instance, important for forming topographic maps. Functional aspects of term Q are discussed in section 6.3.
5 Constraints

A constraint is either an inequality, describing a surface (of dimensionality RT - 1, if RT is the number of weights) between the valid and the invalid region, or an equality, describing the valid region itself as a surface. A normalization rule is a particular prescription for how the constraint is to be enforced. Thus, constraints can be uniquely derived from normalization rules, but not vice versa.
5.1 Orthogonal Versus Nonorthogonal Normalization Rules. Normalization rules can be divided into two classes: those that enforce the constraints orthogonally to the constraint surface, that is, along the gradient of the constraint function, and those that also have a component tangential to the constraint surface (see Figure 4). We refer to the former as orthogonal and to the latter as nonorthogonal. Only the orthogonal normalization rules are compatible with an objective function, as is illustrated in Figure 5. For a dynamics induced as an ascending gradient flow of an objective function, the value of the objective function constantly increases as long as the weights change. If the weights cross a constraint surface, a normalization rule has to be applied iteratively together with the growth rule. Starting from the constraint surface at point $w_0$, the gradient ascent causes a step to point $\tilde w$ in the invalid region, where $\tilde w - w_0$ is in general nonorthogonal to the constraint surface. A normalization rule causes a step back to $w$ on the constraint surface. If the normalization rule is orthogonal, that is, $w - \tilde w$ is orthogonal to the constraint surface, then $w - \tilde w$ is shorter than or equal to $\tilde w - w_0$, and the cosine of the angle between the combined step $w - w_0$ and the gradient $\tilde w - w_0$ is nonnegative; that is, the value of the objective function does not decrease. This cannot be guaranteed for nonorthogonal normalization rules, in which case the objective function of the unconstrained dynamics may not even be a Lyapunov function for the combined system, including weight dynamics and normalization rules. Thus, only orthogonal normalization rules can be used in the constrained optimization framework. The term orthogonal is not well defined away from the constraint surface. However, the constraints used in this article are rather simple, and a natural orthogonal direction is usually available for all weight vectors. Thus, the term orthogonal will also be used for normalization rules that do not project back exactly onto the constraint surface but keep the weights close to the surface and affect the weights orthogonally to it. For more complicated constraint surfaces, more careful considerations may be required.
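The geometry of Figure 5 is easy to verify numerically. Below is a minimal sketch (Python/NumPy, with made-up example values) for the constraint $g(w) = 1 - (w_i + w_j) \ge 0$ of Figure 4: the subtractive rule steps back along the constraint gradient and has no tangential component, whereas the multiplicative rescaling does have one and is therefore nonorthogonal in these coordinates.

```python
import numpy as np

# Constraint g(w) = 1 - (w_i + w_j) >= 0; surface normal is (1, 1)/sqrt(2).
w_tilde = np.array([0.8, 0.5])          # point in the invalid region (sum > 1)

# Orthogonal (subtractive) rule: step back along the gradient of g.
excess = w_tilde.sum() - 1.0
w_sub = w_tilde - excess / 2.0          # w = w~ + lambda * dg/dw, dg/dw = (-1,-1)

# Nonorthogonal (multiplicative) rule: rescale onto the surface.
w_mul = w_tilde / w_tilde.sum()

for w in (w_sub, w_mul):
    step = w - w_tilde
    normal = np.array([1.0, 1.0]) / np.sqrt(2.0)
    tangential = step - (step @ normal) * normal
    print(w.sum(), np.linalg.norm(tangential))
# Both land on the surface (sum = 1), but only the subtractive step
# has zero tangential component; the multiplicative one does not.
```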
Figure 4: Different constraints and different ways in which constraints can be violated and enforced. The constraints along the axes are given by $g_i = w_i \ge 0$ and $g_j = w_j \ge 0$, which keep the weights $w_i$ and $w_j$ nonnegative. The constraint $g_n = 1 - (w_i + w_j) \ge 0$ keeps the sum of the two weights smaller than or equal to 1. Black dots indicate points in state space that may have been reached by the growth rule. Dot 1: None of the constraints is violated, and no normalization rule is applied. Dot 2: $g_n \ge 0$ is violated, and an orthogonal subtractive normalization rule is applied. Dot 3: $g_n \ge 0$ is violated, and a nonorthogonal multiplicative normalization rule is applied. Notice that the normalization does not follow the gradient of $g_n$; it is not perpendicular to the line $g_n = 0$. Dot 4: Two constraints are violated, and the respective normalization rules must be applied simultaneously. Dot 5: $g_n \ge 0$ is violated, but the respective normalization rule violates $g_j \ge 0$. Again both rules must be applied simultaneously. The dotted circles indicate regions considered in greater detail in Figure 5.
Whether a normalization rule is orthogonal depends on the coordinate system in which it is applied. This is illustrated in Figure 6 and discussed in greater detail below. The same rule can be nonorthogonal in one coordinate system but orthogonal in another. It is important to find the coordinate system in which an objective function can be derived and the normalization rules are orthogonal. This then is the coordinate system in which the model can be most conveniently analyzed. Not all nonorthogonal normalization rules can be transformed into orthogonal ones. In Wiskott and von der Malsburg (1996), for example, a normalization rule is used that affects a group of weights if single weights grow beyond their limits. Since the constraint surface depends on only one weight, only that weight can be affected by an orthogonal normalization rule. Thus, this normalization rule cannot be made orthogonal.

5.2 Constraints Can Be Enforced in Different Ways. For a given constraint, orthogonal normalization rules can be derived using various methods.
Figure 5: The effect of orthogonal versus nonorthogonal normalization rules. The two circled regions are taken from Figure 4. The effect of the orthogonal subtractive rule is shown on the left, and the nonorthogonal multiplicative rule is shown on the right. The growth dynamics is assumed to be induced by an objective function, the equipotential curves of which are shown as dashed lines. The objective function increases to the upper right. The growth rule (dotted arrows) and normalization rule (dashed arrows) are applied iteratively. The net effect is different in the two cases. For the orthogonal normalization rule, the dynamics increases the value of the objective function, while for the nonorthogonal normalization, the value decreases and the objective function that generates the growth rule is not even a Lyapunov function for the combined system.
These include the method of Lagrangian multipliers, the inclusion of penalty terms, and normalization rules that are integrated into the weight dynamics without necessarily having any objective function. The former two methods are common in optimization theory. The latter is more specific to a model of neural map formation. It is also possible to substitute a constraint by a coordinate transformation.

5.2.1 Method of Lagrangian Multipliers. Lagrangian multipliers can be used to derive explicit normalization rules, such as equations 2.4–2.6. If the constraint $g_n(w) \ge 0$ is violated for $\tilde w$ as obtained after one integration step of the learning rule, $\tilde w_i(t + \Delta t) = w_i(t) + \Delta t\,\dot w_i(t)$, the weight vector has to be corrected along the gradient of the constraint function $g_n$, which is orthogonal to the constraint surface $g_n(w) = 0$. If $g_n(\tilde w) < 0$:

$$w_i = \tilde w_i + \lambda_n \frac{\partial g_n}{\partial \tilde w_i} \quad \text{for all } i, \tag{5.1}$$

where $(\partial g_n/\partial \tilde w_i) = (\partial g_n/\partial w_i)$ at $w = \tilde w$, and $\lambda_n = \lambda_n(\tilde w)$ is a Lagrangian multiplier determined such that $g_n(w) = 0$ is obtained. If no constraint
Figure 6: The effect of a coordinate transformation on a normalization rule. The constraint function is $g_n = 1 - (w_i + w_j) \ge 0$, and the coordinate transformation is $w_i = \frac{1}{4}v_i^2$, $w_j = \frac{1}{4}v_j^2$. In the new coordinate system $V^w$ (right), the constraint becomes $g_n = 1 - \frac{1}{4}(v_i^2 + v_j^2) \ge 0$ and leads there to an orthogonal multiplicative normalization rule. Transforming back into $W$ (left) then yields a nonorthogonal multiplicative normalization rule.
is violated, the weights are simply taken to be $w_i = \tilde w_i$. The constraints that must be taken into account, either because they are violated or because they become violated if a violated one is enforced, are called operative. All others are called inoperative and do not need to be considered for that integration step. If there is more than one operative constraint, the normalization rule becomes, if $g_n(\tilde w) < 0$:

$$w_i = \tilde w_i + \sum_{n \in N_O} \lambda_n \frac{\partial g_n}{\partial \tilde w_i} \quad \text{for all } i, \tag{5.2}$$
where $N_O$ denotes the set of operative constraints. The Lagrangian multipliers $\lambda_n$ are determined such that $g_{n'}(w) = 0$ for all $n' \in N_O$ (cf. Figure 4). Computational models of neural map formation usually take another strategy and simply iterate the normalization rules (see equation 5.1) for the operative constraints individually, which is in general not accurate but may be sufficient for most practical purposes. It should also be mentioned that in the standard method of Lagrangian multipliers as usually applied in physics or optimization theory, the two steps, weight growth and normalization, are combined in one dynamical equation such that $w$ remains on the constraint surface. The steps were split here to obtain explicit normalization rules independent of growth rules.
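As an illustration, the following sketch implements one projection step of equation 5.1 for a single linear constraint $g(w) = \theta - \sum_j \beta_j w_j \ge 0$, for which $\lambda_n$ can be computed in closed form (Python with NumPy; the function name and example values are ours).

```python
import numpy as np

def enforce_sum_constraint(w_tilde, beta, theta):
    """One subtractive Lagrange step for g(w) = theta - beta.w >= 0 (eq. 5.1):
    w = w~ + lambda * dg/dw with dg/dw = -beta, lambda chosen so g(w) = 0."""
    g = theta - beta @ w_tilde
    if g >= 0.0:                      # constraint inoperative
        return w_tilde
    lam = -g / (beta @ beta)          # from theta - beta.(w~ - lam*beta) = 0
    return w_tilde - lam * beta

w = enforce_sum_constraint(np.array([0.7, 0.6]), np.array([1.0, 1.0]), 1.0)
print(w, w.sum())                     # -> [0.55 0.45] 1.0
```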
Consider now the effect of coordinate transformations on the normalization rules derived by the method of Lagrangian multipliers. The constraint in equation 2.17 can be written as $g_n(w) = \theta_n - \sum_{i\in I_n} w_i \ge 0$ and leads to a subtractive normalization rule as in the example above (see equation 2.5). Under the coordinate transformation $C^w$ ($w_i = \frac{1}{4}v_i^2$), the constraint becomes $g_n(v) = \theta_n - \sum_{i\in I_n}\frac{1}{4}v_i^2 \ge 0$, and in the coordinate system $V^w$, the normalization rule is, if $g_n(\tilde v) < 0$:

$$v_i = \tilde v_i - 2\left(1 - \frac{\sqrt{\theta_n}}{\sqrt{\sum_{j\in I_n}\frac{1}{4}\tilde v_j^2}}\right)\frac{1}{2}\,\tilde v_i \tag{5.3}$$
$$= \frac{\sqrt{\theta_n}\,\tilde v_i}{\sqrt{\sum_{j\in I_n}\frac{1}{4}\tilde v_j^2}} \quad \text{for all } i \in I_n. \tag{5.4}$$

Taking the square on both sides and applying the backtransformation from $V^w$ to $W$ leads to, if $g_n(\tilde w) < 0$:

$$w_i = \frac{\theta_n\,\tilde w_i}{\sum_{j\in I_n}\tilde w_j} \quad \text{for all } i \in I_n. \tag{5.5}$$
This is a multiplicative normalization rule, in contrast to the subtractive one obtained in the coordinate system $W$ (see also Figure 6). It is listed as normalization rule $N^w_\ge$ in Table 1 (or $N^w_=$ for constraint $g(w) = 0$). This multiplicative rule is commonly found in the literature (cf. Table 2), but it is not orthogonal in $W$, though it is in $V^w$. For a more general coordinate transformation $w_i = w_i(v_i)$ and a constraint function $g(w)$, an orthogonal normalization rule can be derived in $V$ with the method of Lagrangian multipliers and transformed back into $W$, which results in general in a nonorthogonal normalization rule. If the constraint is violated:

$$w_i = \tilde w_i + \lambda\left(\frac{dw_i}{d\tilde v_i}\right)^2\frac{\partial g}{\partial \tilde w_i} + O(\lambda^2). \tag{5.6}$$
The $\lambda$ actually would have to be calculated in $V$, but since $\lambda \propto \Delta t$, second- and higher-order terms can be neglected for small $\Delta t$, and $\lambda$ can be calculated such that $g(w) = 0$. Notice the similar effect of the coordinate transformation on the growth rules (see equation 4.8) as well as on the normalization rules (see equation 5.6). In both cases, a factor $(dw_i/dv_i)^2$ is added to the modification rate. As for gradient flows derived from objective functions, for a more general coordinate transformation $w_i = w_i(v)$, equation 5.6 would have to be modified accordingly. We indicate these normalization rules by a subscript $=$ (for an equality) and $\ge$ (for an inequality), because the constraints are enforced immediately and exactly.
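The chain of equations 5.3 through 5.5 can be verified numerically: performing the orthogonal subtractive step in the coordinate system $V^w$ and transforming back reproduces the multiplicative rule of equation 5.5 applied directly in $W$. A minimal sketch with made-up values:

```python
import numpy as np

# Constraint g(w) = theta - sum(w) >= 0 under C^w (w_i = v_i^2 / 4).
theta, w_tilde = 1.0, np.array([0.8, 0.5])

# Orthogonal subtractive step in V coordinates ...
v = 2.0 * np.sqrt(w_tilde)                            # V-coordinates of w~
v *= np.sqrt(theta) / np.sqrt(np.sum(v**2) / 4.0)     # eq. 5.4
w_via_V = v**2 / 4.0                                  # back into W

# ... equals the multiplicative rule of equation 5.5 applied directly in W.
w_direct = theta * w_tilde / w_tilde.sum()
print(np.allclose(w_via_V, w_direct))                 # -> True
```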
5.2.2 Integrated Normalization Without Objective Function. Growth rule and explicit normalization rule as derived by the method of Lagrangian multipliers can be combined in one dynamical equation. As an example, consider the growth rule $\dot w_i = f_i$, that is, $\tilde w_i(t + \Delta t) = w_i(t) + \Delta t f_i(t)$, where $f_i$ is an arbitrary function in $w$ and can be interpreted as a fitness of a synapse. Together with the normalization rule $N^w_=$ (see equation 5.5) and assuming $\sum_{j\in I} w_j(t) = \theta$, it follows that (von der Malsburg & Willshaw, 1981):

$$w_i(t + \Delta t) = \frac{\theta\,\bigl[w_i(t) + \Delta t f_i(t)\bigr]}{\sum_{j\in I}\bigl[w_j(t) + \Delta t f_j(t)\bigr]} \tag{5.7}$$
$$= w_i(t) + \Delta t f_i(t) - \Delta t\,\frac{w_i(t)}{\theta}\sum_{j\in I} f_j(t) + O(\Delta t^2) \tag{5.8}$$
$$\Longrightarrow\quad \dot w_i(t) = f_i(t) - \frac{w_i(t)}{\theta}\sum_{j\in I} f_j(t), \tag{5.9}$$

and with $W(t) = \sum_{i\in I} w_i(t)$,

$$\dot W(t) = \left(1 - \frac{W(t)}{\theta}\right)\sum_{j\in I} f_j(t), \tag{5.10}$$
which shows that $W = \theta$ is indeed a stable fixed point under the dynamics of equation 5.9. However, this is not always the case. The same growth rule combined with the subtractive normalization rule $N^1_=$ (see equation 2.5) would yield a dynamics that provides only a neutrally stable fixed point for $W = \theta$. An additional term $(\theta - \sum_{j\in I} w_j(t))$ would have to be added to make the fixed point stable. This is the reason that this type of normalization rule is listed in Table 1 only for $C^w$. We indicate these kinds of normalization rules by the subscript $\simeq$, because the dynamics smoothly approaches the constraint surface and then stays there exactly. Notice that this method differs from the standard method of Lagrangian multipliers, which also yields a dynamics such that $w$ remains on the constraint surface. The latter applies only to the dynamics at $g(w) = 0$ and always produces neutrally stable fixed points, because $\sum_i \dot w_i(t)\,\partial g/\partial w_i = 0$ is required by definition. If applied to a weight vector outside the constraint surface, the standard method of Lagrangian multipliers yields $g(w) = \text{const} \ne 0$. An advantage of this method is that it provides one dynamics for the growth rule as well as the normalization rule and that the constraint is enforced exactly. However, difficulties arise when interfering constraints are combined; that is, different constraints affect the same weights. This type of formulation is required for certain types of analyses (e.g., Häussler & von der Malsburg, 1983). A disadvantage is that in general there no longer exists an
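That the integrated normalization of equation 5.9 smoothly pulls the total weight onto the constraint surface, as equation 5.10 predicts, can be seen in a short simulation (Euler integration; the positive fitness function $f_i$ is an arbitrary choice made up for this sketch):

```python
import numpy as np

# Integrated normalization (eq. 5.9): w' = f - (w/theta) * sum(f).
theta, dt = 1.0, 0.01
w = np.array([0.2, 0.5, 0.1])            # sum = 0.8, off the constraint surface

for _ in range(2000):
    f = w * (1.0 + w)                    # some arbitrary positive fitness f_i(w)
    w += dt * (f - w / theta * f.sum())

print(w.sum())                           # -> approaches theta = 1.0
```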
objective function for the dynamics, though the growth term itself without the normalization term still has an objective function that is a Lyapunov function for the combined dynamics.

5.2.3 Penalty Terms. Another method of enforcing the constraints is to add penalty terms to the objective function (e.g., Bienenstock & von der Malsburg, 1987). For instance, if the constraint is formulated as an equality $g(w) = 0$, then add $-\frac{1}{2}g^2(w)$; if the constraint is formulated as an inequality $g(w) \le 0$ or $g(w) \ge 0$, then add $\ln|g(w)|$. Other penalty functions, such as $g^4$ and $1/g$, are possible as well, but those used here induce the required terms as used in the literature. The effect of coordinate transformations is the same as in the case of objective functions. Consider, for example, the simple constraint $g_i(w) = w_i \ge 0$ ($I_\ge$ in Table 1), which keeps weights $w_i$ nonnegative. The respective penalty term is $\ln|w_i|$ ($I_>$), and the induced dynamics under the four different transformations considered in Table 1 are $1/w_i$, $\alpha_i/w_i$, $1$, and $\alpha_i$. An advantage of this approach is that a coherent objective function, as well as a weight dynamics, is available, including growth rules and normalization rules. A disadvantage may be that the constraints are only approximate and not enforced strictly, so that $g(w) \approx 0$ and $g(w) < 0$ or $g(w) > 0$. We therefore indicate these kinds of normalization rules by subscripts $\approx$ and $>$. However, the approximation can be made arbitrarily precise by weighting the penalty terms accordingly.

5.2.4 Constraints Introduced by Coordinate Transformations. An entirely different way by which constraints can be enforced is by means of a coordinate transformation. Consider, for example, the coordinate transformation $C^w$ ($w_i = \frac{1}{4}v_i^2$). Negative weights are not reachable under this coordinate transformation, because the factor $(dw_i/dv_i)^2 = w_i$ added to the growth rules (see equation 4.8) as well as to the normalization rules (see equation 5.6) makes the weight dynamics of weight $w_i$ slow down as it approaches zero, so that positive weights always stay positive. (This can be generalized to positive and negative weights by the coordinate transformation $w_i = \frac{1}{4}v_i|v_i|$.) Thus the coordinate transformation $C^w$ (and also $C^{\alpha w}$) implicitly introduces the limitation constraint $I_>$. This is interesting because it shows that a coordinate transformation can substitute for a constraint, which is well known in optimization theory. The choice of whether to enforce the constraints by explicit normalization rules, an integrated dynamics without an objective function, penalty terms, or even implicitly by a coordinate transformation depends on the system as well as the methods applied to analyze it. Table 1 shows several constraint functions and their corresponding normalization rules as derived in different coordinate systems and by the three different methods discussed above. Not shown is normalization implicit in a coordinate transformation.
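Both the $I_>$ penalty of section 5.2.3 and the implicit constraint of section 5.2.4 keep a weight positive by letting its dynamics slow down or be repelled near zero. A one-dimensional sketch of the penalty variant (all parameter values are made up for illustration): the gradient of $\gamma\ln|w|$ balances a constant decay, so $w \ge 0$ is enforced only approximately, with $w$ settling at order $\gamma$.

```python
# Penalty term gamma * ln|w| for the constraint w >= 0 (I_> in Table 1):
# its gradient gamma / w diverges at the boundary and repels w from 0.
gamma, dt = 0.01, 0.001
w = 0.5
for _ in range(10000):
    w += dt * (-1.0 + gamma / w)   # decay term -1 pushes w down, penalty resists
print(w)                           # -> settles near gamma (0.01), never below 0
```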
It is interesting that there are only two types of constraints. All variations arise from using different coordinate systems and different methods by which the normalization rules are implemented. The first type is a limitation constraint $I$, which limits the range of individual weights. The second type is a normalization constraint $N$, which affects a group of weights, usually the sum, very rarely the sum of squares, as indicated by $Z$. In the next section we show how to use Table 1 for analyzing models of neural map formation and give some examples from the literature.

6 Examples and Applications

6.1 How to Use Table 1. The aim of Table 1 is to provide an overview of the different objective functions and derived growth terms as well as the constraint functions and derived normalization rules and terms discussed in this article. The terms and rules are ordered in columns belonging to a particular coordinate transformation $C$. Only entries in the same column may be combined to obtain a consistent, constrained optimization formulation for a system. However, some terms can be derived under different coordinate transformations. For instance, the normalization rule $I_=$ is the same for all coordinate transformations, and term $L^{\alpha w}$ with $\beta_i = 1/\alpha_i$ is the same as term $L^w$ with $\beta_i = 1$. To analyze a model of neural map formation, first identify possible candidates in Table 1 representing the different terms of the desired dynamics. Notice that the average activity of output neurons is represented by $\langle a_\tau\rangle = \sum_j A_{ij}w_j$ and that the correlation between input and output neurons is represented by $\langle a_\tau a_\rho\rangle = \sum_j D_{ij}w_j$. Usually both terms will be only an approximation of the actual mean activities and correlations of the system under consideration (cf. section 2.1). Notice also that normalization rules $N^w_=$, $N^{\alpha w}_=$, $Z^1_=$, and $Z^\alpha_=$ are actually multiplicative normalization rules and not subtractive ones, as might be suggested by the special form in which they are written in Table 1. Next identify the column in which all terms of the weight dynamics can be represented. This gives the coordinate transformation under which the model can be analyzed through the objective functions and constraint or penalty functions listed on the left side of the table. Equivalent models (cf. section 6.4) can be derived by moving from one column to another and by using normalization rules derived by a different method. Thus, Table 1 provides a convenient tool for checking whether a system can be analyzed within the constrained optimization framework presented here and for identifying the equivalent models. The function of each term can be coherently interpreted with respect to the objective, constraint, and penalty functions on the left side. The table can be extended with respect to additional objective, constraint, and penalty functions, as well as additional coordinate transformations. Although the table is compact, it suffices to
Table 1: Objective Functions, Constraint Functions, and the Dynamics Terms Induced in Different Coordinate Systems.

Coordinate transformations:
$C^1$: $w_i = v_i$, $(dw_i/dv_i)^2 = 1$; $\quad C^\alpha$: $w_i = \sqrt{\alpha_i}\,v_i$, $(dw_i/dv_i)^2 = \alpha_i$; $\quad C^w$: $w_i = \frac{1}{4}v_i^2$, $(dw_i/dv_i)^2 = w_i$; $\quad C^{\alpha w}$: $w_i = \frac{1}{4}\alpha_i v_i^2$, $(dw_i/dv_i)^2 = \alpha_i w_i$.

Objective functions $H(w)$ and the growth terms they induce ($\dot w_i = \cdots + \cdots$), in the column order $C^1 \mid C^\alpha \mid C^w \mid C^{\alpha w}$:
$L = \sum_i\beta_i w_i$: $\quad \beta_i \mid \alpha_i\beta_i \mid \beta_i w_i \mid \alpha_i\beta_i w_i$
$Q = \frac{1}{2}\sum_{ij}w_i D_{ij}w_j$: $\quad \sum_j D_{ij}w_j \mid \alpha_i\sum_j D_{ij}w_j \mid w_i\sum_j D_{ij}w_j \mid \alpha_i w_i\sum_j D_{ij}w_j$

Constraint functions $g(w)$ and the normalization rules they induce (applied if the constraint is violated, with $\tilde w_i = w_i + \Delta t(\cdots + \cdots)$; for all $i \in I_n$), same column order:
$I_=, I_\ge$: $g = \theta_i - w_i$: $\quad w_i = \theta_i$ in all four coordinate systems
$N_=, N_\ge$: $g = \theta_n - \sum_{j\in I_n}\beta_j w_j$: $\quad w_i = \tilde w_i + \lambda_n\beta_i \mid \tilde w_i + \lambda_n\alpha_i\beta_i \mid \tilde w_i + \lambda_n\beta_i\tilde w_i \mid \tilde w_i + \lambda_n\alpha_i\beta_i\tilde w_i$
$Z_=, Z_\ge$: $g = \theta_n - \sum_{j\in I_n}\beta_j w_j^2$: $\quad w_i = \tilde w_i + \lambda_n\beta_i\tilde w_i \mid \tilde w_i + \lambda_n\alpha_i\beta_i\tilde w_i \mid \tilde w_i + \lambda_n\beta_i\tilde w_i^2 \mid \tilde w_i + \lambda_n\alpha_i\beta_i\tilde w_i^2$

Integrated normalization term ($\dot w_i = \cdots$; listed only for $C^w$):
$N_\simeq$: $\quad f_i - \frac{w_i}{\theta_n}\sum_j f_j$

Penalty functions $H(w)$ and the normalization terms they induce ($\dot w_i = \cdots + \cdots$), same column order:
$I_\approx$: $-\frac{1}{2}\gamma_i(\theta_i - w_i)^2$: $\quad \gamma_i(\theta_i - w_i) \mid \alpha_i\gamma_i(\theta_i - w_i) \mid \gamma_i w_i(\theta_i - w_i) \mid \alpha_i\gamma_i w_i(\theta_i - w_i)$
$I_>$: $\gamma_i\ln|\theta_i - w_i|$: $\quad -\frac{\gamma_i}{\theta_i - w_i} \mid -\frac{\alpha_i\gamma_i}{\theta_i - w_i} \mid -\frac{\gamma_i w_i}{\theta_i - w_i} \mid -\frac{\alpha_i\gamma_i w_i}{\theta_i - w_i}$
$N_\approx$: $-\frac{1}{2}\gamma_n(\theta_n - \sum_{j\in I_n}\beta_j w_j)^2$: $\quad \beta_i\gamma_n(\theta_n - \sum_j\beta_j w_j) \mid \alpha_i\beta_i\gamma_n(\theta_n - \sum_j\beta_j w_j) \mid \beta_i\gamma_n w_i(\theta_n - \sum_j\beta_j w_j) \mid \alpha_i\beta_i\gamma_n w_i(\theta_n - \sum_j\beta_j w_j)$

Note: $C$ indicates a coordinate transformation that is specified by a superscript. $L$ indicates a linear term. $Q$ indicates a quadratic term that is usually induced by correlations $\langle a_\tau a_\rho\rangle = \sum_j D_{ij}w_j$, but it can also account for mean activities $\langle a_\tau\rangle = \sum_j A_{ij}w_j$. $I$ indicates a limitation constraint that limits the range for individual weights ($I$ may stand for "interval"). $N$ indicates a normalization constraint that limits the sum over a set of weights. $Z$ is a rarely used variation of $N$ (the symbol $Z$ can be thought of as a rotated $N$). Subscript signs distinguish between the different ways in which constraints can be enforced. $I^w_\approx$, for instance, indicates the normalization term $\gamma_i w_i(\theta_i - w_i)$ induced by the penalty function $-\frac{1}{2}\gamma_i(\theta_i - w_i)^2$ under the coordinate transformation $C^w$. Subscripts $n$ and $i$ for $\theta$, $\lambda$, and $\gamma$ denote different constraints of the same type, for example, the same constraint applied to different output neurons. Normalization terms are integrated into the dynamics directly, while normalization rules are applied iteratively to the dynamics of the growth rule. $f_i$ denotes a fitness by which a weight would grow without any normalization (cf. section 5.2.2).
explain a wide range of representative examples from the literature, as discussed in the next section.

6.2 Examples from the Literature. Table 2 shows representative models from the literature. The original equations are listed, as well as the classification in terms of growth rules and normalization rules listed in Table 1. Detailed comments for these models and the model in Amari (1980) follow below. The latter is not listed in Table 2 because it cannot be interpreted within our constrained optimization framework. The dynamics of the introductory example of section 2 can be classified as $Q^1$ (see equation 2.3), $I^1_\ge$ (see equation 2.4), and $N^1_\ge$ (see equations 2.5 and 2.6). The models are discussed here mainly with respect to whether they can be consistently described within the constrained optimization framework, that is, whether growth rules and normalization rules can be derived from objective functions and constraint functions under one coordinate transformation (that does not imply anything about the quality of a model). Another important issue is whether the linear correlation model introduced in section 3 is an appropriate description for the activity dynamics of these models. It is an accurate description for some of them, but others are based on nonlinear models, and the approximations discussed in section 2.1 and appendix A have to be made. Models typically contain three components: the quadratic term $Q$ to induce neighborhood-preserving maps, a limitation constraint $I$ to keep synaptic weights positive, and a normalization constraint $N$ (or $Z$) to induce competition between weights and to keep weights limited. The limitation constraint can be waived for systems with positive weights and multiplicative normalization rules (Konen & von der Malsburg, 1993; Obermayer et al., 1990; von der Malsburg, 1973) (cf. section 5.2.4). A presynaptic normalization rule can be introduced implicitly by the activity dynamics (cf. section A.2 in the appendix). In that case, it may be necessary to use an explicit presynaptic normalization constraint in the constrained optimization formulation. Otherwise the system may have a tendency to collapse on the input layer (see section 6.3), a tendency it does not have in the original formulation as a dynamical system. Only few systems contain the linear term $L$, which can be used for dynamic link matching. In Häussler and von der Malsburg (1983) the linear term was introduced for analytical convenience and does not differentiate between different links. The two models of dynamic link matching (Bienenstock & von der Malsburg, 1987; Konen & von der Malsburg, 1993) introduce similarity values implicitly and not through the linear term. The models are now discussed individually in chronological order.
Table 2: Examples of Weight Dynamics from Previous Studies.

von der Malsburg (1973). Weight dynamics: $\tilde w_{\tau\rho} = w_{\tau\rho} + h\,a_\rho a_\tau$; $w_{\tau\rho} = \tilde w_{\tau\rho}\cdot 19\,w^2/\tilde w_\tau$, with $\tilde w_\tau = \sum_{\rho=1}^{19}\tilde w_{\tau\rho}$. Classification: $Q^1$, $N^w_=$.

Whitelaw and Cowan (1981). Weight dynamics: $\dot w_{\tau\rho} = \alpha_{\tau\rho}\,a_\rho a_\tau - \alpha\,a_\tau + \Omega$ ($\Omega$: small noise term); $\sum_{\rho'}w_{\tau\rho'} = 1$, $\sum_{\tau'}w_{\tau'\rho} = 1$. Original equations: (2), (5). Classification: $Q^\alpha - Q^1 + {?}$, $N^?_=$.

Häussler and von der Malsburg (1983). Weight dynamics: $\dot w_{\tau\rho} = f_{\tau\rho} - \frac{w_{\tau\rho}}{2N}\bigl(\sum_{\tau'}f_{\tau'\rho} + \sum_{\rho'}f_{\tau\rho'}\bigr)$, with $f_{\tau\rho} = \alpha + \beta\,w_{\tau\rho}C_{\tau\rho}$ and $C_{\tau\rho} = \sum_{\tau'\rho'}D_{\tau\tau'}D_{\rho\rho'}w_{\tau'\rho'}$. Original equations: (2.1)–(2.3). Classification: $(I^w_> + Q^w)$ and $(L^w + N^w_\simeq)$.

Linsker (1986). Weight dynamics: $\dot w_{\tau\rho} = k_1 + \frac{1}{N_G}\sum_{\rho'}\bigl(Q^F_{\rho\rho'} + k_2\bigr)w_{\tau\rho'} + R_b\sum_{\tau'}f_{\tau\tau'}\bigl[k_{1a} + \frac{1}{N_G}\sum_{\rho'}\bigl(Q^F_{\rho\rho'} + k_2\bigr)w_{\tau'\rho'}\bigr]$; some $w_{\tau\rho}\in[0,1]$ and some $w_{\tau\rho}\in[-1,0]$, or all $w_{\tau\rho}\in[-0.5,0.5]$. ($k_1' = k_1 + R_b k_{1a}\sum_{\tau'}f_{\tau\tau'}$; $D_{\tau\tau'} = R_b f_{\tau\tau'} + \delta_{\tau\tau'}$, $\delta_{\tau\tau'}$ Kronecker; $D_{\rho\rho'} = \langle a_\rho a_{\rho'}\rangle$; $A_\rho = \langle a_\rho\rangle$; $k_2 < 0$.) Original equation: (5). Classification: $L^1 + Q^1$, $I^1_\ge$.

Bienenstock and von der Malsburg (1987). Energy function: $H = -\sum_{\tau\tau'\rho\rho'} w_{\tau\rho}D_{\tau\tau'}D_{\rho\rho'}w_{\tau'\rho'} + \gamma\sum_{\tau}\bigl(\sum_{\rho'}w_{\tau\rho'} - p_0\bigr)^2 + \gamma\sum_{\rho}\bigl(\sum_{\tau'}w_{\tau'\rho} - p_0\bigr)^2$; $w_{\tau\rho}\in[0,T_{\tau\rho}]$. Original equation: (2). Classification: $Q^1 + N^1_\approx$, $I^1_\ge$.

Miller, Keller, and Stryker (1989). Weight dynamics: $\dot w^L_{\tau\rho} = \lambda\,\alpha_{\tau\rho}\sum_{\tau'\rho'}D_{\tau\tau'}\bigl(D^{LL}_{\rho\rho'}w^L_{\tau'\rho'} + D^{LR}_{\rho\rho'}w^R_{\tau'\rho'}\bigr) - \gamma\,w^L_{\tau\rho} + \epsilon\,\alpha_{\tau\rho}$; (a) $\sum_{\rho'}\bigl(w^L_{\tau\rho'} + w^R_{\tau\rho'}\bigr) = 2\sum_{\rho'}\alpha_{\tau\rho'}$, $w^L_{\tau\rho} = \tilde w^L_{\tau\rho} + \lambda_\tau\alpha_{\tau\rho}$ (note 23); (b) $\sum_{\tau'}w^L_{\tau'\rho} = \mathrm{const}$, $w^L_{\tau\rho} = \tilde w^L_{\tau\rho} + \lambda_\rho\alpha_{\tau\rho}$; $w^L_{\tau\rho}\in[0, 8\alpha_{\tau\rho}]$ (if weights were cut due to $I^\alpha_\ge$: $w^L_{\tau\rho} = \tilde w^L_{\tau\rho} + \lambda_\tau\tilde w^L_{\tau\rho}$). Interchanging $L$ (left eye) and $R$ (right eye) yields the equations for $w^R_{\tau\rho}$. Original equation: (1). Classification: $Q^\alpha - I^\alpha_\approx$, $N^\alpha_=$, $N^\alpha_=$, $I^\alpha_\ge$ ($N^w_=$).

Obermayer et al. (1990). Weight dynamics: $w_{\tau\rho}(t+1) = \bigl(w_{\tau\rho}(t) + \epsilon(t)\,a_\tau(t)\,a_\rho\bigr)\big/\sqrt{\sum_{\rho'}\bigl(w_{\tau\rho'}(t) + \epsilon(t)\,a_\tau(t)\,a_{\rho'}\bigr)^2}$. Original equation: (3.5). Classification: $Q^1$, $Z^1_=$.

Tanaka (1990). Weight dynamics: $\dot w_{\tau\rho} = w_{\tau\rho}\bigl(\kappa_0 - \kappa_1\sum_{\rho'}\beta_{\rho'}w_{\tau\rho'}\bigr) + g\,m_\tau\,w_{\tau\rho}\,a_\rho + \gamma_{\tau\rho}$ (later in the article $\beta_{\rho'} = 1$; the $w_{\tau\rho}$ are the "effective couplings" $J_{\tau\rho}T_{\tau\rho}$). Original equation: (2.1). Classification: $N^{\alpha w}_\approx + Q^w + I^w_>$ ($N^{\alpha w}_\approx = N^w_\approx$).

Goodhill (1993). Weight dynamics: $w_{\tau\rho} = w_{\tau\rho} + \alpha\,a_\rho a_\tau$; (a) $w_{\tau\rho} = w_{\tau\rho} - t$ if $w_{\tau\rho} - t > 0$ and $w_{\tau\rho} = 0$ otherwise, with $t = \bigl(\sum_{\rho'}w_{\tau\rho'} - N_\tau\bigr)/n_\tau$ and $n_\tau = |\{\rho'\mid 0 < w_{\tau\rho'}\}|$ (if some weights have become zero due to $I^1_\ge$: $w_{\tau\rho} = N_\tau w_{\tau\rho}/\sum_{\rho'}w_{\tau\rho'}$); (b) $w_{\tau\rho} = N_\rho w_{\tau\rho}/\sum_{\tau'}w_{\tau'\rho}$. Original equation: (4). Classification: $Q^1$, $N^1_=$, $I^1_\ge$ ($N^w_=$), $N^w_=$.

Konen and von der Malsburg (1993). Weight dynamics: $w_{\tau\rho} \to w_{\tau\rho} + \epsilon\,w_{\tau\rho}\,\alpha_{\tau\rho}\,a_\tau a_\rho$; $w_{\tau\rho} \to w_{\tau\rho}\big/\sum_{\rho'}(w_{\tau\rho'}/\alpha_{\tau\rho'})$; $w_{\tau\rho} \to w_{\tau\rho}\big/\sum_{\tau'}(w_{\tau'\rho}/\alpha_{\tau'\rho})$. Original equation: (1). Classification: $Q^{\alpha w}$, $N^{\alpha w}_=$, $N^{\alpha w}_=$.

Note: The original equations are written in a form that uses the notation of this article. The classification of the original equations by means of the terms and coordinate transformations listed in Table 1 is given for each model (the coordinate transformations are indicated by superscripts). See section 6.2 for further comments on these models.
von der Malsburg (1973): The activity dynamics of this model is nonlinear and based on hexagon patterns in the output layer. Thus, the applicability of the linear correlation model is not certain (cf. section 2.1). The weight dynamics is inconsistent in its original formulation. However, Miller and MacKay (1994) have shown that constraints $N^w_=$ and $Z^1_=$ have a very similar effect on the dynamics, so that the weight dynamics could be made consistent by using $Z^1_=$ instead of $N^w_=$. No limitation constraint is necessary because neither the growth rule nor the multiplicative normalization rule can lead to negative weights, and the normalization rule limits the growth of positive weights.

Amari (1980): This is a particularly interesting model not listed in Table 2. It is based on a blob dynamics, but no explicit normalization rules are applied, so that the derivation of correlations and mean activities as discussed in section 3 cannot be used. Weights are prevented from growing infinitely by a simple decay term, which is possible because correlations induced by the blob model are finite and do not grow with the total strength of the synapses. Additional inhibitory inputs received by the output neurons from a constantly active neuron ensure that the average activity is evenly distributed in the output layer, which also leads to expanding maps. In this respect, the architecture deviates from Figure 2. Thus, this model cannot be formulated within our framework.

Whitelaw and Cowan (1981): The activity dynamics is nonlinear and based on blobs. Thus, the linear correlation model is only an approximation (cf. section 2.1). The weight dynamics is difficult to interpret in the constrained optimization framework. The normalization rule is not specified precisely, but it is probably multiplicative, because a subtractive one would lead to negative weights and possibly infinite weight growth. The quadratic term $-Q^1$ is based on mean activities and would lead by itself to zero weights. The $\Omega$ term was introduced only to test the stability of the system.

Häussler and von der Malsburg (1983): This model is directly formulated in terms of weight dynamics; thus, the linear correlation model is accurate. The weight dynamics is consistent; however, as argued in section 5.2.2, there is usually no objective function for the normalization rule $N^w_\simeq$, but by replacing $N^w_\simeq$ by $N^w_=$ or $N^w_\approx$, the system can be expressed as a constrained optimization problem without qualitatively changing the model behavior. The limitation term $I^w_>$ and the linear term $L^w$ are induced by the constant $\alpha$ and were introduced for analytical reasons. The former is meant to allow weights to grow from zero strength, and the latter limits this growth. $\alpha$ needs to be small for neural map formation, and for a stable one-to-one mapping, $\alpha$ strictly should be zero. Thus, these two terms could be discarded if all weights were initially larger than zero. Notice that the linear term does not differentiate between different links and thus does not have a function as suggested for dynamic link matching (cf. sections 4 and 6.5).

Linsker (1986): This model is also directly formulated in terms of weight dynamics; thus, the linear correlation model is accurate. The weight dynamics is consistent. Since the model uses negative and positive weights
and weights have a lower and an upper bound, no normalization rule is necessary. The weights converge to their upper or lower limit.

Bienenstock and von der Malsburg (1987): This is a model of dynamic link matching and was originally formulated in terms of an energy function. Thus the classification is accurate. The energy function does not include the linear term. The features are binary, black versus white, and the similarity values are therefore 0 and 1 and do not enter the dynamics as continuous similarity values. The $T_{\tau\rho}$ in the constraint $I^1_\ge$ represent the stored patterns in the associative memory, not similarity values.

Miller et al. (1989): This model is directly formulated in terms of weight dynamics; thus, the linear correlation model is accurate. One inconsistent part in the weight dynamics is the multiplicative normalization rule $N^w_=$, which is applied when subtractive normalization leads to negative weights. But it is only an algorithmic shortcut to solve the problem of interfering constraints (limitation and subtractive normalization). A more systematic treatment of the normalization rules could replace this inconsistent rule (cf. section 5.2.1). Another inconsistency is that weights that reach their upper or lower limit become frozen, or fixed at the limit value. With some exception, this seems to have little effect on the resulting maps (Miller et al., 1989, n. 23). Thus, this model has only two minor inconsistencies, which could be modified to make the system consistent. Limitation constraints enter the weight dynamics in two forms, $I^\alpha_\approx$ and $I^\alpha_\ge$. The former tends to keep $w^L_{\tau\rho} = -\frac{\epsilon}{\gamma}\alpha_{\tau\rho}$, while the latter keeps $w^L_{\tau\rho} \in [0, 8\alpha_{\tau\rho}]$, which can unnecessarily introduce conflicts. However, $\gamma = \epsilon = 0$, so that only the latter constraint applies, and the $I^\alpha_\approx$ term is discarded in later publications. In principle, the system can be simplified by using coordinate transformation $C^1$ instead of $C^\alpha$, thereby eliminating $\alpha_{\tau\rho}$ in the growth rule $Q^\alpha$ as well as in the normalization rule $N^\alpha_=$, but not in the normalization rule $I^\alpha_\ge$. This is different from setting $\alpha_{\tau\rho}$ to a constant in a certain region. Using coordinate transformation $C^1$ would result in the same set of stable solutions, though the trajectories would differ. Changing $\alpha_{\tau\rho}$ generates a different set of solutions. However, the original formulation using $C^\alpha$ is more intuitive and generates the "correct" trajectories, those that correspond to the intuitive interpretation of the model.

Obermayer et al. (1990): This model is based on an algorithmic blob model, and the linear correlation model is only an approximation (cf. the appendix). The weight dynamics is consistent. It employs the rarely used normalization constraint $Z$, which induces a multiplicative normalization rule under the coordinate transformation $C^1$. No limitation constraint is necessary because neither the growth rule nor the multiplicative normalization rule can lead to negative weights, and positive weights are limited by the normalization rule.

Tanaka (1990): This model uses a nonlinear input-output function for the neurons, which makes a clear distinction between membrane potential and
firing rate. However, this nonlinearity does not seem to play a specific functional role and is partially eliminated by linear approximations. Thus, the linear correlation model seems to be justified. The weight dynamics includes parameters $\beta_{\rho'}$ ($f_{SP}$ in the original notation), which make it inconsistent. The penalty term $N^{\alpha w}_\approx$, which induces the first terms of the weight dynamics, is $-\frac{1}{2\kappa_1}\sum_{\tau'}\bigl(\kappa_0 - \kappa_1\sum_{\rho'}\beta_{\rho'}w_{\tau'\rho'}\bigr)^2$, which has to be evaluated under the coordinate transformation $C^{\alpha w}$ with $\alpha_{\tau\rho} = 1/\beta_\rho$. Later in the article, the parameters $\beta_{\rho'}$ are set to 1, so that the system becomes consistent. Tanaka gives an objective function for the dynamics, employing a coordinate transformation for this purpose. The objective function is not listed here because it is derived under a different set of assumptions, including the nonlinear input-output function of the output neurons and a mean field approximation.

Goodhill (1993): This model is based on an algorithmic blob model, and the linear correlation model is only an approximation (cf. the appendix). Like the model in Miller et al. (1989), this model uses an inconsistent normalization rule as a backup, and it freezes weights that reach their upper or lower limit. In addition, it uses an inconsistent normalization rule for the input neurons. But since this inconsistent multiplicative normalization for the input neurons is applied after a consistent subtractive normalization for the output neurons, its effect is relatively weak, and substituting it by a subtractive one would make little difference (G. J. Goodhill, personal communication). To avoid dead units (neurons in the output layer that never become active), Goodhill (1993) divides each output activity by the number of times each output neuron has won the competition for the blob in the output layer. This guarantees a roughly equal average activity of the output neurons. With the probabilistic blob model (cf. the appendix), dead units do not occur as long as output neurons have any input connections. The specific parameter setting of the model even guarantees a roughly equal average activity of the output neurons under the probabilistic blob model, because the sum over the weights converging on an output neuron is roughly the same for all neurons in the output layer. Thus, despite some inconsistencies, this model can probably be well approximated within the constrained optimization framework.

Konen and von der Malsburg (1993): The activity dynamics is nonlinear and based on blobs. Thus the linear correlation model is only an approximation (cf. section 2.1). The weight dynamics is consistent. Although this is a model of dynamic link matching, it does not contain the linear term to bias the links. It introduces the similarity values in the constraints and through the coordinate transformation $C^{\alpha w}$ (see section 6.4). No limitation constraint is necessary because neither the growth rule nor the multiplicative normalization rule can lead to negative weights, and positive weights are limited by the normalization rule.

6.3 Some Functional Aspects of Term Q. So far the focus of the considerations has been only on formal aspects of models of neural map formation.
In this section some remarks on functional aspects of the quadratic term $Q$ are made. Assume the effective lateral connectivities in the output layer and in the input layer are sums of positive and/or negative contributions. Each contribution can be either a constant, $C$, or a centered gaussian-like function, $G$, which depends only on the distance of the neurons, for example, $D_{\rho\rho'} = D_{|\rho-\rho'|}$ if $\rho$ is a spatial coordinate. The contributions can be indicated by subscripts to the objective function $Q$. The first index indicates the lateral connectivity of the input layer, the second index that of the output layer. A negative gaussian (constant) would have to be indicated by $-G$ ($-C$). $Q_{(-C)G}$, for instance, would indicate a negative constant $D_{\rho\rho'}$ and a positive gaussian $D_{\tau\tau'}$. $Q_{G(G-G')}$ would indicate a positive gaussian $D_{\rho\rho'}$ and a $D_{\tau\tau'}$ that is a difference of gaussians. Notice that negative signs can cancel each other, for example, $Q_{(G-C)G} = -Q_{(C-G)G} = -Q_{(G-C)(-G)}$. We thus discuss the terms only in their simplest form: $-Q_{CG}$ instead of $Q_{(-C)G}$. All feedforward weights are assumed to be positive. Assuming all weights to be negative would lead to equivalent results, because $Q$ does not change if all weights change their sign. The situation becomes more complex if some weights were positive and others negative. A term $Q$ is called positive if it can be written in a form where it has a positive sign and only positive contributions; for example, $-Q_{(-C)G} = Q_{CG}$ is positive, while $Q_{(G-C)G}$ is not. Since $Q$ is symmetrical with respect to $D_{\rho\rho'}$ and $D_{\tau\tau'}$, a term such as $Q_{(G-C)G}$ has the same effect as $Q_{G(G-C)}$ with the roles of input layer and output layer exchanged. A complicated term can be analyzed most easily by splitting it into its elementary components. For instance, the term $Q_{G(G-C)}$ can be split into $Q_{GG} - Q_{GC}$ and analyzed as a combination of these two simpler terms. Some elementary terms are now discussed in greater detail. The effect of the terms is considered under two types of constraints. In constraint A, the total sum of weights is constrained, $\sum_{\rho'\tau'} w_{\rho'\tau'} = 1$. In constraint B, the sums of weights originating from an input neuron, $\sum_{\tau'} w_{\rho\tau'} = 1/R$, or terminating on an output neuron, $\sum_{\rho'} w_{\rho'\tau} = 1/T$, are constrained, where $R$ and $T$ denote the number of input and output neurons, respectively. Without further constraints, a positive term always leads to infinite weight growth and a negative term to weight decay. Terms $\pm Q_{CC}$ simplify to $\pm Q_{CC} = \pm D_{\rho\rho}D_{\tau\tau}\bigl(\sum_{\rho'\tau'}w_{\rho'\tau'}\bigr)^2$ and depend only on the sum of weights. Thus, neither term has any effect under constraints A or B. Term $+Q_{CG}$ takes its maximum value under constraint A if all links terminate on one output neuron. The map has the tendency to collapse. This is because the lateral connections in the output layer are stronger for smaller distances and maximal for zero distance between connected neurons. Under the constraint $\sum_{\tau'}w_{\rho\tau'} \le 1$, $\sum_{\rho'}w_{\rho'\tau} \le 1$, for instance, the resulting map connects the input layer to a region in the output layer that is of the size of the input layer, even if the output layer is much larger.
No topography is taken into account, because $D_{\rho\rho'}$ is constant and does not differentiate between different input neurons. Thus, this term has no effect under constraint B.

Term $-Q_{CG}$ has the opposite effect of $+Q_{CG}$. Consider the induced growth term $\dot w_{\tau\rho} = -D_{\rho\rho'}\sum_{\tau'}D_{\tau\tau'}\sum_{\rho'}w_{\tau'\rho'}$. This is a convolution of $D_{\tau\tau'}$ with $\sum_{\rho'}w_{\tau'\rho'}$ and induces the largest decay in regions where the weighted sum over terminating links is maximal. A stable solution would require equal decay for all weights, because constraint A can compensate only for equal decay. Thus, the convolution of $D_{\tau\tau'}$ with $\sum_{\rho'}w_{\tau'\rho'}$ must be a constant. Since $D_{\tau\tau'}$ is a gaussian, this is possible only if $\sum_{\rho'}w_{\tau'\rho'}$ is a constant, as can easily be seen in Fourier space. Thus, the map expands over the output layer, and each output neuron receives the same sum of weights. Constraint A could be substituted by a constant growth term $L$, in which case the expansion effect could be obtained without any explicit constraint. As $+Q_{CG}$, this term has no effect under constraint B.

Term $+Q_{GG}$ takes its maximum value under constraint A if all but one weight are zero. The map collapses on the input and the output layer. Under constraint B, the map becomes topographic, because links that originate from neighboring neurons (high $D_{\rho\rho'}$ value) favorably terminate on neighboring neurons (high $D_{\tau\tau'}$ value). A more rigorous argument would require a definition of topography, but as argued in section 6.7, the term $+Q_{GG}$ can be directly taken as a generalized measure for topography.

Term $-Q_{GG}$ has the opposite effect of $+Q_{GG}$. Thus, under constraint A it leads to a map that is expanded over input and output layer. In addition, the map becomes antitopographic. Further analytical or numerical investigations are required to show whether the expansion is as even as for the term $-Q_{CG}$ and how an antitopographic map may look. Constraint B also leads to an antitopographic map.
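The claim that $+Q_{GG}$ measures topography can be illustrated numerically: for one-to-one maps with unit weights, a topographic assignment yields a larger value of $Q$ than a scrambled one. A minimal sketch (Python/NumPy; the gaussian width and layer size are arbitrary choices made for this example):

```python
import numpy as np

# Q = 1/2 sum_i,j w_i D_ij w_j with separable D = D_tau,tau' * D_rho,rho',
# both gaussian (term +Q_GG). Compare a topographic and a scrambled map.
n = 10
pos = np.arange(n)
G = np.exp(-(pos[:, None] - pos[None, :])**2 / 4.0)   # gaussian lateral kernel

def Q(assign):   # assign[rho] = tau(rho); weights are 1 on these links
    return 0.5 * sum(G[assign[r], assign[rp]] * G[r, rp]
                     for r in range(n) for rp in range(n))

topographic = np.arange(n)
scrambled = np.random.default_rng(0).permutation(n)
print(Q(topographic), Q(scrambled))   # topographic map yields the larger Q
```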
6.4 Equivalent Models. The effect of coordinate transformations has been considered so far only for single growth terms and normalization rules. Coordinate transformations can be used to generate different models that are equivalent in terms of their constrained optimization problem. Consider the system in Konen and von der Malsburg (1993). Its objective function and constraint function are $Q$ and $N_\ge$,

$$H(w) = \frac{1}{2}\sum_{ij}w_i D_{ij}w_j, \qquad g_n(w) = 1 - \sum_{j\in I_n}\frac{w_j}{\alpha_j} = 0, \tag{6.1}$$

which must be evaluated under the coordinate transformation $C^{\alpha w}$ to induce the original weight dynamics $Q^{\alpha w}$ and $N^{\alpha w}_\ge$,

$$\dot w_i = \alpha_i w_i\sum_j D_{ij}w_j, \qquad w_i = \frac{\tilde w_i}{\sum_{j\in I_n}\tilde w_j/\alpha_j}. \tag{6.2}$$
If evaluated directly (i.e., under the coordinate transformation $C^1$), one would obtain

$$\dot w_i = \sum_j D_{ij}w_j, \qquad w_i = \tilde w_i + \frac{1}{\sum_{j\in I_n}\alpha_j^{-2}}\left(1 - \sum_{j\in I_n}\frac{\tilde w_j}{\alpha_j}\right)\frac{1}{\alpha_i}. \tag{6.3}$$
As argued in section 5.2.4, an additional limitation constraint $I^1_>$ (or $I^1_\ge$) has to be added to this system to account for the limitation constraint implicitly introduced by the coordinate transformation $C^{\alpha w}$ for the dynamics above (see equation 6.2). It follows from equation 4.8 that the flow fields of the weight dynamics in equations 6.2 and 6.3 differ, but since $dw_i/dv_i \ne 0$ for positive weights, the fixed points are the same. That means that the resulting maps to which the two systems converge, possibly from different initial states, are the same. In this sense, these two dynamics are equivalent. This also holds for other coordinate transformations within the defined region as long as $dw_i/dv_i$ is finite ($dw_i/dv_i = 0$ may introduce additional fixed points). Thus, this method of generating equivalent models makes it possible to abstract the objective function from the dynamics. Different equivalent dynamics may have different convergence properties, their attractor basins may differ, and some regions in state space may not be reachable under a particular coordinate transformation. In any case, within the reachable state space, the fixed points are the same. Thus, coordinate transformations make it possible to optimize the dynamics without changing its objective function. Normalization rules derived by different methods can substitute for each other without changing the qualitative behavior of a system. For instance, $I_=$ can be replaced by $I_\approx$, or $N_\ge$ can be replaced by $N_>$ under any coordinate transformation. These replacements will also generate equivalent systems in a practical sense.

6.5 Dynamic Link Matching. In the previous section, the similarity values $\alpha_i$ entered the weight dynamics in two places. In equation 6.2, the differential effect of $\alpha_i$ enters only the growth rule, while in equation 6.3, it enters only the normalization rule. Growth and normalization rules can, to some extent, be interchangeably used to incorporate feature information in dynamic link matching. However, the objective function (see equation 6.1) shows that the similarity values are introduced through the constraints and that they are transferred to the growth rule only by the coordinate transformation $C^{\alpha w}$. Similarity values can enter the growth rule more directly through the linear term $L$.
An alternative objective function for dynamic link matching is

$$H(w) = \sum_i\beta_i w_i + \frac{1}{2}\sum_{ij}w_i D_{ij}w_j, \qquad g_n(w) = 1 - \sum_{j\in I_n}w_j = 0, \tag{6.4}$$
with $\beta_i = \alpha_i$. The first term now directly favors links with high similarity values. This may be advantageous because it allows better control over the influence of the topography versus the feature similarity term. Furthermore, this objective function is more closely related to the similarity function of elastic graph matching in Lades et al. (1993), which has been developed as an algorithmic abstraction of dynamic link matching (see section 6.7).

6.6 Soft versus Hard Competitive Normalization. Miller and MacKay (1994) have analyzed the role of normalization rules for neural map formation. They consider a linear Hebbian growth rule $Q^1$ and investigate the dynamics under a subtractive normalization rule $N^1_=$ ($S1$ in their notation) and two types of multiplicative normalization rules, $N^w_=$ and $Z^1_=$ ($M1$ and $M2$ in their notation, respectively). They show that when considering an isolated output neuron with the multiplicative normalization rules, the weight vector tends to the principal eigenvector of the matrix $D$, which means that many weights can maintain some finite value. Under the subtractive normalization rule, a winner-take-all behavior occurs, and the weight vector tends to saturate, with each single weight having either its minimal or maximal value, producing a more compact receptive field. If no upper bound is imposed on individual weights, only one weight survives, corresponding to a point receptive field. von der Malsburg and Willshaw (1981) have performed a similar, though less comprehensive, analysis using a different approach. Instead of modifying the normalization rule, they considered different growth rules with the same multiplicative normalization rule $N^w_\simeq$. They also found two qualitatively different behaviors: a highly competitive case in which only one link survives (or several, if single weights are limited in growth by individual bounds) (case $\mu = 1$ or $\mu = 2$ in their notation) and a less competitive case in which each weight is eventually proportional to the correlation between pre- and postsynaptic neuron (case $\mu = 0$). Hence, one can either change the normalization rule and keep the growth rule or, vice versa, modify the growth rule and keep the normalization rule the same. Either choice generates the two different behaviors. As shown above, by changing both the growth and normalization rules consistently by a coordinate transformation, it is possible to obtain two different weight dynamics with qualitatively the same behavior. More precisely, the system $(Q^w, N^w)$ is equivalent to $(Q^1, N^1, I^1)$ and has the same fixed points; the former uses a multiplicative normalization rule, and the latter uses a subtractive one. This also explains why changing the growth rule or changing the normalization rule can be equivalent.
It may therefore be misleading to refer to the different cases by the specific normalization rules (subtractive versus multiplicative), because that is valid only for the linear Hebbian growth rule $Q^1$. We suggest using a more generally applicable nomenclature that refers to the different behaviors rather than the specific mathematical formulation. Following the terminology of Nowlan (1990) in a similar context, the term hard competitive normalization could be used to denote the case where only one link survives (or a set of saturated links, which are limited by upper bounds); the term soft competitive normalization could be used to denote the case where each link has some strength proportional to its fitness.
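A minimal simulation of the isolated-output-neuron case analyzed by Miller and MacKay (1994) illustrates the two behaviors. This is a simplified sketch, not their exact procedure: the subtractive constraint is enforced here by an equal correction over all weights followed by clipping at zero, and all parameter values are made up.

```python
import numpy as np

# One output neuron, linear Hebbian growth Q^1 under multiplicative versus
# subtractive normalization (cf. Miller & MacKay, 1994).
n = 5
x = np.linspace(-2.0, 2.0, n)
D = np.exp(-(x[:, None] - x[None, :])**2)      # positive correlation matrix
dt = 0.01
rng = np.random.default_rng(1)

w_mul = 1.0 / n + 0.01 * rng.random(n)
w_sub = w_mul.copy()
for _ in range(5000):
    w_mul = w_mul + dt * D @ w_mul
    w_mul /= w_mul.sum()                       # multiplicative rule
    w_sub = w_sub + dt * D @ w_sub
    w_sub += (1.0 - w_sub.sum()) / n           # subtractive (equal) correction
    w_sub = np.maximum(w_sub, 0.0)             # keep weights nonnegative

print(np.round(w_mul, 3))  # soft: graded profile ~ principal eigenvector of D
print(np.round(w_sub, 3))  # hard: weight mass concentrates on one weight
```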
6.7 Related Objective Functions. Objective functions also provide means for comparing weight dynamics with other algorithms or dynamics of a different origin for which an objective function exists. First, maximizing the objective functions $L$ and $Q$ under linear constraints $I$ and $N$ is the quadratic programming problem, and finding an optimal one-to-one mapping between two layers of the same size for objective function $Q$ is the quadratic assignment problem. These problems are known to be NP-complete. However, there is a large literature on algorithms that efficiently solve special cases or find good approximate solutions in polynomial time (e.g., Horst, Pardalos, & Thoai, 1995). Many related objective functions are defined only for maps for which each input neuron terminates on exactly one output neuron with weight 1, which makes the index $\tau = \tau(\rho)$ a function of index $\rho$. An objective function of this kind may have the form

$$H = \sum_{\rho\rho'} G_{\tau\rho\tau'\rho'}, \tag{6.5}$$
where $G$ encodes how well a pair of links from $\rho$ to $\tau(\rho)$ and from $\rho'$ to $\tau'(\rho')$ preserves topography. A pair of parallel links, for instance, would yield high $G$ values, while others would yield lower values. Now define a particular family of weights $\bar w$ that realize one-to-one connectivities:

$$\bar w_{\tau\rho} = \begin{cases} 1 & \text{if } \tau = \tau(\rho)\\ 0 & \text{otherwise.}\end{cases} \tag{6.6}$$
$\bar w$ is a subset of $w$ with $\bar w_{\tau\rho} \in \{0, 1\}$ as opposed to $w_{\tau\rho} \in [0, 1]$. It indicates that an objective function was originally defined for a one-to-one map rather than the more general case of an all-to-all connectivity. Then objective functions of one-to-one maps can be written as

$$H(\bar w) = \sum_{\tau\rho\tau'\rho'}\bar w_{\tau\rho}\, G_{\tau\rho\tau'\rho'}\,\bar w_{\tau'\rho'} = \sum_{ij}\bar w_i G_{ij}\bar w_j, \tag{6.7}$$
with $i = \{\rho, \tau\}$, $j = \{\rho', \tau'\}$ as defined above. Simply replacing $\bar w$ by $w$ then yields a generalization of the original objective function to all-to-all connectivities. Goodhill, Finch, and Sejnowski (1996) have compared 10 different objective functions for topographic maps and have proposed another, the $C$ measure. They show that for the case of an equal number of neurons in the input and the output layer, most other objective functions can either be reduced to the $C$ measure, or they represent a closely related objective function. This suggests that the $C$ measure is a good unifying measure for topography. The $C$ measure is equivalent to our objective function $Q$ with $\bar w$ instead of $w$. Adapted to the notation of this article, the $C$ measure has the form

$$C(\bar w) = \sum_{ij}\bar w_i G_{ij}\bar w_j, \tag{6.8}$$
with a separable $G_{ij}$, that is, $G_{ij} = G_{\rho\tau\rho'\tau'} = G_{\tau\tau'}G_{\rho\rho'}$. Thus, the objective function $Q$ is the typical term for topographic maps in other contexts as well. Elastic graph matching is an algorithmic counterpart to dynamic link matching and has been used for applications such as object and face recognition (Lades et al., 1993). It is based on a similarity function that in its simplest version is

$$H(\bar w) = \sum_i\beta_i\bar w_i + \frac{1}{2}\sum_{ij}\bar w_i G_{ij}\bar w_j, \tag{6.9}$$
where $G_{ij} = -\bigl[(p_\rho - p_{\rho'}) - (p_\tau - p_{\tau'})\bigr]^2$, and $p_\rho$ and $p_\tau$ are two-dimensional position vectors in the image plane. This similarity function corresponds formally to the objective function in equation 6.4. The main difference between these two functions is hidden in $G$ and $D$. The latter ought to be separable into two factors $D_{\rho\tau\rho'\tau'} = D_{\rho\rho'}D_{\tau\tau'}$, while the former is clearly not. $G$ actually favors a metric map, which tends to preserve not only neighborhood relations but also distances, whereas with $D$, the maps always tend to collapse.

6.8 Self-Organizing Map Algorithm. Models of the self-organizing map (SOM) algorithm can be high-dimensional or low-dimensional, and two different learning rules, which we have called weight dynamics, are commonly used. The validity of the probabilistic blob model for the high-dimensional models is discussed in the appendix. A classification of the high-dimensional model by Obermayer et al. (1990) is given in Table 2. The low-dimensional models do not fall into the class of one-to-one mappings considered in the previous section, because the input layer is represented as a continuous space and not as a discrete set of neurons.
One learning rule for the high-dimensional SOM algorithm is given by

$$\tilde w_{\tau\rho}(t) = w_{\tau\rho}(t-1) + \epsilon B_{\tau\tau_0}B_{\rho\rho_0}, \tag{6.10}$$
$$w_{\tau\rho}(t) = \frac{\tilde w_{\tau\rho}(t)}{\sqrt{\sum_{\rho'}\tilde w^2_{\tau\rho'}(t)}}, \tag{6.11}$$
as used, for example, in Obermayer et al. (1990). $B_{\tau\tau_0}$ denotes the neighborhood function (commonly indicated by $h$), and $B_{\rho\rho_0}$ denotes the stimulus pattern (sometimes indicated by $x$) with index $\rho_0$. $B_{\rho\rho_0}$ does not need to have a blob shape, so that $\rho_0$ may be an arbitrary index. Output neuron $\tau_0$ is the winner neuron in response to stimulus pattern $\rho_0$. This learning rule is a consistent combination of growth rule $Q^1$ and normalization rule $Z^1_=$, and an objective function exists, which is a good approximation to the extent that the probabilistic blob model is valid. The second type of learning rule is given by

$$w_{\tau\rho}(t+1) = w_{\tau\rho}(t) + \epsilon B_{\tau\tau_0}\bigl(B_{\rho\rho_0} - w_{\tau\rho}(t)\bigr), \tag{6.12}$$
as used, for example, in Bauer, Brockmann, and Geisel (1997). For this learning rule, the weights and the input stimuli are assumed to be sum normalized: $\sum_\rho w_{\tau\rho} = 1$ and $\sum_\rho B_{\rho\rho_0} = 1$. For small $\epsilon$ this learning rule is equivalent to

$$\tilde w_{\tau\rho}(t) = w_{\tau\rho}(t-1) + \epsilon B_{\tau\tau_0}B_{\rho\rho_0}, \tag{6.13}$$
$$w_{\tau\rho}(t) = \frac{\tilde w_{\tau\rho}(t)}{\sum_{\rho'}\tilde w_{\tau\rho'}(t)}, \tag{6.14}$$
which shows that it is a combination of growth rule $Q^1$ and normalization rule $N^w_=$. Thus, this system is inconsistent, and to formulate it within our constrained optimization framework, $N^w_=$ would have to be approximated by $Z^1_=$, which leads back to the learning rule in equations 6.10 and 6.11.
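The first-order equivalence between the direct rule of equation 6.12 and the growth-plus-normalization form of equations 6.13 and 6.14 is easy to check numerically. A minimal sketch (Python/NumPy; random example values, with the output blob value collapsed to a single scalar for one neuron):

```python
import numpy as np

# For small eps, the direct rule (eq. 6.12) agrees with growth followed by
# multiplicative normalization (eqs. 6.13 and 6.14) up to O(eps^2) terms.
rng = np.random.default_rng(3)
eps = 1e-4
w = rng.random(20); w /= w.sum()                   # sum-normalized weights
b_in = rng.random(20); b_in /= b_in.sum()          # sum-normalized input blob
b_out = 0.7                                        # neighborhood value B_tau,tau0

w_direct = w + eps * b_out * (b_in - w)            # eq. 6.12
w_grown = w + eps * b_out * b_in                   # eq. 6.13
w_norm = w_grown / w_grown.sum()                   # eq. 6.14
print(np.max(np.abs(w_direct - w_norm)))           # -> of order eps**2
```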
There are two ways of going from these high-dimensional models to the low-dimensional models. The first is simply to use fewer input neurons (e.g., two). A low-dimensional input vector is then represented by the activities of these few neurons. However, since the low-dimensional input vectors are usually not normalized to homogeneous mean activity of the input neurons and since the receptive and projective fields of the neurons do not codevelop in a homogeneous way, the probabilistic blob model is usually not valid. A second way of going from a high-dimensional model to a low-dimensional model is by considering the low-dimensional input vectors and weight vectors as abstract representatives of the high-dimensional ones (Ritter, Martinetz, & Schulten, 1991; Behrmann, 1993). Consider, for example, the weight dynamics in equation 6.12 and a two-dimensional input layer. Let $p_\rho$ be a position vector of input neuron $\rho$. The center of the receptive field of neuron $\tau$ can be defined as

$$m_\tau(w) = \sum_\rho p_\rho w_{\tau\rho}, \tag{6.15}$$

and the center of the input blob can be defined similarly,

$$x(B_{\rho_0}) = \sum_\rho p_\rho B_{\rho\rho_0}. \tag{6.16}$$
Notice that the input blobs as well as the weights are normalized, that is, $\sum_\rho B_{\rho\rho_0} = 1$ and $\sum_\rho w_{\tau\rho} = 1$. Using these definitions and given a pair of blobs at locations $\rho_0$ and $\tau_0$, the high-dimensional learning rule (see equation 6.12) yields the low-dimensional learning rule

$$m_\tau(w(t+1)) = \sum_\rho p_\rho\bigl(w_{\tau\rho}(t) + \epsilon B_{\tau\tau_0}(B_{\rho\rho_0} - w_{\tau\rho}(t))\bigr) \tag{6.17}$$
$$= m_\tau(w(t)) + \epsilon B_{\tau\tau_0}\bigl(x(B_{\rho_0}) - m_\tau(w(t))\bigr) \tag{6.18}$$
$$\Longleftrightarrow\quad m_\tau(t+1) = m_\tau(t) + \epsilon B_{\tau\tau_0}\bigl(x_{\rho_0} - m_\tau(t)\bigr). \tag{6.19}$$
One can first calculate the centers of the receptive fields of the high-dimensional model and then apply the low-dimensional learning rule, or one can first apply the high-dimensional learning rule and then calculate the centers of the receptive fields; the result is the same. Notice that the low-dimensional learning rule is even formally equivalent to the high-dimensional one and that it is the rule commonly used in low-dimensional models (Kohonen, 1990). Even though the high- and the low-dimensional learning rules are equivalent for a given pair of blobs, the overall behavior of the models is not. This is because the positioning of the output blobs is different in the two models (Behrmann, 1993). It is clear that many different high-dimensional weight configurations having different output blob positioning can lead to the same low-dimensional weight configuration. However, for a high-dimensional model that self-organizes a topographic map with point receptive fields, the positioning may be similar for the high- and the low-dimensional models, so that the stable maps may be similar as well. These considerations show that only the high-dimensional model in equations 6.10 and 6.11 can be consistently described within our constrained optimization framework. The high-dimensional model of equation 6.12 is inconsistent. The probabilistic blob model in general is not applicable to low-dimensional models, because some assumptions required for its derivation are not valid. The simple relation between the high- and the low-dimensional model sketched above holds only for the learning step but not for the blob positioning, though the positioning and thus the resulting maps may be very similar for topographic maps with point receptive fields.
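The commutation of the learning step with the center-of-mass projection (equations 6.17 through 6.19) can be verified directly. A minimal sketch (Python/NumPy; layer sizes, blob shapes, and positions are arbitrary choices made for this example):

```python
import numpy as np

# Check eqs. 6.17-6.19: updating the high-dimensional weights (eq. 6.12) and
# then computing receptive-field centers equals updating the centers directly.
rng = np.random.default_rng(2)
R, T, eps = 20, 5, 0.1
p = np.linspace(0.0, 1.0, R)                             # input-neuron positions
w = rng.random((T, R)); w /= w.sum(1, keepdims=True)     # sum-normalized weights
B_in = np.exp(-(p - 0.3)**2 / 0.01); B_in /= B_in.sum()  # normalized input blob
B_out = np.array([0.1, 0.5, 1.0, 0.5, 0.1])              # output blob, winner tau0 = 2

m_before = w @ p                                         # eq. 6.15
w_new = w + eps * B_out[:, None] * (B_in[None, :] - w)   # eq. 6.12
m_direct = m_before + eps * B_out * (B_in @ p - m_before)  # eq. 6.19
print(np.allclose(w_new @ p, m_direct))                  # -> True
```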
7 Conclusions and Future Perspectives

The results presented here can be summarized:

• A probabilistic nonlinear blob model can behave like a linear correlation model under fairly general conditions (see section 2.1 and the appendix). This clarifies the relationship between deterministic nonlinear blob models and linear correlation models and provides an approximation of the former by the latter.

• Coordinate transformations can transform dynamics with curl into curl-free dynamics, allowing the otherwise impossible formulation of an objective function (see section 4). A similar effect exists for normalization rules: coordinate transformations can transform nonorthogonal normalization rules into orthogonal ones, allowing the normalization rule to be formulated as a constraint (see section 5.1).

• Growth rules and normalization rules must have a special relationship in order to make a formulation of the system dynamics as a constrained optimization problem possible: the growth rule must be a gradient flow, and the normalization rules must be orthogonal under the same coordinate transformation (see section 5.1).

• Constraints can be enforced by various types of normalization rules (see section 5.2), and they can even be introduced implicitly by coordinate transformations (see section 5.2.4) or by the activity dynamics (see section A.2).

• Many all-to-all connected models from the literature can be classified within our constrained optimization framework on the basis of only four terms: L, Q, I, and N (Z) (see section 6.2). The linear term L has rarely been used, but it can have a specific function that may be useful in future models (see section 6.5).

• Models may differ considerably in their weight dynamics and still solve the same optimization problem. This can be revealed by coordinate transformations and by comparing the different but possibly equivalent types of normalization rules (see section 6.4). Coordinate transformations in particular make it possible to optimize the dynamics without changing the stable fixed points.

• The constrained optimization framework provides a convenient formalism for analyzing functional aspects of the models (see sections 6.3, 6.5, and 6.6).

• The constrained optimization framework for all-to-all connected models presented here is closely related to approaches for finding optimal one-to-one maps (see section 6.7) but is not easily adapted to the self-organizing map algorithm (see section 6.8).
• Models of neural map formation formulated as constrained optimization problems provide a unifying framework. It abstracts from arbitrary differences in the design of models and leaves only those differences that are likely to be crucial for the different structures that emerge by self-organization.

It is important to note that our constrained optimization framework is unifying in the sense that it provides a canonical formulation independent of most arbitrary design decisions, for example, those due to different coordinate transformations or different types of normalization rules. This does not mean that most models are actually equivalent. But with the canonical formulation of the models as constrained optimization problems, it should be possible to focus on the crucial differences and to understand better what the essentials of neural map formation are.

Based on the constrained optimization framework presented here, a next step would be to consider specific architectures with particular effective lateral connectivities and to investigate the structures that emerge. The role of parameters and effective lateral connectivities might be investigated analytically for a variety of models by means of objective functions, similar to the approach sketched in section 6.3 or the one taken in MacKay and Miller (1990).

We have considered here only three levels of abstraction: detailed neural dynamics, abstract weight dynamics, and constrained optimization. There are even higher levels of abstraction, and the relationship between our constrained optimization framework and these more abstract models should be explored. For example, in section 6.7 our objective functions were compared with other objective functions defined only for one-to-one connectivities. Another possible link is with Bienenstock and von der Malsburg (1987) and Tanaka (1990), who have proposed spin models for neural map formation. An interesting approach is that taken by Linsker (1986), who analyzed the receptive fields of the output neurons, which were oriented edge filters of arbitrary orientation. He derived an energy function to evaluate how the different orientations would be arranged in the output layer due to lateral interactions. The only variables of this energy function were the orientations of the receptive fields, an abstraction from the connectivity. Similar models were proposed earlier in Swindale (1980), though not derived from a receptive field model, and more recently in Tanaka (1991). These approaches and their relationships to our constrained optimization framework need to be investigated more systematically.

A neural map formation model of Amari (1980) could not be formulated within the constrained optimization framework presented here (cf. section 6.2). The weight growth in this model is limited by weight decay rather than by explicit normalization rules, which is possible because the blob dynamics provides only limited correlation values even if the weights grow large. This model is particularly elegant with respect to the
way it indirectly introduces constraints and should be investigated further. Our discussion in section 6.3 indicates that the system L + Q might also show map expansion and weight limitation without any explicit constraints, but further analysis is needed to confirm this.

The objective functions listed in Table 1 have a tendency to produce either collapsing or expanding maps. It is unlikely that the terms can be counterbalanced such that they tend to preserve distances directly, independent of normalization rules and the size of the layers, as does the algorithmic objective function in equation 6.9. A solution to this problem might be found by examining propagating activity patterns in the input as well as the output layer, such as traveling waves (Triesch, 1995) or running blobs (Wiskott & von der Malsburg, 1996). Waves and blobs of activity have been observed in the developing retina (Meister, Wong, Baylor, & Shatz, 1991). If the waves or blobs have the same intrinsic velocity in the two layers, they would tend to generate metric maps, regardless of the scaling factor induced by the normalization rules. It would be interesting to investigate this idea further and to derive correlations for this class of models.

Another limitation of the framework discussed here is that it is confined to second-order correlations. As von der Malsburg (1995) has pointed out, this is appropriate only for a subset of phenomena of neural map formation, such as retinotopy and ocular dominance. Although orientation tuning can arise by spontaneous symmetry breaking (e.g., Linsker, 1986), a full understanding of the self-organization of orientation selectivity and other phenomena may require taking higher-order correlations into account. It would be interesting as a next step to consider third-order terms in the objective function and the conditions under which they can be derived from detailed neural dynamics. There may also be an interesting relationship to recent advances in algorithms for independent component analysis (Bell & Sejnowski, 1995), which can be derived from a maximum entropy method and are dominated by higher-order correlations.

Finally, it may be interesting to investigate the extent to which the techniques used in the analysis presented here can be applied to other types of neural dynamics, such as learning rules. The existence of objective functions for dynamics with curl may make it possible to formulate more learning rules within the constrained optimization framework, which could lead to new insights. Optimizing the dynamics of a learning rule without changing the set of stable fixed points may be an interesting application for coordinate transformations.

Appendix: Probabilistic Blob Model

A.1 Noise Model. Consider the activity model of Obermayer et al. (1990) as an abstraction of the neural activity dynamics in section 2.1 (see equations 2.1 and 2.2). Obermayer et al. use a high-dimensional version of the self-organizing map algorithm (Kohonen, 1982). A blob B_{ρ′ρ₀} is located at
a random position ρ₀ in the input layer, and the input i_{τ′}(ρ₀) received by the output neurons is calculated as in equation 2.7. A blob B̄_{τ′τ₀} in the output layer is located at the position τ₀ of highest input, that is, i_{τ₀}(ρ₀) = max_{τ′} i_{τ′}(ρ₀). Only the latter step differs in its outcome from the dynamics in section 2, the maximal input instead of the maximal overlap determining the location of the output blob. The transition to the probabilistic blob location can be done by assuming that the blob B̄_{τ′τ₀} in the output layer is located at τ₀ with probability

p(τ₀|ρ₀) = i_{τ₀}(ρ₀) = Σ_{ρ′} w_{τ₀ρ′} B_{ρ′ρ₀}.   (A.1)
For the following considerations, the same normalization assumptions as in section 2.1 are made, which leads to Σ_{τ′} i_{τ′}(ρ₀) = 1 and Σ_{τ₀} p(τ₀|ρ₀) = 1 and justifies the interpretation of p(τ₀|ρ₀) as a probability. The effect of different normalization rules, like those used by Obermayer et al. (1990), is discussed in the next section.

The probabilistic blob location can be achieved by multiplicative noise η_τ with the cumulative density function f(η) = exp(−1/η), which leads to a modified input l_τ = η_τ i_τ with the cumulative density function

f_τ(l_τ) = exp( −i_τ(ρ₀)/l_τ ),   (A.2)

and the probability density function

p_τ(l_τ) = ∂f_τ/∂l_τ = ( i_τ(ρ₀)/l_τ² ) exp( −i_τ(ρ₀)/l_τ ).   (A.3)
Notice that the noise is different for each output neuron but always drawn from the same distribution. The probability of neuron τ₀ having a larger input l_{τ₀} than all other neurons τ′, that is, the probability of the output blob being located at τ₀, is

p(τ₀|ρ₀) = p( l_{τ₀} > l_{τ′} ∀ τ′ ≠ τ₀ )   (A.4)
         = ∫₀^∞ p_{τ₀}(l_{τ₀}) Π_{τ′≠τ₀} f_{τ′}(l_{τ₀}) dl_{τ₀}   (A.5)
         = ∫₀^∞ ( i_{τ₀}(ρ₀)/l_{τ₀}² ) exp( −(1/l_{τ₀}) Σ_{τ′} i_{τ′}(ρ₀) ) dl_{τ₀}   (A.6)
         = i_{τ₀}(ρ₀) / Σ_{τ′} i_{τ′}(ρ₀)   (A.7)
         = i_{τ₀}(ρ₀)   ( since Σ_{τ′} i_{τ′}(ρ₀) = 1 ),   (A.8)

which is the desired result. Thus, the model by Obermayer et al. (1990) can be modified by multiplicative noise to yield the probabilistic blob location behavior.

A problem is that the modified input l_τ has an infinite mean value, but this can be corrected by consistently transforming the cumulative density functions by the substitution l_τ = k_τ², yielding

f_τ(k_τ) = exp( −i_τ(ρ₀)/k_τ² )   (A.9)

for the new modified inputs k_τ, the means of which are finite. Due to the nonlinear transformation l_τ = k_τ², the modified inputs k_τ are no longer a product of the original input i_τ with noise whose distribution is the same for all neurons; rather, each input i_τ generates a modified input k_τ with a nonlinearly distorted version of the cumulative density function in equation A.2.
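The result in equations A.4 through A.8 can also be checked by direct simulation. The following sketch is our own illustration (not from the original article; the input values are arbitrary): noise with cumulative density exp(−1/η) is drawn by inverse-transform sampling, and the winner frequencies approach the inputs themselves.

```python
import numpy as np

rng = np.random.default_rng(1)
i = np.array([0.5, 0.3, 0.15, 0.05])    # inputs i_tau(rho0), normalized to sum 1

n = 200_000
u = rng.random((n, i.size))
eta = -1.0 / np.log(u)                  # noise with cumulative density exp(-1/eta)
winners = np.argmax(eta * i, axis=1)    # blob goes to the neuron with largest l_tau

freq = np.bincount(winners, minlength=i.size) / n
print(np.round(freq, 3))                # approaches i, as equations A.4-A.8 predict
```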
The probability for a particular combination of blob locations is

p(τ₀, ρ₀) = p(τ₀|ρ₀) p(ρ₀) = ( Σ_{ρ′} w_{τ₀ρ′} B_{ρ′ρ₀} ) (1/R),   (A.10)
and the correlation between two neurons, defined as the average product of their activities, is

⟨a_τ a_ρ⟩ = Σ_{τ₀ρ₀} p(τ₀, ρ₀) B̄_{ττ₀} B_{ρρ₀}   (A.11)
          = Σ_{τ₀ρ₀ρ′} w_{τ₀ρ′} B_{ρ′ρ₀} (1/R) B̄_{ττ₀} B_{ρρ₀}   (A.12)
          = (1/R) Σ_{τ₀ρ′} B̄_{ττ₀} w_{τ₀ρ′} ( Σ_{ρ₀} B_{ρ′ρ₀} B_{ρρ₀} )
          = (1/R) Σ_{τ₀ρ′} B̄_{ττ₀} w_{τ₀ρ′} B̄_{ρ′ρ},   (A.13)
with
B̄_{ρ′ρ} = Σ_{ρ₀} B_{ρ′ρ₀} B_{ρρ₀},   (A.14)
where the brackets ⟨·⟩ indicate the ensemble average over a large number of blob presentations. This is equivalent to equation 2.13 if B̄_{τ′τ} = Σ_{τ₀} B_{τ′τ₀} B_{ττ₀}. Thus, the two probabilistic dynamics are equivalent, though the blobs in the output layer must be different.

A.2 Different Normalization Rules. The derivation of correlations in the probabilistic blob model given above assumes explicit presynaptic normalization of the form Σ_{τ′} w_{τ′ρ′} = 1. This assumption is not valid for some models that use only postsynaptic normalization (e.g., von der Malsburg, 1973). The model by Obermayer et al. (1990) postsynaptically normalizes the square sum, Σ_{ρ′} w²_{τ′ρ′} = 1, instead of the sum, which may make the applicability of the probabilistic blob model even more questionable.
To investigate the effect of these different normalization rules on the probabilistic blob model, assume that the projective (or receptive) fields of the input (or output) neurons codevelop in such a way that, at any given moment, all neurons in a layer have the same weight histogram. Neuron ρ, for instance, would have the weight histogram w_{τ′ρ} taken over τ′, and it would be the same as those of the other neurons ρ′. Two neurons with the same weight histogram have the same number of nonzero weights, and the square sums over their weights differ from the sums by the same factor c, for example, Σ_{τ′} w²_{τ′ρ′} = c Σ_{τ′} w_{τ′ρ′} for all ρ′, with c ≤ 1. The weight histogram, and with it the factor c, may change over time. For instance, if point receptive fields develop from an initial all-to-all connectivity, the histogram has a single peak at 1/T in the beginning and has a peak at 0 and one entry at 1 at the end of the self-organization process, and c(t) grows from 1/T up to 1, where T is the number of output neurons.

Consider first the effect of the square sum normalization under the assumption of homogeneous codevelopment of receptive and projective fields. The square sum normalization differs from the sum normalization by a factor c(t) common to all neurons in the layer. Since the nonlinear blob model is insensitive to such a factor, the derived correlations and the learning rule are off by this factor c. Since this factor is common to all weights, the trajectories of the weight dynamics are identical, though the time scales differ by c between the two types of normalization.

Consider now the effect of pure postsynaptic normalization under the assumption of homogeneous codevelopment of receptive and projective fields. Assume a pair of blobs is located at ρ₀ and τ₀. With a linear growth rule, the sum over the weights originating from an input neuron would change according to

Ẇ_ρ = Σ_τ ẇ_{τρ} = Σ_τ B_{ττ₀} B_{ρρ₀} = B_{ρρ₀},   (A.15)
since the blob B_{ττ₀} is normalized to one. Averaging over all input blob positions yields an average change of

⟨Ẇ_ρ⟩ = (1/R) Σ_{ρ₀} B_{ρρ₀} = 1/R,   (A.16)
since we assume a homogeneous average activity in the input layer, that is, Σ_{ρ₀} B_{ρρ₀} = 1. A similar expression follows for the postsynaptic sum:

⟨Ẇ_τ⟩ = Σ_{ρ₀τ₀} p(τ₀, ρ₀) Σ_ρ B_{ττ₀} B_{ρρ₀}   (A.17)
      = Σ_{ρ₀τ₀} ( (1/R) Σ_{τ′ρ′} B_{τ′τ₀} w_{τ′ρ′} B_{ρ′ρ₀} ) Σ_ρ B_{ττ₀} B_{ρρ₀}   (A.18)
      = (1/R) Σ_{τ₀} B_{ττ₀} Σ_{τ′} B_{τ′τ₀} Σ_{ρ′} w_{τ′ρ′} Σ_{ρ₀} B_{ρ′ρ₀} Σ_ρ B_{ρρ₀}   (A.19)
      = 1/T,   (A.20)
714
Laurenz Wiskott and Terrence Sejnowski
simulations or more detailed analytical considerations are needed to verify the assumption for any given concrete model. Acknowledgments We are grateful to Geoffrey J. Goodhill, Thomas Maurer, Jozsef Fiser, and two anonymous referees for carefully reading the manuscript and offering useful comments. L. W. has been supported by a Feodor-Lynen fellowship by the Alexander von Humboldt-Foundation, Bonn, Germany. References Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biol. Cybern., 27, 77–87. Amari, S. (1980). Topographic organization of nerve fields. Bulletin of Mathematical Biology, 42, 339–364. Bauer, H.-U., Brockmann, D., & Geisel, T. (1997). Analysis of ocular dominance pattern formation in a high-dimensional self-organizing-map model. Network: Computation in Neural Systems, 8(1), 17–33. Behrmann, K. (1993). Leistungsuntersuchungen des “Dynamischen Link-Matchings” und Vergleich mit dem Kohonen-Algorithmus (Internal Rep. No. IR-INI 93–05). Bochum: Institut fur ¨ Neuroinformatik, Ruhr-Universit¨at Bochum. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Bienenstock, E., & von der Malsburg, C. (1987). A neural network for invariant pattern recognition. Europhysics Letters, 4(1), 121–126. Dirac, P. A. M. (1996). General theory of relativity. Princeton, NJ: Princeton University Press. Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biological Cybernetics, 34(3), 137–150. Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation, 7, 425–468. Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Physical Review E, 50(4), 3171–3191. Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cybern., 69, 109–118. Goodhill, G. J., Finch, S., & Sejnowski, T. J. (1996). Optimizing cortical mappings. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8, pp. 330–336). Cambridge, MA: MIT Press. H¨aussler, A. F., & von der Malsburg, C. (1983). Development of retinotopic projections—An analytical treatment. J. Theor. Neurobiol., 2, 47–73. Horst, R., Pardalos, P. M., & Thoai, N. V. (1995). Introduction to global optimization. Dordrecht: Kluwer. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Neural Map Formation
715
Kohonen, T. (1990). The self-organizing map. Proc. of the IEEE, 78(9), 1464–1480. Konen, W., Maurer, T., & von der Malsburg, C. (1994). A fast dynamic link matching algorithm for invariant pattern recognition. Neural Networks, 7(6/7), 1019– 1030. Konen, W., & von der Malsburg, C. (1993). Learning to generalize from single examples in the dynamic link architecture. Neural Computation, 5(5), 719–735. Lades, M., Vorbruggen, ¨ J. C., Buhmann, J., Lange, J., von der Malsburg, C., Wurtz, ¨ R. P., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Transactions on Computers, 42(3), 300–311. Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation columns. Ntl. Acad. Sci. USA, 83, 8779–8783. MacKay, D. J. C., & Miller, K. D. (1990). Analysis of Linsker’s simulations of Hebbian rules. Neural Computation, 2, 173–187. Meister, M., Wong, R. O. L., Baylor, D. A., & Shatz, C. J. (1991). Synchronous bursts of action potentials in ganglion cells of the developing mammalian retina. Science, 252, 939–943. Miller, K. D. (1990). Derivation of linear Hebbian equations from nonlinear Hebbian model of synaptic plasticity. Neural Computation, 2, 321–333. Miller, K. D., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615. Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126. Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2, pp. 574– 582). San Mateo, CA: Morgan Kaufmann. Obermayer, K., Ritter, H., & Schulten, K. (1990). Large-scale simulations of selforganizing neural networks on parallel computers: Application to biological modelling. Parallel Computing, 14, 381–404. Ritter, H., Martinetz, T., & Schulten, K. (1991). Neuronale Netze. Reading, MA: Addison-Wesley. Sejnowski, T. J. (1976). On the stochastic dynamics of neuronal interaction. Biol. Cybern., 22, 203–211. Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biology, 4, 303–321. Swindale, N. V. (1980). A model for the formation of ocular domance stripes. Proc. R. Soc. Lond. B, 208, 243–264. Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network: Comput. in Neural Syst., 7(2), 161–247. Tanaka, S. (1990). Theory of self-organization of cortical maps: Mathematical framework. Neural Networks, 3, 625–640. Tanaka, S. (1991). Theory of ocular dominance column formation. Biol. Cybern., 64, 263–272. Triesch, J. (1995). Metrik im visuellen System (Internal Rep. No. IR-INI 95-05). Bochum: Institut fur ¨ Neuroinformatik, Ruhr-Universit¨at Bochum. von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100. von der Malsburg, C. (1995). Network self-organization in the ontogenesis of
716
Laurenz Wiskott and Terrence Sejnowski
the mammalian visual system. In S. F. Zornetzer, J. Davis, and C. Lau (Eds.), An introduction to neural and electronic networks (pp. 447–463). San Diego: Academic Press. von der Malsburg, C., & Willshaw, D. J. (1977). How to label nerve cells so that they can interconnect in an ordered fashion. Proc. Natl. Acad. Sci. (USA), 74, 5176–5178. von der Malsburg, C., & Willshaw, D. J. (1981). Differential equations for the development of topological nerve fibre projections. SIAM-AMS Proceedings, 13, 39–47. Whitelaw, D. J., & Cowan, J. D. (1981). Specificity and plasticity of retinotectal connections: A computational model. J. Neuroscience, 1(12), 1369–1387. Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proc. R. Soc. London, B194, 431–445. Wiskott, L., & von der Malsburg, C. (1996). Face recognition by dynamic link matching. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral interactions in the cortex: structure and function (Chap. 11) [Electronic book]. Austin, TX: UTCS Neural Networks Research Group. Available from http://www.cs.utexas.edu/users/nn/web-pubs/htmlbook96/. Received April 25, 1997; accepted September 3, 1997.
LETTER
Communicated by Helge Ritter
Breaking Rotational Symmetry in a Self-Organizing Map Model for Orientation Map Development M. Riesenhuber Department of Brain and Cognitive Sciences and Center for Biological and Computational Learning, Massachusetts Institute of Technology, E25-221, Cambridge, MA 02139, U.S.A.
H.-U. Bauer D. Brockmann T. Geisel ¨ Str¨omungsforschung, Postfach 28 53, 37018 G¨ottingen, Max-Planck-Institut fur Federal Republic of Germany
We analyze the pattern formation behavior of a high-dimensional selforganizing map (SOM) model for the competitive projection of ONcenter-type and OFF-center-type inputs to a common map layer. We mathematically show, and numerically confirm, that even isotropic stimuli can drive the development of oriented receptive fields and an orientation map in this model. This result provides an important missing link in the spectrum of pattern formation behaviors observed in SOM models. Extending the model by including further layers for binocular inputs, we also investigate the combined development of orientation and ocular dominance maps. A parameter region for combined patterns exists; corresponding maps show a preference for perpendicular intersection angles between iso-orientation lines and ocularity domain boundaries, consistent with experimental observations. 1 Introduction Topographic maps are a ubiquitous pattern of organization in the brain. Among the most intensely investigated such patterns are orientation maps and ocular dominance maps in the visual cortex. Various models have been formulated that generate topographic maps as a consequence of activitydriven self-organization processes (for two recent reviews, see Erwin, Obermayer, & Schulten, 1995; Swindale, 1996). The simulated maps coincide with observed maps in many aspects. Yet distinctive differences remain between simulated and observed maps, as well as between simulated maps from different modeling frameworks. The better we can relate such differences to specific underlying assumptions, the more differences we can eliminate, and the more experimental observations we can account for within uniNeural Computation 10, 717–730 (1998)
c 1998 Massachusetts Institute of Technology °
718
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
versal modeling frameworks, the more stable our understanding of map self-organization processes will become. A few years ago, two map formation models were presented that generate oriented receptive fields from a competition of ON-center and OFF-center cell responses in the lateral geniculate nucleus (LGN) (Miller, 1992, 1994; Miyashita & Tanaka, 1992). These models elegantly explain how orientationselective simple cell responses could be due to a self-organization process driven by nonpatterned input activity and could take place even before birth. Without discussing any of the details of these models, we note that their inputs are nonoriented, yet the resulting patterns break this symmetry. A third, rather widely applied framework for map development is Kohonen’s self-organizing map (SOM) algorithm. SOM-based models have sucessfully accounted for various aspects of visual (Obermayer, Ritter, & Schulten, 1990; Obermayer, Blasdel, & Schulten, 1992; Goodhill, 1993; Wolf, Bauer, & Geisel, 1994; Wolf, Bauer, Pawelzik, & Geisel, 1996; Bauer, Brockmann, & Geisel, 1997), auditory (Martinetz, Ritter, & Schulten, 1988) and somatosensory (Ritter & Schulten, 1986; Andres, Schluter, ¨ Spengler, & Dinse, 1994) map formation. Yet in simulations of SOM-based models for ONcenter- and OFF-center-cell competition, a break of rotational symmetry has not been observed so far, despite a lengthy search by several groups. This negative outcome could be the consequence of a suboptimal selection of parameters, or it could be the fingerprint of a fundamental difference between SOM-based models and the models by Miyashita and Tanaka, and by Miller. Clarification of this issue is an interesting problem, not only with regard to explanations of orientation map development but, in particular, with regard to theoretical consistency between modeling frameworks. In this article, we report that SOMs can break rotational symmetry, albeit in a quite small regime of parameters. Using a recently described analysis technique (Riesenhuber, Bauer, & Geisel, 1996; Bauer et al., 1997), we first mathematically analyze the pattern formation behavior of the corresponding SOM model. After a brief introduction to the SOM in the second section, we describe analytical results in the third section. Guided by the mathematical analysis, we also performed simulations of the model; the results are given in the fourth section. Finally, we investigate the development of combined orientation and ocular dominance maps. 2 “High-Dimensional” SOM Model for the Development of Orientation Maps from Nonoriented Stimuli Neurons in an SOM are characterized by positions r in a map lattice A and receptive fields wr in a map input space V . The input space is assumed to consist of one (or several) layer(s) of input channels. The typically large number of input channels give rise to the notion of a high-dimensional SOM, as opposed to a feature map description. As a consequence, stimuli v ∈ V are activity distributions, and receptive fields wr are synaptic weight distri-
Orientation Map Development
719
butions. A stimulus v is mapped onto that neuron s ∈ A, whose receptive field ws matches v best, s = arg max{wr · v}.
(2.1)
r∈A
Presenting a random sequence of stimuli and performing adaptation steps, 1wr = ²h(r − s)(v − wr ),
(2.2)
the internal shape of individual receptive fields as well as the map layout self-organize simultaneously. The neighborhood function h(r − s), µ
kr − sk2 h(r − s) = exp − 2σ 2
¶ ,
(2.3)
ensures that neighboring neurons align their receptive fields; that is, it imposes topography on the map. A comprehensive treatment of many theoretical and application-related aspects of SOMs can be found in Ritter, Martinetz, and Schulten (1992) and Kohonen (1995). Within this general framework, we now consider a projection geometry analogous to that proposed by Miller (1992, 1994) and Miyashita and Tanaka (1992). Cells in ON-center and OFF-center input layers project to the map layer. As would result from a filtering of pointlike retinal stimuli by thalamic ON-center and OFF-center cells, we assume our stimuli to consist of an activity peak in one layer, plus an activity annulus in the other layer. Mathematically, stimuli are represented as difference-of-gaussians (DOG) (stimulus center position: x0 , widths of the two gaussians: σ1,2 , relative amplitude of the gaussians: k), Ã
−kx − x0 k2 a(x; x0 ) = exp 2σ12
!
Ã
−kx − x0 k2 − k exp 2σ22
! .
(2.4)
Furthermore, a• (x; x0 ) = [a(x; x0 )]+ denotes the activity distribution of the central peak of the DOG, and a◦ (x; x0 ) = [−a(x; x0 )]+ the annulus-shaped activity distribution corresponding to the negative part of the DOG ([·]+ is the rectification operator). Naturally, ON-type and OFF-type stimuli are represented as two-component vectors, vON = (a• , a◦ ),
vOFF = (a◦ , a• ),
(2.5)
each component describing the (nonnegative) activity distribution in one of the input layers. In the simulations, the center positions x0 and polarity (that is, whether the stimulus is vON or vOFF ) are chosen at random.
720
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
3 Mathematical Results on the Formation of Oriented Receptive Fields Recently, Riesenhuber et al. (1996) and Bauer et al. (1997) described a new technique to calculate conditions on stimulus and map parameters for the emergence of nontrivial patterns in high-dimensional SOMs. This technique makes use of a distortion measure Ev , which is evaluated for different potentially stable states of the map and is assumed to be minimized by the SOM algorithm. A crucial feature of the method is the way in which “potentially stable states” of a SOM are formalized. Although an explicit characterization of such states in terms of the weight vectors wr seems impossible without actually simulating the SOM, the states can also be characterized by the way they distribute stimuli among map neurons (the tesselation). This is specific to the SOM, where the winner-take-all mapping rule (see equation 2.1) assigns a particular map neuron to each stimulus. Denoting by Är all stimuli that are mapped to neuron r (the Voronoi cell of r), we define a distortion measure, X h(r − r0 ) w(r, r0 ), (3.1) Ev = r,r0
0
w(r, r ) =
X X v∈Är
(v − v0 )2 .
(3.2)
v0 ∈Är0
Each term in Ev consists of the mean squared difference w(r, r0 ) between stimuli within the same, or between neighboring, Voronoi cells, weighted by the neighborhood function h(r − r0 ). The definition of Ev is motivated by the analogy of SOMs to vector quantizers (see Riesenhuber et al., 1996). Qualitatively different states of a map yield different values of Ev because they correspond to different tesselations {Är }. What are the typical tesselations in the present context? To simplify the analysis, we assume an equal number of ON-center input channels, OFF-center input channels, and map neurons, with a basically retinotopic projection. We further assume that the stimulus center positions x0 are constrained to the positions of the input channels, resulting in twice as many stimuli as map neurons. Three qualitatively different possible map states can be distinguished: 1. System B : Each neuron responds to both an ON- and an OFF-stimulus, each located at the same retinal position. This tesselation yields neurons with orientation-insensitive receptive fields. 2. System S : As in system B , each neuron responds to stimuli of both polarities, but now displaced one step along one retinal coordinate. The displacement breaks isotropy. It causes the receptive fields to exhibit internal ON-center and OFF-center structure, with orientation specificity.
Orientation Map Development
721
3. System O: Each neuron responds to two retinally neighboring stimuli of identical polarity. Although this tesselation induces an orientation specificity, it also breaks the symmetry between ON-center and OFF-center inputs to each neuron. Neurons segregate into ON-centerand OFF-center-dominated populations, analogous to an ocular dominance map. While this state is an imaginable (and numerically observed, see below) state of SOMs, we do not consider this state as biologically interesting. To evaluate w(r, r0 ) and Ev for these tesselations, we need to consider the difference δ s between two stimuli of same polarity, separated by a distance of 1x0 = x0 − x00 , δ s (1x0 ) =
X¡
¢2
a• (x, x0 ) − a• (x, x00 )
¡ ¢2 + a◦ (x, x0 ) − a◦ (x, x00 ) ,
(3.3)
x
and the difference δ a between two oppositely polarized stimuli, δ a (1x0 ) =
X¡
¢2
a• (x, x0 ) − a◦ (x, x00 )
¡ ¢2 + a◦ (x, x0 ) − a• (x, x00 ) .
(3.4)
x
Using these distances, and exploiting the isotropy with respect to one retinal coordinate, we obtain for the interaction terms w (see equation 3.2): wB (1r) = δ s (1r) + δ a (1r), 1 wS (1r) = δ s (1r) + {δ a (1r − 1) + δ a (1r + 1)}, 2 X1 1 O i δ (1r) + {δ i (1r + 1) + δ i (1r − 1)}. w (1r) = 4 i=s,a 2
(3.5) (3.6) (3.7)
Inserting equations 3.5–3.7 into equation 3.1 and performing a numerical S O summation, we obtain the distortion measures EB v , Ev , and Ev . This analysis predicts that, depending on the stimulus parameters σ1,2 , k, and the map neighborhood parameter σ , different final states of the map will be attained. Figure 1a shows a state diagram in the σ, k-plane, at widths σ1 = 1, σ2 = 2. At large values of the annulus amplitude k, receptive fields segregate into “monopolar” ON- and OFF-receptive fields. Large values of the neighborhood width σ prohibit internal structure of the receptive fields to occur. Only in a rather small regime of σ, k-values, the biologically interesting map state S is attained. 4 Numerical Results To corroborate the mathematical analysis above and to obtain orientation maps, we also investigated the model numerically. In a first series we ran
722
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
Figure 1: Analytical (a) and numerical (b) phase diagrams for the SOM orientation map model. The parameters σ and k denote the neighborhood width of the SOM algorithm and the annulus amplitude of the stimuli, respectively. +: the nonoriented state B. ?: the oriented state S. 4: the (nonbiological) state O. In both diagrams, further parameters were σ1 = 1, σ2 = 2, in the numerical maps we applied 105 learning steps, learning step size was decreased from ²init = 0.2 to ²final = 0.01, and the neighborhood width σ was kept constant during the simulation.
simulations with 16 × 16 neuron maps, at various values of σ and k. Classifying the resulting receptive fields with regard to the states B , S , O, we obtained the state diagram depicted in Figure 1b, which corresponds quite well to the mathematically obtained diagram. To obtain receptive fields with a fine spatial resolution, we also simulated maps with 48 × 48-channel input layers, projecting to a 24 × 24-neuron map layer. Figure 2 shows exemplary receptive fields of neurons in a 4 × 8 segment of the map. The receptive fields show a multilobed structure and are clearly oriented. The variation of orientation over the whole map is shown in an angle map in Figure 3, where the preferred orientation of each cell is given by a circular color code. As in orientation maps obtained by optical imaging methods in the cat (Bonhoeffer & Grinvald, 1991) or monkey (Blasdel & Salama, 1986), we find a patchy arrangement of different preferred orientations and also pinwheel-like singularities. In addition to the map of preferred orientation angles of the receptive fields, we also calculated the phases of the receptive field, that is, the shift angles that would occur in a Gabor function fit to the receptive field profile (see the caption for Figure 3). It has been hypothesized that the phase angle is also arranged in a topographic fashion in the primary visual cortex, with a topology of the combined orientation and phase stimulus space equivalent to that of a Klein bottle (Tanaka, 1995). In our simulated maps, we find the phase angle to vary indeed in a smooth way in many areas of the map (see the arrows in Figure 3). Phase and orientation values are not correlated.
Orientation Map Development
723
Figure 2: Sample receptive fields of an SOM orientation map (4 × 8 segment of a 24 × 24-neuron SOM, with two 48 × 48 input layers, periodic boundary conditions). Further parameters of the simulation were: σ1 = 3.4, σ2 = 6.8, k = 0.3, width of SOM neighborhood function σ = 0.85, 3 × 105 learning steps, ² = 0.1 → 0.01. For each neuron, the difference between ON-center and OFFcenter cell layer connection strengths is shown as a gray-value image. The gray background means no connection strength; black and white regions indicate preferred connections to the ON- or OFF-center layers.
2π -singularities can be found in the phase, at locations other than those of orientation singularities. All the topological properties we could identify in our map are consistent with Tanaka’s suggestion of the Klein bottle topology. 5 Development of Combined Orientation and Ocular Dominance Maps Finally, we complemented the two ON-center and OFF-center input cell layers for one eye by two further ON-center and OFF-center cell layers for the other eye. The repertoire of possible patterns in this extended model should go beyond merely oriented receptive fields in an orientation map. It should also include monocular receptive fields and ocular dominance maps, and combinations of the two types of patterns. Stimuli in the extended model consist of activity distributions in all four input layers. Although the difference in the shape of the activity distributions between ON-center and OFF-center layers is the same as before, the partial stimuli are assumed to be of identical shape in the corresponding
724
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
o o
+
+
Figure 3: The complete map described in the caption of Figure 2, now depicted as an angle map using a circular color code for the preferred orientation angles of each cell. Superimposed on the color-coded orientation map, we show (as arrows) the phase angle of the receptive fields. The phase angle is calculated by rotating each receptive field by the inverse of its preferred orientation (so that the lobes of an ideal bilobed receptive would fall on different sides of the horizontal meridian after rotation) and then determining the phase shift as compared to a dampened sine wave (i.e., a Gabor filter) of the receptive field profile. A rightward arrow designates a phase of zero degrees, that is, the receptive field is bilobed with the positive lobe being in the upper hemisphere after rotation. An upward arrow represents a phase angle of 90 degrees, that is, a trilobed receptive field with the negative lobe in the middle. Crosses and circles show examplary locations of orientation and phase singularities, respectively.
layers for either eye, but attenuated by a factor of c, 0 ≤ c ≤ 1, in one of the eyes (analogous to the assumptions underlying a recently analyzed SOM-based model for ocular dominance formation; Bauer et al., 1997; see also Goodhill, 1993). This yields, apart from the random variations of the
Orientation Map Development
725
stimulus center, four different types of stimuli:
vL,ON
vR,ON
a• a◦ = c · a• , c · a◦ c · a• c · a◦ = a• , a◦
vL,OFF
vR,OFF
a◦ a• = c · a◦ , c · a• c · a◦ c · a• = a◦ . a•
(5.1)
The analysis technique introduced in section 3 can be applied to this more complicated case as well, considering the different tesselation possibilities for four stimuli per neuron. To save space, we omit the details of the rather lengthy calculations and proceed to a description of the results. Since we have five parameters in the model now (σ1 , σ2 , k for the DOG, c for the between-eye correlations, and σ as a map control parameter), the full state diagram cannot be depicted. Instead, we show in Figure 4a a section in the k-c-plane. Regions with orientation only and with ocular dominance only are found. Most important, there is a region with a combination of both orientation and ocular dominance at small values of k and c. In computer simulations, we found maps with monocular receptive fields, oriented receptive fields, or combined monocular, oriented receptive fields, each in the parameter regimes predicted by the analysis (see Figure 4b). Figure 5 shows one combined map in a plot that displays the boundaries of the iso-ocularity domains superimposed on the color-coded orientation map. Determining the transition lines between iso-ocularity regions in the simulated map, and computing the intersection angles with the iso-orientation lines at these locations, we compiled an angle histogram (see Figure 6). Iso-orientation lines intersect the boundaries between iso-ocularity regions preferably at larger angles, consistent with experimental observations by Bartfeld and Grinvald (1992) and Obermayer and Blasdel (1993). 6 Discussion We showed mathematically and numerically how in a high-dimensional SOM model for the competitive projection of ON-center and OFF-center inputs to a common map layer, a rotation symmetry of stimuli can be broken to yield oriented receptive fields. This pattern formation behavior can be described only in a high-dimensional map formation framework, which also allows consideration of the internal structure of receptive fields. In low-
726
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
a
b
1.0
0.8
1.0
0.8
0.6
0.6
c
c 0.4
0.4
0.2
0.2
0.0 0.0
0.2
0.4
0.6
k
0.8
1.0
0.0 0.0
0.2
0.4
0.6
0.8
1.0
k
Figure 4: Phase diagram for combined orientation and ocular dominance maps, as a function of ON-OFF stimulus parameter k and between-eye-correlation parameter c, at σ1 = 0.84, σ2 = 1.68, σ = 1.0. ◦: states with unoriented receptive fields. +: oriented, binocular receptive fields. ?: monocular receptive fields with type O ON-center OFF-center cell segregation. ¤: monocular, oriented receptive fields. (a) The analytically obtained diagram. (b) The diagram resulting from classification of the receptive fields of simulated maps.
dimensional feature map models, where each map dimension corresponds to a particular stimulus and receptive field parameter, a nontrivial structure along a particular dimension cannot be obtained if the stimuli have no extension along this dimension (as is the case for round stimuli with respect to an orientation dimension). A break of isotropy has already been observed in other frameworks for map development models (Miyashita & Tanaka, 1992; Miller, 1992, 1994). The results we describe here for the SOM framework close a somewhat puzzling gap in the qualitative behavior of these different frameworks, reducing the relative importance of the specific mathematical formalizations, and increasing the importance of common mechanisms. Our results are based not only on numerical simulations but also on a mathematical analysis. The coincidence of the mathematically derived parameter regimes for particular map structures and the numerical observation of these structures underlines the value of our energy formalism to analyze pattern formation in high-dimensional SOM models and to guide simulations of these models. For the case of combined ocular dominance and orientation maps, the analysis turned out to involve a substantially larger number of map patterns, which need to be considered. The increase in effort necessary for the two-variable case suggests that this kind of analysis is not feasible for maps with three underlying symmetries. For the com-
Orientation Map Development
727
Figure 5: Combined ocular dominance and orientation map, with stimulus parameters as described in the caption of Figure 4. The orientation angle map is given in color code (as in Figure 3); the iso-ocularity domain boundaries are superimposed as black lines. Further parameters are: 24 × 24-neuron SOM, 34 × 34-channel input layers, σ = 0.71, 3 × 105 learning steps, ² = 0.1 → 0.01, periodic boundary conditions.
bined maps, we identified a parameter regime in which ocular dominance and orientation structure stably coexist. In the correlation-based framework, combined maps were numerically found to exist (Erwin & Miller, 1995), but the mathematical underpinnings were judged controversial (Piepenbrock, Ritter, & Obermayer, 1996, 1997).
728
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
40
35
30
counts
25
20
15
10
5
0
0
10
20
30
40
50
60
70
80
90
angle of intersection
Figure 6: Histogram of angle of intersection of iso-orientation lines and isoocularity domain boundaries, computed for all cells along iso-ocularity domain boundaries of the map shown in Figure 5.
Acknowledgments We gratefully acknowledge interesting discussions with Ken Miller and Fred Wolf. This work has been supported by the Deutsche Forschungsgemeinschaft through Sonderforschungsbereich 185 Nichtlineare Dynamik, TP E6.
References Andres, M., Schluter, ¨ O., Spengler, F., & Dinse, H. R. (1994). A model of fast and reversible representational plasticity using Kohonen mapping. In M. Marinao & P. G. Morasso (Eds.), Proceedings of the ICANN94, Sorrento (pp. 306–309). Berlin: Springer-Verlag. Bartfeld, E., & Grinvald A. (1992). Relationship between orientation preference pinwheels, cytochrome oxidase blobs and ocular-dominance columns in primate striate cortex. Proc. Nat. Acad. Sci. USA, 89, 11905–11909. Bauer, H.-U., Brockmann, D., & Geisel, T. (1997). Analysis of ocular dominance pattern formation in a high-dimensional self-organizing-map-model. Network, 8, 17–33.
Orientation Map Development
729
Blasdel, G. G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585. Bonhoeffer, T., & Grinvald, A. (1991). Iso-orientation domains in cat visual cortex are arranged in pinwheel-like patterns. Nature, 353, 429–430. Erwin, E., & Miller, K. D. (1995). Modeling joint development of ocular dominance and orientation maps in primary visual cortex. In J. Bower (Ed.), Computational neuroscience (pp. 179–184). Boston: Kluwer. Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neur. Comp., 7, 425–468. Goodhill, G. J. (1993). Topography and ocular dominance: A model exploring positive correlations. Biol. Cyb., 69 109–118. Kohonen, T., (1995). The self-organizing map. Berlin: Springer-Verlag. Martinetz, T., Ritter, H., & Schulten, K. (1988). Kohonen’s self-organizing map for modeling the formation of the auditory cortex of a bat. In SGAICOProceedings “Connectionism in Perspective,” 403–412. Miller, K. D. (1992). Development of orientation columns via competition between ON- and OFF-center inputs. NeuroRep., 3, 73–79. Miller, K. D. (1994). A model for the development of simple-cell receptive fields and the ordered arrangement of orientation columns through activity dependent competition between On- and Off-center inputs. J. Neurosci., 14, 409–441. Miyashita, M., & Tanaka, S. (1992). A mathematical model for the selforganization of orientation columns in visual cortex. NeuroRep., 3, 69–72. Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129. Obermayer, K., Blasdel, G. G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45, 7568–7569. Obermayer K., Ritter, H., & Schulten, K. (1990). A principle for the formation of the spatial structure of cortical feature maps. Proc. Nat. Acad. Sci. USA, 87, 8345–8349. Piepenbrock, C., Ritter, H., & Obermayer, K. (1996). Linear correlation-based learning models require a two-stage process for the development of orientation and ocular dominance. Neur. Proc. Lett., 3, 1–7. Piepenbrock, C., Ritter, H., & Obermayer, K. (1997). The joint development of orientation and ocular dominance: Role of constraints. Neur. Comp., 9, 959– 970. Riesenhuber, M., Bauer, H.-U., & Geisel, T. (1996). Analyzing phase transitions in high-dimensional self-organizing maps. Biol. Cyb., 75, 397–407. Ritter, H., Martinetz, T., & Schulten, K., (1992). Neural computation and selforganizing maps. Reading, MA: Addison-Wesley. Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen’s selforganizing sensory mapping. Biol. Cyb., 54, 99–106. Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network, 7, 161–247. Tanaka, S. (1995). Topological analysis of point singularities in stimulus preference maps of the primate visual cortex. Proc. R. Soc. London B, 261, 81–88.
730
M. Riesenhuber, H.-U. Bauer, D. Brockmann, and T. Geisel
Wolf, F., Bauer, H.-U., & Geisel, T. (1994). Formation of field discontinuities and islands in visual cortical maps. Biol. Cyb., 70, 525–531. Wolf, F., Bauer, H.-U., Pawelzik, K., & Geisel, T. (1996). Organization of the visual cortex. Nature, 382, 306–307. Received May 6, 1997; accepted September 18, 1997.
LETTER
Communicated by Steven Nowlan
Nonlinear Time-Series Prediction with Missing and Noisy Data Volker Tresp Reimar Hofmann Siemens AG, Corporate Technology, Department of Information and Communications, 81730 Munich, Germany
We derive solutions for the problem of missing and noisy data in nonlinear time-series prediction from a probabilistic point of view. We discuss different approximations to the solutions—in particular, approximations that require either stochastic simulation or the substitution of a single estimate for the missing data. We show experimentally that commonly used heuristics can lead to suboptimal solutions. We show how error bars for the predictions can be derived and how our results can be applied to Kstep prediction. We verify our solutions using two chaotic time series and the sunspot data set. In particular, we show that for K-step prediction, stochastic simulation is superior to simply iterating the predictor. 1 Introduction Neural networks have been applied successfully in numerous applications to nonlinear time-series prediction (Weigend & Gershenfeld, 1994). Common problems in time-series prediction are missing and noisy data. The goal is to obtain optimal predictions even if some measurements are unavailable, are not recorded, or are uncertain. For linear systems, efficient algorithms exist for prediction with missing data (Kalman, 1960; Shumway & Stoffer, 1982). In particular, the Kalman filter is based on a state-space formulation and achieves optimal predictions with arbitrary patterns of missing data. For nonlinear systems, the extended Kalman filter, based on a first-order series expansion of the nonlinearities, can be employed. The extended Kalman filter is suboptimal (Bar-Shalom & Li, 1993) and summarizes past data by an estimate of the means and the covariances of the variables involved. The extended Kalman filter fails to give good predictions if the system is not approximated well by a localized linearization, that is, for highly nonlinear systems, in particular if the inaccuracies in the approximations propagate through several iterations, as in K-step prediction. In this article, we propose stochastic sampling, which converges to the optimal solution as the number of samples approaches infinity and can handle arbitrary patterns of noisy and missing data. We demonstrate the benefits of stochastic sampling using three examples. Neural Computation 10, 731–747 (1998)
c 1998 Massachusetts Institute of Technology °
732
Volker Tresp and Reimar Hofmann
The related issue of training a time-series model with missing and noisy data will be addressed in a companion article (Tresp & Hofmann, 1997). In section 2 we derive equations for prediction with missing data. As in the case of regression and classification with missing data (Little & Rubin, 1987; Ahmad & Tresp, 1993; Buntine & Weigend, 1991), the solution consists of integrals over the unknown variables weighted by the conditional probability density of the unknown variables given the known variables. In time-series prediction, we can use the fact that the unknown data themselves are part of the time series. By unfolding the time series in time, we obtain a Bayesian network (Pearl, 1988; Jensen, 1996) (a probabilistic graph with directed arcs) that allows us to clarify dependencies between the variable to be predicted and the measurements that provide information about that variable. In section 3 we generalize the results toward noisy measurements. For nonlinear systems, the integrals cannot be solved in closed form and have to be approximated numerically. In section 4 we propose stochastic sampling, which has the advantage that asymptotically (i.e., with the number of samples approaching infinity) we obtain the optimal prediction. As an alternative approximation, we propose that maximum likelihood estimates can be substituted for the missing data. Furthermore, we discuss solutions based on an iterative approximation of the information provided by past data using probability density estimates. In section 5 we present experimental results demonstrating the superiority of the stochastic sampling approach. In particular, we show that for K-step prediction, stochastic sampling is superior to both simply iterating the system and the extended Kalman filter (the latter two turn out to be identical for K-step prediction). In section 6 we present conclusions. 2 Prediction with Missing Data 2.1 An Illustrative Example. Consider the situation depicted in Figure 1 (top). The time-series model is yt = f (yt−1 , yt−2 ) + ²t , where ²t is additive independent, identically distributed (i.i.d.) noise and f ( ) is a nonlinear function. The goal is to predict yt based on past measurements. Let us assume that yt−2 is missing. A common procedure is to obtain an estimate yˆ t−2 of the missing value and then substitute that estimate in the predictive model, yˆ t = f (yt−1 , yˆ t−2 ). In some applications, it might make sense to substitute for the missing value the previous value yˆ t−2 = yt−3 or to substitute the predicted value
Figure 1: (Top) y_{t-2} is missing, and the goal is to predict y_t. The estimate ŷ_t is dependent on the substituted value for y_{t-2}. (Bottom) A time series unfolded in time. Open squares indicate unknown variables, and filled squares indicate measured variables. The arrows indicate that the next realization of the time series can be predicted from only the two most recent values, y_t = f(y_{t-1}, y_{t-2}) + ε_t. Here, y_{t-2} is assumed to be missing. The bracket indicates the nodes in the Markov boundary of y_{t-2} (see section 4.1).
ŷ_{t-2} = f(y_{t-3}, y_{t-4}). Both heuristics might often work in practice, but note the following two points:

1. Since in our example y_{t-1} is known, it should improve our estimate of y_{t-2}.

2. Since y_{t-2} is only estimated, it should be possible to achieve better predictions by substituting not just one estimate but several estimates and then averaging the predictions based on those estimates.

In the following sections, we show that a theoretical analysis confirms these intuitions.
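To make the two points concrete, here is a minimal sketch (ours, for illustration only; it assumes a hypothetical order-2 map f and gaussian noise of known standard deviation): candidates for the missing y_{t-2} are drawn from P(y_{t-2}|y_{t-3}, y_{t-4}) and reweighted by P(y_{t-1}|y_{t-2}, y_{t-3}), so the known value y_{t-1} sharpens the estimate (point 1), and many weighted predictions are averaged rather than substituting a single estimate (point 2).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1  # assumed (hypothetical) noise standard deviation

def f(y1, y2):
    # Hypothetical order-2 nonlinear map standing in for the trained model:
    # y_t = f(y_{t-1}, y_{t-2}) + eps_t.
    return 1.0 - 1.4 * y1 ** 2 + 0.3 * y2

def predict_with_missing(y_tm1, y_tm3, y_tm4, S=1000):
    """Estimate E(y_t | y_{t-1}, y_{t-3}, y_{t-4}) when y_{t-2} is missing.

    Proposal: P(y_{t-2} | y_{t-3}, y_{t-4}); importance weights:
    P(y_{t-1} | y_{t-2}, y_{t-3}), so the known future value y_{t-1}
    informs y_{t-2}, and many weighted predictions are averaged."""
    y_tm2 = f(y_tm3, y_tm4) + sigma * rng.standard_normal(S)     # samples of y_{t-2}
    w = np.exp(-0.5 * ((y_tm1 - f(y_tm2, y_tm3)) / sigma) ** 2)  # unnormalized weights
    w = w / w.sum()
    return float(np.sum(w * f(y_tm1, y_tm2)))

def predict_plugin(y_tm1, y_tm3, y_tm4):
    """Common heuristic: substitute the single estimate f(y_{t-3}, y_{t-4})."""
    return f(y_tm1, f(y_tm3, y_tm4))
```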
2.2 Theory. Let y_t be the value of the discrete time series at time t. We assume that the underlying probabilistic model of the time series is of order N and can be described by

y_t = f(y_{t-1}, y_{t-2}, ..., y_{t-N}) + ε_t,   (2.1)

where f(·) is either known or approximated sufficiently well by a function approximator such as a neural network. ε_t is assumed to be additive i.i.d. zero-mean noise with probability density P_ε(ε) and typically represents unmodeled dynamics. The conditional probability density of the predicted value of the time series is then

P(y_t | y_{t-1}, y_{t-2}, ..., y_{t-N}) = P_ε(y_t − f(y_{t-1}, y_{t-2}, ..., y_{t-N})).   (2.2)

Often gaussian noise is assumed, such that

P(y_t | y_{t-1}, y_{t-2}, ..., y_{t-N}) = G(y_t; f(y_{t-1}, ..., y_{t-N}), σ²),   (2.3)

where G(x; c, σ²) is our notation for a normal density evaluated at x with center c and variance σ². It is convenient to unfold the system in time, which leads to the system shown in Figure 1 (bottom). The realizations of the time series can now be considered random variables or nodes in a Bayesian network, in which directed arcs indicate direct dependencies (Pearl, 1988). The joint probability density in a Bayesian network is the product of all conditional densities and the prior probabilities,

P(y_1, y_2, ..., y_t) = P(y_1, ..., y_N) \prod_{l=N+1}^{t} P(y_l | y_{l-1}, ..., y_{l-N}),   (2.4)
where P(y_1, ..., y_N) is the prior probability of the first N values of the time series. We use the following notation: Y^u_{t2,t1} ⊆ {y_{t1}, y_{t1+1}, ..., y_{t2}} is the set of missing variables from t1 to t2, Y^m_{t2,t1} ⊆ {y_{t1}, y_{t1+1}, ..., y_{t2}} is the set of measurements between t1 and t2, and Y_{t2,t1} = Y^m_{t2,t1} ∪ Y^u_{t2,t1} (t1 ≤ t2). The theory of Bayesian networks is helpful to decide which past measurements provide information about y_t. Let A and B be nodes in a directed acyclic graph D (in our case, a Bayesian network). A and B are independent given the evidence entered into the network if they are d-separated. The definition of d-separation (Pearl, 1988; Jensen, 1996) follows.

Definition (d-separation). Two variables A and B in a directed acyclic graph are d-separated if for all paths between A and B there is an intermediate variable V such that either (1) the connection is serial or diverging
and the state of V is known, or (2) the connection is converging and neither V nor any of V's descendants has received evidence (in our case, this means that neither V nor any of V's descendants is known).

In other words, A and B are d-separated if every path between both nodes is blocked by condition 1 or 2. An example of a serial connection is → V →, of a diverging connection is ← V →, and of a converging connection is → V ←. We now apply the concept of d-separation to time-series prediction. Let y_{t-L} be the most recent case where N consecutive measurements are known; that is, y_{t-L}, y_{t-L-1}, ..., y_{t-L-N+1} are all known. In this case, y_t is d-separated from measurements previous to t−L−N+1, given y_{t-L}, y_{t-L-1}, ..., y_{t-L-N+1}. Consider Figure 1 (bottom). Here, y_{t-5} is d-separated from y_t by y_{t-3} and y_{t-4}, since these nodes block all paths from y_{t-5} to y_t. The same d-separation is true for all measurements previous to y_{t-5}. y_{t-4}, on the other hand, is not blocked by y_{t-3} and y_{t-1}, since the path y_{t-4} → y_{t-2} → y_t is not blocked.

Following the discussion in the previous paragraph, y_t is independent of measurements earlier than y_{t-L-N+1}, given y_{t-L}, y_{t-L-1}, ..., y_{t-L-N+1}. This means that we have to condition y_t only on the measurements Y^m_{t-1,t-L-N+1}, and we obtain for the expected value of the next realization of the time series

E(y_t | Y^m_{t-1,1}) = \int y_t P(y_t | Y^m_{t-1,t-L-N+1}) \, dy_t
  = \int f(y_{t-1}, ..., y_{t-k}, ..., y_{t-N}) P(Y^u_{t-1,t-N} | Y^m_{t-1,t-L-N+1}) \, dY^u_{t-1,t-N}
  = \int f(y_{t-1}, ..., y_{t-k}, ..., y_{t-N}) P(Y^u_{t-1,t-L+1} | Y^m_{t-1,t-L-N+1}) \, dY^u_{t-1,t-L+1},   (2.5)

where (assuming t − L ≥ N)

P(Y^u_{t-1,t-L+1} | Y^m_{t-1,t-L-N+1}) = \frac{1}{\text{const}} \prod_{l=t-L+1}^{t-1} P(y_l | y_{l-1}, ..., y_{l-N}),

and const = P(Y^m_{t-1,t-L+1} | Y^m_{t-L,t-L-N+1}) is a normalization constant independent of the unknown variables.
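As a small illustration (ours, not from the article) of how the d-separation argument translates into practice, the following sketch finds L and the set of measurements on which y_t has to be conditioned, given which lags are measured:

```python
import numpy as np

def conditioning_window(known, N):
    """Given known[k] == True iff y_{t-k} is measured (k = 1, 2, ...; index 0 unused),
    find the smallest L with y_{t-L}, ..., y_{t-L-N+1} all measured.
    By d-separation, only measurements among y_{t-1}, ..., y_{t-L-N+1}
    can carry information about y_t."""
    for L in range(1, len(known) - N + 1):
        if known[L:L + N].all():              # y_{t-L}, ..., y_{t-L-N+1} all known
            relevant = [k for k in range(1, L + N) if known[k]]
            return L, relevant
    raise ValueError("no run of N consecutive measurements found")

# The example of Figure 1 (bottom): N = 2, lag 2 missing.
known = np.zeros(8, dtype=bool)
known[[1, 3, 4, 5, 6, 7]] = True
print(conditioning_window(known, N=2))        # -> (3, [1, 3, 4])
```

For the example of Figure 1 (bottom), the result says to condition on y_{t-1}, y_{t-3}, and y_{t-4}, in agreement with the discussion above.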
3 Prediction with Noisy Measurements

Let again y_t = f(y_{t-1}, y_{t-2}, ..., y_{t-N}) + ε_t, but now we assume that we have no access to y_t directly. Instead, we measure z_t = y_t + δ_t, where δ_t is independent zero-mean noise (see Figure 2) with probability density P_δ(δ).
Figure 2: The Bayesian network corresponding to the problem of time-series prediction with noisy measurements (N = 2). Open squares indicate unknown variables, and filled squares indicate measured variables.
Let Z_{t-1,1} = {z_1, ..., z_{t-1}} and Y_{t,1} = {y_1, ..., y_t}. The joint probability density is

P(Y_{t,1}, Z_{t-1,1}) = P(y_1, ..., y_N) \prod_{l=N+1}^{t} P(y_l | y_{l-1}, ..., y_{l-N}) \times \prod_{l=1}^{t-1} P(z_l | y_l),   (3.1)

with P(z_l | y_l) = P_δ(z_l − y_l). The corresponding Bayesian network is shown in Figure 2. Note that for each known variable z_{t-k}, there is a path to y_t that is not blocked by any of the other known variables and has no converging arrows, that is, the path z_{t-k} ← y_{t-k} → y_{t-k+1} → ··· → y_t. This means that y_t is dependent on all past measurements. The expression for the expected value of the next instance of the time series (prediction) is then

E(y_t | Z_{t-1,1}) = \int f(y_{t-1}, ..., y_{t-N}) P(Y_{t-1,t-N} | Z_{t-1,1}) \, dY_{t-1,t-N}
  = \int f(y_{t-1}, ..., y_{t-N}) P(Y_{t-1,1} | Z_{t-1,1}) \, dY_{t-1,1},   (3.2)

where P(Y_{t-1,1} | Z_{t-1,1}) = (1/const) × P(Y_{t-1,1}, Z_{t-1,1}), which is obtained from equation 3.1; const = P(Z_{t-1,1}) is a normalization constant independent of Y_{t-1,1}. Note that the case of noisy measurements includes the case of missing data. In particular, if we allow the measurement noise to be time dependent (which does not introduce any additional complexity), we can use σ_δ²(t) = 0 for certain measurements and σ_δ²(t) = ∞ for unknown data.

4 Approximations to the Theoretical Solutions

In general, if f(·) is a nonlinear function, the equations we obtained for prediction (equations 2.5 and 3.2) cannot be solved analytically and must be approximated numerically. First, we propose an approximation based on
stochastic simulation that provides the optimal prediction when the number of samples approaches infinity. As a second approximation, we discuss an approach where the most likely values are substituted for the missing data. The latter approach tends to be computationally less expensive but provides biased predictions. Finally, we discuss the extended Kalman filter, which can be used on-line and is based on a first-order series expansion of the nonlinearities.

4.1 Stochastic Simulation. We will discuss a solution based on stochastic simulation. Note that all solutions have the general form \int h(U, M) P(U|M) dU, where U is a set of unknown variables and M is a set of known variables. An integral of this form can be solved by drawing random samples of the unknown variables following P(U|M). Let U_1, ..., U_S denote these samples. Then we can approximate

\int h(U, M) P(U|M) \, dU ≈ \frac{1}{S} \sum_{s=1}^{S} h(U_s, M).

The problem now reduces to sampling from P(U|M). Let us first assume that only one variable is missing. Then the problem reduces to sampling from a univariate distribution, which can be done using sampling-importance-resampling or other sampling techniques (Bernardo & Smith, 1994). If more than one realization is missing, the situation becomes more complicated. The reason is that the unknown variables are in general dependent, and we have to draw from the joint probability distribution of all unknowns. A general solution to this problem is Markov chain Monte Carlo sampling, with the Metropolis-Hastings algorithm and Gibbs sampling being the two most important representatives. We briefly describe the latter.

In Gibbs sampling, we initialize the unknown variables either randomly or, better, with reasonable initial values. Then we select one of the unknown variables u_i ∈ U, pick a sample from the one-dimensional conditional density P(u_i | MB(i)), and set u_i to that value. MB(i) is the Markov boundary of u_i: we have to condition only on the nodes in the Markov boundary since, by its definition, the node u_i is d-separated from the remaining variables in the Bayesian network once all nodes in the Markov boundary are known. The Markov boundary of a node consists of its direct parents, its direct successors, and all direct parents of its direct successors (Pearl, 1988) (as an example, see Figure 1). Then we repeat the procedure for another unknown variable, and so on. In this way, repeated samples of all unknowns are drawn. Discard the first samples, since they strongly depend on which initial values were chosen. Then, for strictly positive distributions, samples are produced with the correct distribution; that is, for s → ∞, U_s tends in distribution to a joint
random vector whose joint density is P(U|M) (Bernardo & Smith, 1994). Gibbs sampling thus reduces the problem of drawing a sample from the joint density of all unknowns to sequentially drawing samples from the univariate densities of each unknown conditioned on the variables in its Markov boundary.

In the case of missing data, we have to generate samples of all missing data Y^u_{t-1,t-L+1}. In the case of noisy measurements, we even have to sample from all of Y_{t-1,1}. In practice, one would restrict the sampling to a reasonably chosen time window in the past. For independent samples, the variance of an estimated mean is equal to σ_s²/S, where σ_s² is the variance of an individual sample. Unfortunately, samples generated by Gibbs sampling and other Markov chain Monte Carlo sampling techniques are typically highly correlated, such that, depending on the particular problem, a large number of samples might be required for a good estimate. This is particularly true if regions of high probability are separated by regions of low probability, such that the transition between regions has low probability. Another disadvantage is that for each new prediction, we have to perform a separate sampling process. Neal (1993) discusses hybrid Monte Carlo methods and other advanced sampling techniques that try to overcome some of the difficulties associated with dependent samples. Sampling is simple if only samples of future values are required, as in K-step prediction (for details, see section 5.1). The reason is that we can sample forward in time by simply simulating the system. In this procedure, independent samples are generated.

The idea of generating multiple samples from the unknown variables and averaging the responses using those samples confirms the intuition formulated in section 2.1 and is known as multiple imputation in statistical approaches to regression and classification with missing data (Little & Rubin, 1987). The samples can also be used to estimate variances and covariances, from which error bars can easily be derived. For example, if {y_t^s}_{s=1}^{S} are samples generated from y_t, the standard deviation of y_t can be estimated as

stdev(y_t) ≈ \sqrt{\frac{1}{S-1} \sum_{s=1}^{S} (y_t^s − ŷ_t)^2},

and the standard deviation of the estimate ŷ_t = \frac{1}{S} \sum_{s=1}^{S} y_t^s can be estimated as

stdev(ŷ_t) ≈ \sqrt{\frac{1}{S(S-1)} \sum_{s=1}^{S} (y_t^s − ŷ_t)^2}.
4.2 Maximum Likelihood Substitution. The approach consists of substituting the most likely values,

Y^{ml}_{t-1,1} = \arg\max_{Y^u_{t-1,1}} P(Y_{t-1,1}),

for the missing variables. Then we estimate

ŷ_t = f(Y^{ml}_{t-1,t-N}, Y^m_{t-1,t-N}).   (4.1)

Considering the case with one missing variable y_{t-k} and assuming gaussian noise,

y^{ml}_{t-k} = \arg\min_{y_{t-k}} \sum_{l=t-k}^{t-1} (y_l − f(y_{l-1}, y_{l-2}, ..., y_{l-N}))^2;   (4.2)

we simply find the substitution that minimizes the sum of the squared errors. As another interesting case, consider noisy measurements and gaussian noise distributions,

Y^{ml}_{t-1,1} = \arg\min_{Y^u_{t-1,1}} \left[ −\log P(Y_{N,1}) + \frac{1}{2σ_ε²} \sum_{l=N+1}^{t-1} (y_l − f(y_{l-1}, y_{l-2}, ..., y_{l-N}))^2 + \frac{1}{2σ_δ²} \sum_{l=1}^{t-1} (y_l − z_l)^2 \right],

where σ_ε² and σ_δ² are the variances of the two noise sources (see section 3). This is a multidimensional optimization problem. Note that for highly nonlinear systems, equation 4.1 can be a crude estimate of the expected value, and the prediction based on a maximum likelihood estimate of the unknowns can therefore be highly biased.

4.3 Solutions Based on Iterative Density Estimation and the Extended Kalman Filter. We consider the case of prediction with noisy measurements. A solution based on stochastic simulation of equation 3.2 (noisy measurements) means that we have to sample from the space of all unknown variables, y_1, ..., y_t. This becomes intractable for large t. To summarize the information about past measurements more efficiently, we can use

P(Y_{t-1,t-N} | Z_{t-1,1}) = \frac{P(z_{t-1}|y_{t-1}) \int P(Y_{t-2,t-N-1}|Z_{t-2,1}) P(y_{t-1}|Y_{t-2,t-N-1}) \, dy_{t-N-1}}{\int P(z_{t-1}|y_{t-1}) P(Y_{t-2,t-N-1}|Z_{t-2,1}) P(y_{t-1}|Y_{t-2,t-N-1}) \, dY_{t-1,t-N-1}}.   (4.3)
This equation can be derived from the Chapman-Kolmogorov equation and by applying Bayes' rule (Lewis, 1986). The update equation implies that we can summarize all the information provided by the past measurements by approximating P(Y_{t-1,t-N} | Z_{t-1,1}) and use equation 4.3 to update the estimates on-line as time progresses and more measurements become available. If the system is linear and the noise is normally distributed, equation 4.3 can be solved analytically, and the probability densities can be represented by a multidimensional normal distribution. This is the well-known Kalman filter. In general, the integral in equation 4.3 must be solved numerically, and an appropriate representation for the conditional density has to be found. Neural network techniques for approximating joint and conditional densities exist (Neuneier, Hergert, Finnoff, & Ormoneit, 1994; Bishop, 1994). In Lewis (1986) it is shown that for continuous time systems, the time update leads to the Fokker-Planck equation, which can be solved in only a few simple cases. The problem can be simplified by requiring only that iterative estimates of the mean and the covariance be found. Unfortunately, this approach leads to computationally intractable solutions (Lewis, 1986). The update equations become tractable by using a first-order series expansion of the nonlinearities (Lewis, 1986; Bar-Shalom & Li, 1993), which leads to the extended Kalman filter. The extended Kalman filter can be used for both discrete and continuous time systems and summarizes past data by an estimate of the mean and the covariance of the variables involved; it is suboptimal in the sense that even with a perfect model, due to the linearization of the system, it does not provide optimal predictions (Lewis, 1986; Bar-Shalom & Li, 1993). The Kalman filter is an iterative algorithm and has the great advantage that it can be used on-line. It has also been used for training neural networks and for neural control (Singhal & Wu, 1989; Kadirkamanathan & Niranjan, 1991; Puskorius & Feldkamp, 1994).

5 Experiments

5.1 K-Step Prediction. K-step prediction can be considered a special case of prediction with missing data: y_t must be predicted with y_{t-1}, ..., y_{t-K+1} missing. In this case, stochastic simulation is very simple: generate a sample y^s_{t-K+1} of the first missing value using the distribution P(y_{t-K+1} | y_{t-K}, ..., y_{t-K-N+1}). Using that sample and the previous measurements, generate a sample of y_{t-K+2} following P(y_{t-K+2} | y^s_{t-K+1}, ..., y_{t-K-N+2}), and so on, until a sample of each unknown is produced. Repeat this procedure S times and approximate

E(y_t | Y_{t-K,1}) ≈ \frac{1}{S} \sum_{s=1}^{S} f(y^s_{t-1}, y^s_{t-2}, ..., y^s_{t-N}),

where we have assumed that K > N. If K ≤ N, substitute measured values for y_{t-k} for k ≥ K. In this procedure, samples are simply generated by simulating the system, including the noise model.
Figure 3: (Left) The noisy logistic map (solid line), the K-step prediction using stochastic simulation (dashed line), and the K-step prediction by simply iterating the logistic map (dotted line). The prediction based on stochastic simulation converges for large K toward the mean of the time series (which is the optimal solution, since chaotic time series quickly become unpredictable for large K). (Right) The mean squared error as a function of K in K-step prediction. The iterated solution (solid line) and the approximation based on stochastic simulation with 3 (dotted line) and 20 samples (dashed line) are shown. For K = 1 (one-step prediction), the iterated system gives the optimal prediction. For K > 1, the accuracy of the prediction of the iterated solution quickly deteriorates. The error bars (± one standard deviation) are derived from 2000 independent runs.
5.1.1 Logistic Map. In the first experiment, we used the noisy logistic map y_t = 4 q_{t-1} (1 − q_{t-1}) + ε_t with 0 ≤ q_{t-1} < 1, where

q_t = \begin{cases} y_t & \text{if } 0 ≤ y_t < 1 \\ y_t − 1 & \text{if } y_t ≥ 1 \\ y_t + 1 & \text{if } y_t < 0 \end{cases}

and where ε_t is uncorrelated gaussian noise with a variance of σ² = 0.01. (Here and in the following experiments, q_t is introduced only for notational convenience, to handle the cases when additive noise results in a value of the time series for which the iteration is not defined; q_t is therefore not a "real" hidden variable.) The left panel of Figure 3 shows a realization of the time series and the predictions based on stochastic simulation and a simple iteration of the map. The right panel shows the mean squared error as a function of K, averaged over 2000 realizations.
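For this experiment, the forward-sampling procedure can be sketched as follows (our illustration, not the authors' code; the sample count S = 20 mirrors the figure):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.1                              # sqrt of the sigma^2 = 0.01 used above

def fold(y):
    """The variable q_t of the text: map y back into [0, 1).
    A single fold suffices for the noise level used here."""
    if y >= 1.0:
        return y - 1.0
    if y < 0.0:
        return y + 1.0
    return y

def k_step_iterated(y0, K):
    """The common heuristic: iterate the noise-free map K times."""
    y = y0
    for _ in range(K):
        y = 4.0 * fold(y) * (1.0 - fold(y))
    return y

def k_step_sampled(y0, K, S=20):
    """Stochastic simulation: average S independent simulated futures,
    each generated by running the system forward including the noise model."""
    total = 0.0
    for _ in range(S):
        y = y0
        for _ in range(K):
            y = 4.0 * fold(y) * (1.0 - fold(y)) + sigma * rng.standard_normal()
        total += y
    return total / S
```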
Shown are the iterated system (continuous line) and solutions following the stochastic sampling approach (dotted and dashed lines). As expected, for K = 1 the iterated solution is optimal, but for K > 1, stochastic simulation with even only a few samples is far superior. This indicates that for highly nonlinear stochastic time series, simply iterating the model K times, as is usually done in K-step prediction, is suboptimal if K > 1. Note that the K-step prediction of the extended Kalman filter, which is based on a local linearization of the nonlinearities, is identical to the iterated system (and therefore is suboptimal as well).

5.1.2 Sunspot Data. The second experiment uses records of yearly sunspot activities from the year 1700 to 1979. First, a multilayer perceptron was trained to predict the sunspot activity based on the 12 previous years of sunspot activity. The neural network had 12 inputs and one hidden layer with 8 hidden units. Following other authors, we trained on data from 1700 to 1920. We used a weight decay parameter of 0.2. (Readers who are unfamiliar with weight decay or the multilayer perceptron should consult Bishop, 1994.) After training, the mean squared error on the training set is 51.6; on test set 1 (data from 1921 to 1955) it is 161.5, and on test set 2 (data from 1956 to 1979) it is 682.0. We assumed normally distributed additive noise with a variance equal to the average error on the whole data set, σ² = 124. Figure 4 shows the sunspot data (dots) from T = 1738 to T = 1987. In the experiment, we perform K-step prediction starting from T = 1738 (i.e., T = 1738 corresponds to one-step prediction and T = 1987 corresponds to 250-step prediction). The top panel of the figure displays the prediction of the iterated system, and the second panel shows the prediction by stochastic simulation using 1000 samples. The bottom panel shows one simulated run (including the noise model). Since the last includes the simulated noise, it is noisier than the iterated system, but the noisier time series is more similar to the true time series (dots). Unlike the prediction based on the iterated system, the prediction based on stochastic simulation converges toward a constant for large K and gives the correct estimate in predicting the mean if K is large. Figure 5 shows the mean squared prediction error as a function of K. We see that for K ≫ 1, stochastic simulation is clearly superior. Recall that for K = 1, the iterated prediction is optimal.
Figure 4: Sunspot data from T = 1738 to T = 1987 (dots). The continuous lines show the K-step predicted value (K increasing with T) based on three different methods. The top plot shows the iterated system, the middle plot shows the prediction based on stochastic simulation using S = 1000 samples, and the bottom plot shows one run of the stochastic simulation.
5.2 Prediction with Missing Data. In this experiment we used the Henon map (a variation of this experiment was presented by Tresp & Hofmann, 1995), y_t = 1 − a q²_{t-1} + b q_{t-2} + ε_t with a = 1.4 and b = 0.3, where

q_t = \begin{cases} y_t & \text{if } −1.26 ≤ y_t < 1.26 \\ y_t − 1.26 & \text{if } y_t ≥ 1.26 \\ y_t + 1.26 & \text{if } y_t < −1.26 \end{cases}

and where ε_t is uncorrelated gaussian noise with a variance of σ² = 0.1. The goal is to predict y_t with different patterns of y_{t-1}, y_{t-2}, y_{t-3}, y_{t-4} missing, and y_{t-5}, y_{t-6} known. We used stochastic simulation (here, Gibbs sampling) of equation 2.5 for prediction. Figure 6 shows the results. The considerable reduction in error for the solution based on stochastic simulation compared to the heuristic solution is apparent.

Figure 5: Mean squared error for K-step prediction for the iterated system (dash-dotted line) and the prediction based on stochastic simulation (solid line) for the sunspot data. It is apparent that for K ≫ 1, the prediction based on stochastic simulation is superior. Shown are averages over all possible experiments, where in each experiment the prediction was started from a different point in time. For 1-step prediction, we used 250 different starting times, which means we averaged over 250 experiments; for 50-step prediction, we used 200 possible starting times and consequently could average over 200 experiments.

6 Conclusions

We have shown how the problem of missing and noisy data can be approached in a principled way in time-series prediction. By unfolding the time series in time, we could apply ideas and methods from the theory of Bayesian networks. We proposed approximations based on stochastic simulation. Experimental results using the logistic map, the Henon map, and the sunspot data confirmed that stochastic sampling leads to excellent predictions, which are clearly superior to simple heuristic approaches.
The main drawback of stochastic sampling is that the generated samples are often highly correlated, so a large number of samples might be required to obtain good approximations. For the problem of noisy measurements, the exact solution would require generating samples from the joint probability space of all past realizations of the time series, which is clearly infeasible. In practice, one would sample only from realizations of the time series up to a reasonably chosen time window into the past, which, as a drawback, leads to suboptimal solutions even with a large number of samples. In this article, we focused on univariate time-series prediction. The results can easily be extended to multivariate time series (see the appendix).

Figure 6: Time-series prediction with missing data. The patterns of the missing data are indicated using X for known values, 0 for unknown values, and "_" for the value to be predicted. For example, XX00_ indicates that y_{t-4} and y_{t-3} are known and that y_{t-2} and y_{t-1} are missing. y_{t-5} and y_{t-6} are always known. The goal is to predict y_t using either stochastic sampling (left bars) or a heuristic where predicted values are substituted for the missing data (right bars). The height of the bars indicates the squared prediction error averaged over 1000 experiments. The error bars show ± their standard deviation. For stochastic sampling, we used 200 samples for each prediction. Except for one-step prediction (XXXX_), the stochastic sampling solution is significantly better than the heuristic.

Appendix: Multivariate Nonlinear Time Series

The results can easily be generalized to general nonlinear multivariate models. It is convenient to switch to a state-space representation in which y_t ∈ ℝ^{D_y} is a D_y-dimensional state-space vector containing all relevant states of all time series involved. Typically, y_t will be the present and past realizations of all time series involved, up to a time window in the past. The nonlinear state-space model is y_t = f(y_{t-1}) + ε_t, where ε_t is a D_y-dimensional vector of possibly correlated noise with probability density P_ε. We assume that we have access to a D_z-dimensional
measurement vector z_t ∈ ℝ^{D_z} with z_t = g(y_t) + δ_t, where δ_t is measurement noise with probability density P_δ. The joint probability density is then

P(Y_{t,1}, Z_{t-1,1}) = P(y_1) \prod_{l=2}^{t} P_ε(y_l − f(y_{l-1})) \prod_{l=1}^{t-1} P_δ(z_l − g(y_l)),   (A.1)

with Z_{t-1,1} = {z_1, ..., z_{t-1}} and Y_{t,1} = {y_1, ..., y_t}. Now

E(y_t | Z_{t-1,1}) = \int f(y_{t-1}) P(Y_{t-1,1} | Z_{t-1,1}) \, dY_{t-1,1},

where P(Y_{t-1,1} | Z_{t-1,1}) = P(Y_{t-1,1}, Z_{t-1,1}) / P(Z_{t-1,1}) is obtained from equation A.1.

Acknowledgments

This work was supported by grant 01 IN 505 A9 from the Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie.

References

Ahmad, S., & Tresp, V. (1993). Some solutions to the missing feature problem in vision. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Neural information processing systems, 5 (pp. 393–440). San Mateo, CA: Morgan Kaufmann.
Bar-Shalom, Y., & Li, X.-R. (1993). Estimation and tracking. Boston: Artech House.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Bishop, C. M. (1994). Neural networks for pattern recognition. New York: Oxford University Press.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 605–643.
Jensen, F. V. (1996). An introduction to Bayesian networks. New York: Springer-Verlag.
Kadirkamanathan, V., & Niranjan, M. (1991). Nonlinear adaptive filtering in nonstationary environments. In ICASSP 91.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME J. Basic Eng., 8, 35–45.
Lewis, F. L. (1986). Optimal estimation with an introduction to stochastic control theory. New York: Wiley.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Neuneier, R., Hergert, F., Finnoff, W., & Ormoneit, D. (1994). Estimation of conditional densities: A comparison of neural network approaches. In Proc. of ICANN 94 (pp. 689–692). Sorrento.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3, 253–264.
Singhal, S., & Wu, L. (1989). Training multi-layer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Tresp, V., & Hofmann, R. (1995). Missing and noisy data in nonlinear time-series prediction. In F. Girosi, J. Makhoul, E. Manolakos, & E. Wilson (Eds.), Neural networks for signal processing 5 (pp. 1–10). New York: IEEE.
Tresp, V., & Hofmann, R. (1997). Missing and noisy data in nonlinear time-series modeling. Unpublished manuscript.
Weigend, A. S., & Gershenfeld, N. (Eds.). (1994). Time-series prediction. Reading, MA: Addison-Wesley.
Received July 25, 1996; accepted July 10, 1997.
LETTER
Communicated by Michael Jordan
Issues in Bayesian Analysis of Neural Network Models
Peter Müller
Duke University, Durham, NC 27708-0251, U.S.A., and Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid, Spain
David Rios Insua
Department of Artificial Intelligence, Madrid Technical University, 28660 Madrid, Spain, and CNR-IAMI, 20131 Milano, Italy
Stemming from work by Buntine and Weigend (1991) and MacKay (1992), there is a growing interest in Bayesian analysis of neural network models. Although conceptually simple, this problem is computationally involved. We suggest a very efficient Markov chain Monte Carlo scheme for inference and prediction with fixed-architecture feedforward neural networks. The scheme is then extended to the variable architecture case, providing a data-driven procedure to identify sensible architectures.

1 Introduction

Neural networks (NN) constitute the central theme of a huge amount of recent research. Introductions from the physical (Müller & Reinhardt, 1990), computational (Beale & Jackson, 1990), mathematical (Amari, 1993), and statistical (Cheng & Titterington, 1994; Stern, 1996) points of view are available. Recently, Wang (1995) has suggested the importance of incorporating human knowledge in NN models to improve their performance. This naturally leads to efforts to model this knowledge through prior distributions over the parameters. This article discusses these issues, exploring the potential of Bayesian ideas in the analysis of NN models. We propose methods to incorporate priors on the size of the hidden layer, in particular priors that could favor smaller-size networks. Buntine and Weigend (1991) and MacKay (1992, 1995) have provided frameworks for implementing Bayesian inference based on gaussian approximations, and Neal (1993, 1996) has applied hybrid Monte Carlo methods. Ripley (1993) and Cheng and Titterington (1994) have dwelled on the power of these ideas, including interpretation and architecture selection. MacKay (1995), Neal (1996), and Bishop (1995) provide excellent recent reviews of and elaborations on Bayesian approaches to NNs. We concentrate on approximation problems, though many of our suggestions can be translated to other areas. For those problems, NNs are viewed
as highly nonlinear (semiparametric) approximators, with parameters typically estimated by least squares. Applications of interest for practitioners comprise nonlinear regression, stochastic optimization, and regression metamodels for simulation output. Our main focus is the computational aspects. Our contributions include an efficient, novel Markov chain Monte Carlo scheme and its extension to a scheme for handling a variable architecture model and combining NNs with more traditional models, in our case linear regression. This scheme allows for identification of promising architectures and hence provides a step forward in the problem of NN architecture choice. In section 2, after introducing our basic model, we introduce and discuss our Markov chain Monte Carlo scheme. This leads us to study, in section 3, variable architecture models and their combination with linear regression. Several examples illustrate the discussion.

2 Posterior Analysis of Feed-Forward Neural Networks

Let (x_1, x_2, ..., x_p) be explanatory variables for a response y, which, for now, we assume to be real valued. A feedforward neural network (FFNN) with activation function ψ, p input units, one hidden layer with M hidden nodes, and one output node is defined by

ŷ(x) = \sum_{j=1}^{M} β_j ψ(x'γ_j + δ_j),   (2.1)

with β_j ∈ ℝ, γ_j ∈ ℝ^p, M ∈ ℕ. The terms δ_j are designated biases and may be assimilated into the rest of the γ_j vector if we consider an additional input with constant value one, say x_0 = 1. Interest in these models stems from results, by Cybenko (1989) and others, suggesting them as universal approximators for appropriate choices of the function ψ. In most of the article, we shall assume that they are logistic functions. We shall undertake Bayesian analyses of the above model.

2.1 The Model. We view the above model as an implementation of a nonlinear regression of a response y on covariates x = (x_0, x_1, ..., x_p):

y_i = \sum_{j=1}^{M} β_j ψ(x_i'γ_j) + ε_i,  i = 1, ..., N,
ε_i ∼ N(0, σ²),  ψ(η) = exp(η)/(1 + exp(η)).   (2.2)

To undertake Bayesian inference with this model, we shall use the following prior:

β_j ∼ N(μ_β, σ_β²),  γ_j ∼ N(μ_γ, S_γ),  j = 1, ..., M.   (2.3)
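A minimal sketch of this likelihood model (ours, not the authors' code; the uniform covariate distribution is an assumption made purely for illustration), which also generates data of the kind used in Example 2 below:

```python
import numpy as np

rng = np.random.default_rng(3)

def psi(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def ffnn(x, beta, gamma):
    """yhat(x) = sum_j beta_j psi(x' gamma_j), with x including a constant x_0 = 1."""
    return psi(gamma @ x) @ beta

# Simulating data as in Example 2 below (M = 2 hidden nodes, sigma = 0.1):
gamma = np.array([[2.0, -1.0], [1.0, 1.5]])   # rows are (gamma_{j0}, gamma_{j1})
beta = np.array([20.0, 10.0])
x1 = rng.uniform(-3.0, 3.0, 100)              # assumed covariate distribution
X = np.column_stack([np.ones(100), x1])       # prepend the constant input
y = np.array([ffnn(x, beta, gamma) for x in X]) + 0.1 * rng.standard_normal(100)
```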
When there is nonnegligible uncertainty about the prior (hyper-)parameters, we may complete the prior model with a hyperprior over them. We shall use the following: μ_β ∼ N(a_β, A_β), μ_γ ∼ N(a_γ, A_γ), σ_β^{-2} ∼ Gamma(c_b/2, c_b C_b/2), S_γ^{-1} ∼ Wish(c_γ, (c_γ C_γ)^{-1}), and σ^{-2} ∼ Gamma(s/2, sS/2). The particular choice of normal, gamma, and inverse Wishart distributions is for technical convenience. Similar hyperpriors are fairly common in Bayesian modeling (see, e.g., Lavine & West, 1992). In general, posterior inference is reasonably robust with respect to the choice of hyperpriors (see, e.g., Berger, 1990). However, if available prior information suggests different hyperpriors and hyperprior parameters, the model should be adjusted appropriately. MacKay (1995) and Neal (1996) emphasize the role of hyperparameters in NN models. We use an informative prior probability model because of the meaning and interpretation of the parameters. For example, the β_j's should reflect the order of magnitude of the data y_i. Typically positive and negative values for β_j would be equally likely, calling for a symmetric prior around a_β = 0 with a standard deviation reflecting the range of plausible values for y_i. Similarly, a range of reasonable values for the logistic coefficients γ_j will be determined by the meaning of the data y_i being modeled, mainly to address smoothness issues. The appendix describes our specific choices in the examples.

2.2 A Markov Chain Monte Carlo Method for FFNNs. We assume we have data D = {(x_1, y_1), ..., (x_N, y_N)}. Let β = (β_1, ..., β_M) denote the network weights and γ = (γ_{10}, γ_{11}, ..., γ_{1p}, ..., γ_{M0}, ..., γ_{Mp}) the logistic slopes, and let ν combine all the hyperparameters, ν = (μ_β, σ_β, μ_γ, S_γ, σ). Let θ = (β, γ, ν) be the full parameter vector. For inference purposes, we are interested in computing the posterior distribution

p(β, γ, ν | D) = \frac{p(β, γ, ν) p(D | β, γ, ν)}{\int p(β, γ, ν) p(D | β, γ, ν) \, dβ \, dγ \, dν},

and, specifically, the marginal posterior p(β, γ | D) = \int p(β, γ, ν | D) dν, perhaps summarized through moments and regions. Above, p(D | β, γ, ν) designates the likelihood. We shall be mainly interested, for predictive purposes, in the predictive distribution,

p(y_{N+1} | D, x_{N+1}) = \int p(y_{N+1} | β, γ, ν, x_{N+1}) p(β, γ, ν | D) \, dβ \, dγ \, dν,

possibly summarized through moments or probability regions. Here p(y | β, γ, ν, x) is the conditional N(\sum_j β_j ψ[x'γ_j], σ²) distribution of the response y given parameters (β, γ) and covariate x. One possibility to undertake the computations would be to appeal to several normal approximations (see Buntine & Weigend, 1991; MacKay, 1992; Thodberg, 1996, for examples). Assessment of these and other techniques
for posterior inference problems may be seen in Robert (1994, chapter 9). In the specific context of NN models, posterior inference for these schemes may be misled by local modes of the posterior distribution. Buntine and Weigend (1991) mitigate this by finding several modes and basing the analysis on weighted mixtures of the corresponding normal approximations. Of course, we return to the same problem since we are probably leaving out undiscovered local modes. An alternative view is argued by MacKay (1995): inference from such schemes is best considered as approximate posterior inference in a submodel defined by constraining the parameters to a neighborhood of the particular local mode. Depending on the emphasis of the analysis, this might be reasonable, especially if in a final implementation our aim is to set the parameters at specific values. However, we prefer to propagate the uncertainty in the parameters, since this allows better predictions (see Raftery, Madigan, & Volinsky, 1996). To do this, we appeal to Markov chain Monte Carlo (MCMC) methods to implement posterior inference. The essential idea is to obtain by computer simulation a sample from the posterior and base inference on that sample by, for example, replacing posterior expectations with sample means over the simulated posterior sample. The difficulty resides in simulating a sample from the posterior p(θ|D). The rationale of MCMC is to consider a Markov chain {θn } with state θ and having p(θ |D) as stationary distribution. Tierney (1994) describes various ways of defining such chains, including Metropolis, Gibbs, and independence chains. The strategy is to start with arbitrary values of θ, let the Markov chain run until it has practically reached convergence, say after T iterations, and use the next k observed values of the chain as an approximate posterior sample {θ1 , . . . , θk }. MacKay (1995) implements an MCMC method for neural networks based on BUGS (Spiegelhalter, Thomas, & Gilks, 1994), a program for Bayesian Inference using the Gibbs sampler. Neal (1996) proposes using a hybrid Monte Carlo algorithm merging conventional Hastings-Metropolis chains with sampling techniques based on dynamic simulation. Both authors warn against the potential inefficiency of straightforward implementation of MCMC methods in Bayesian analysis of NN models. Also, Besag and Green (1993), albeit in a different application context, dwell on the special care required when using MCMC in multimodal problems. We introduce here a hybrid Markov chain Monte Carlo scheme. The method is hybrid in the sense that we sample from the posterior conditionals (steps 3 and 4 in our algorithm) when they are available, and use Metropolis steps otherwise (step 2). To fight potential inefficiencies due to multimodality, our method has two additional features. First, whenever possible, we integrate out some of the parameters (the weights β) by partial marginalization. Second, we update some of the parameters in blocks (specifically, resample jointly the weights β). These two features allow for fast and effective mixing over the various local modes in the posterior distribution. Combined with model augmentation to a variable architecture,
as described in section 3, this leads to a practically useful MCMC scheme for NN analyses. The key observation in our scheme is that given the currently imputed values of the γ's, we actually have a standard hierarchical normal linear model (Lindley & Smith, 1971; Bernardo & Smith, 1994). On one hand, this will allow us to sample easily from the posterior marginals of the weights β and the hyperparameters, given the γ's. On the other hand, this allows us to marginalize the model (represented in equations 2.2 and 2.3) with respect to β_j, j = 1, ..., M, to obtain the marginal likelihood p(D|γ, ν). This computation will be instrumental in the Metropolis step (step 2) of our algorithm. The following lemma provides the marginalized likelihood, where for the sake of simplified notation, we shall omit dependence on the hyperparameters.

Lemma 2.1. Let z_{ij} = z_{ij}(γ) = ψ(x_i'γ_j), Z = (z_{ij})_{i=1,...,N; j=1,...,M}, 1 = (1, ..., 1)' the M-vector of ones, A = Z'Z/σ², ρ = Z'y/σ², C = (1/σ_β²) I, δ = (μ_β/σ_β²) 1. Let m_b(γ) = (A + C)^{-1}(ρ + δ) and S_b(γ) = (A + C)^{-1}. Then

p(D|γ) = \frac{p[β = m_b(γ)]}{p[β = m_b(γ) | y, γ]} \prod_{i=1}^{N} p[y_i | β = m_b(γ), γ]
       = p[β = m_b(γ)] \, |S_b(γ)|^{1/2} \prod_{i=1}^{N} p[y_i | β = m_b(γ), γ].

Proof. Conditional on γ, the model in equations 2.2 and 2.3 becomes a normal linear regression model. The posterior p(β|D, γ) takes the form of a multivariate normal distribution N[m_b(γ), S_b(γ)], with posterior moments m_b(γ) and S_b(γ) given, for example, in Bernardo and Smith (1994). By Bayes' theorem, p(β|D, γ) = p(β) \prod_{i=1}^{N} p(y_i|β, γ) / p(D|γ). Substituting β = m_b(γ) in the last equation, we obtain the expression for p(D|γ).

Our hybrid, blocking, partially marginalized MCMC algorithm for inference and prediction with FFNNs is as follows:

1. Start with θ equal to some initial guess (for example, the prior means). Until convergence is achieved, iterate through steps 2 through 4.

2. Given current values of ν only (marginalizing over β), replace γ by Metropolis steps: For each γ_j, j = 1, ..., M, generate a proposal γ̃_j ∼ g_j(γ_j), with g_j(γ_j) described below. Compute

a(γ_j, γ̃_j) = \min\left[1, \frac{p(D|γ̃, ν) \, p(γ̃|ν)}{p(D|γ, ν) \, p(γ|ν)}\right],   (2.4)
where γ̃ = (γ_1, ..., γ_{j-1}, γ̃_j, γ_{j+1}, ..., γ_M). With probability a(γ_j, γ̃_j), replace γ_j by the new candidate γ̃_j; otherwise leave γ_j unchanged. Use Lemma 2.1 to evaluate p(D|γ, ν). An alternative Metropolis step, which updates each coordinate of γ_j separately, is described in section 3.2. For the probing distribution g_j(·), we use a multivariate normal N(γ_j, c²C_γ) with c = 0.1. Conceptually, any probing distribution that is symmetric in its arguments, that is, g(γ̃_j|γ_j) = g(γ_j|γ̃_j), would imply the desired posterior as stationary distribution of the corresponding Markov chain. For practical implementation, a probing distribution with acceptance rates not too close to zero or one is desirable. For a specialized setup, Gelman, Roberts, and Gilks (1996) showed that acceptance rates of around 25% are optimal. In the examples, we found appropriate values for c by trying a few alternative choices until we achieved acceptance rates in this range.

3. Given current values of (γ, ν), generate new values for β by a draw from the complete conditional p(β|γ, ν, D). This is a multivariate normal distribution with moments described in Lemma 2.1.

4. Given current values of (β, γ), replace the hyperparameters by a draw from the respective complete conditional posterior distributions: p(μ_β|β, σ_β) is a normal distribution, p(μ_γ|γ, S_γ) is multivariate normal, p(σ_β^{-2}|β, μ_β) is a gamma distribution, p(S_γ^{-1}|γ, μ_γ) is Wishart, and p(σ^{-2}|β, γ, y) is gamma, as corresponds to a normal linear model (see Bernardo & Smith, 1994).

The proof of the convergence of this chain follows from arguments in Tierney (1994). To judge convergence in practice, we rely on both sampled paths of parameters of interest and a convergence diagnostic proposed by Geweke (1992), as illustrated in examples 3 and 4. Once we have an approximate posterior sample {θ_1, ..., θ_k}, we may undertake various posterior and predictive tasks as usual. For example, predictive means f(x) = E(y_{N+1}|x_{N+1} = x, D) can be evaluated via

f̂(x) = Ê(y_{N+1}|x_{N+1}, D) = \frac{1}{k} \sum_{t=1}^{k} E(y_{N+1}|x_{N+1}, θ = θ_t).
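A schematic rendering (ours, not the authors' implementation) of the partial marginalization behind step 2: the function below evaluates log p(D|γ) in the form printed in Lemma 2.1, up to constants that cancel in the Metropolis ratio of equation 2.4.

```python
import numpy as np

def psi(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def marginal_loglik(gamma, X, y, sigma2, mu_b, sigma2_b):
    """log p(D | gamma) with the weights beta integrated out (Lemma 2.1).

    gamma: (M, p+1) logistic coefficients; X: (N, p+1) inputs incl. constant 1;
    sigma2, mu_b, sigma2_b: sigma^2, mu_beta, sigma_beta^2 (held fixed here).
    Conditional on gamma the model is linear in beta, so this is analytic."""
    Z = psi(X @ gamma.T)                         # (N, M) hidden-layer design matrix
    M = Z.shape[1]
    A = Z.T @ Z / sigma2
    C = np.eye(M) / sigma2_b
    S_b = np.linalg.inv(A + C)                   # posterior covariance of beta | gamma
    m_b = S_b @ (Z.T @ y / sigma2 + (mu_b / sigma2_b) * np.ones(M))
    resid = y - Z @ m_b
    # log p(beta = m_b) + 0.5 log|S_b| + sum_i log p(y_i | beta = m_b, gamma)
    log_prior = -0.5 * M * np.log(2 * np.pi * sigma2_b) \
                - 0.5 * np.sum((m_b - mu_b) ** 2) / sigma2_b
    log_like = -0.5 * len(y) * np.log(2 * np.pi * sigma2) \
               - 0.5 * np.sum(resid ** 2) / sigma2
    return log_prior + 0.5 * np.linalg.slogdet(S_b)[1] + log_like
```

A Metropolis update of γ_j then amounts to proposing γ̃_j ∼ N(γ_j, c²C_γ) and accepting with probability a(γ_j, γ̃_j), computed from two evaluations of this function plus the prior terms for γ.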
We illustrate some of these calculations in the examples below. Example 1: Galaxy Data. We try to relate velocity (yi ) and radial position (xi1 ) of galaxy NGC7531 at 323 different locations (Buta, 1987). For this example, we use only the first 80 observations. The data are shown in Figure 1. Radial positions are centered and scaled to have zero mean and unit variance, and velocities have been shifted by a constant offset of 1400. A constant
Figure 1: Example 1. Estimated regression curve f̂(x) (left panel) and a few draws from the posterior distribution on the regression curve (right panel) induced by the posterior distribution p(θ|D) on the parameters. In both panels, the dots show the data points.
Figure 2: Example 1. p(β_1|D) and p(γ_{11}|D) estimated by MCMC.
covariate x_{i0} adds an intercept to the logistic regression terms ψ(x'γ_j) of the NN model. For this problem, we fit an FFNN with three hidden nodes and prior moments as described in the appendix. Note that we did not use a hierarchical model. Figure 2 shows some aspects of the posterior inference. The two panels show the estimated marginal posterior distributions for β_1 and γ_{11}, showing multimodality of the posteriors, a feature that hinders the use of other approximate integration methods based on normal approximations (see Bishop, 1995). As discussed in section 2.3, an order constraint on the γ_j was used to avoid nonidentifiability.
We also illustrate predictive inference. Figure 1 (left panel) plots the fitted curve f̂(x). In addition to estimating the nonlinear regression curve, the MCMC allows a complete probabilistic description of the involved uncertainties. Figure 1 (right panel), for example, visualizes the posterior distribution on the nonlinear regression curve induced by the posterior distribution p(θ|D). We will revisit this example (after Example 2) to illustrate issues of multimodality in neural networks.

2.3 Posterior Multimodality. Multimodality issues have pervaded discussions of classical analysis of NN models (see Ripley, 1993). They are also important issues to be considered when implementing Bayesian inference in NNs (MacKay, 1995), since they affect the choice of the integration scheme and illuminate the discussion on model (architecture) selection. This issue of architecture choice has received relatively little attention in the literature. Besides inherent multimodality due to the nonlinearity of FFNNs, multiple modes can occur for at least two more reasons related to ambiguities in the parameterization.

First, multiple modes occur because prior and likelihood, and hence the posterior, are invariant with respect to arbitrary relabeling of the nodes. This problem is easily avoided by introducing an arbitrary ordering of the nodes. For example, we could impose the constraint that the γ_{jp} be ordered, that is, γ_{1p} ≤ γ_{2p} ≤ ··· ≤ γ_{Mp}. We used this constraint in the examples. Note that the prior p(γ) under the constraint is a factor M! larger than it would be under the same prior probability model without the constraint. The implementation of the Markov chain Monte Carlo scheme in section 2.2 may be simplified by the following observation. Define an alternative probing distribution γ̃'_j ∼ g'_j(·) by first generating a proposal γ̃_j ∼ g_j(γ̃_j) without regard to the order constraint. If γ̃_j violates the order constraint, permute the indices appropriately to get γ̃'; otherwise γ̃' = γ̃. Also, instead of scanning over j = 1, ..., M when updating γ_j, randomly choose j ∈ {1, ..., M}. The resulting probing distribution on the constrained parameter space still defines a Metropolis step with a symmetric probing distribution. The implementation can be even further simplified by doing the reordering only before saving or printing the imputed values (γ_1, ..., γ_M). For the evaluation of equation 2.4 and the updating of other parameters, the ordering is irrelevant since the posterior is invariant under permutations of the indices. In this case the random scanning (i.e., randomly choosing the index of the next γ_j to be updated) is not required.

Second, a more serious source of multimodality is the duplication of terms in the network and the inclusion of irrelevant nodes. Node duplication occurs when multiple hidden nodes with practically the same γ parameters are included. An irrelevant node is a hidden node j with practically zero hidden-to-output layer weight β_j. We may see both as a manifestation of model mixing as follows. Denote with M_0 the fixed architecture model (see equation 2.2) with M* hidden nodes. Denote with M_M the fixed architecture
model (see equation 2.2) with M distinct hidden nodes and nonzero weights β_j. Model M_0 contains M_M, M = 1, ..., M*, as special cases by setting, for example, β_i = 0 or γ_i = γ_M for i = M+1, ..., M*. While exact equality of γ's or β_j = 0 has zero posterior probability because of the continuous priors we have adopted over the parameters, approximate equality can have considerable posterior probability. In fact, denote with p_M(θ|D) the posterior distribution under model M_M. The posterior distribution p(θ|D) in model M_0 can be rewritten as a mixture \sum_m Pr(M = m|D) p_m(θ|D). If the terms of this mixture are spiked and well enough separated, p(θ|D) exhibits local modes corresponding to the submodels, with additional multimodality entering through the different ways of nesting M_M in M_0 (for example, nodes could be duplicated, or some of the weights β_j could be set to zero). For demonstration, we generated data from an NN model with two distinct hidden nodes.

Example 2: Simulated Data. We simulated y_1, ..., y_N from equations 2.2 and 2.3 with M = 2, γ_1 = (γ_{10}, γ_{11}) = (2, −1), γ_2 = (γ_{20}, γ_{21}) = (1, 1.5), β = (20, 10), N = 100, and σ = 0.1, and estimated models M_3 and M_2. The marginal posterior p(γ_{21}|y), shown in Figure 3, has at least three local modes. The first local mode (around −1) and the third mode (around 1.5) are due to model M_2 being contained in M_3 by duplicating node 1 in node 2 (first mode) or by duplicating node 3 (third mode). Also, under M_3, p(γ_{21}|y) shows a local mode around 0. This is due to nesting model M_2 in M_3 by setting β_2 = 0. Conditional on β_2 = 0, the conditional posterior for γ_{21} would coincide with the prior (i.e., centered around the prior mean zero).

Example 1 (continued). Figure 4 shows some more aspects of the posterior inference in Example 1 relating to multimodality due to node duplication. The patterns are similar to the simulated data. However, even under model M_2, we still see some multimodality, some of which could be due to model M_1.

From a predictive point of view, node duplication is no issue. If the focus of the analysis is prediction (for example, fitting a nonlinear regression surface), one could ignore the possibility of node duplication. However, it is important to be aware of the implications for the particular estimation scheme: routine application of any numerical posterior integration scheme based on approximate posterior normality and unimodality would be hindered. This includes widely used algorithms like direct normal approximation, Laplace integration, importance sampling, and iterative gaussian quadrature. If used, inference will only be applicable to the local mode (i.e., the particular submodel) on which the normal approximation was based. Node duplication will accentuate the problem of multiple posterior modes, and hence hinder the efficiency of MCMC methods, especially random walk
Figure 3: Example 2 (simulated). Posterior distributions under model M_3 (top row) and under model M_2 (bottom row). The panels on the left show the marginal posterior distributions p(γ_{11}|D) (solid line), p(γ_{21}|D) (dotted line), and p(γ_{31}|D) (dashed line in the top left panel). The posterior multimodality can be clearly seen in the marginal posterior distributions under M_3. Note how p(γ_{21}|D) takes the form of a mixture of p(γ_{11}|D), the prior p(γ_{21}), and p(γ_{31}|D). This multimodality vanishes under M_2. Panels on the right show the joint posterior distribution p(γ_{11}, γ_{21}|D) under M_3 (top) and M_2 (bottom). The line indicates the 45-degree line γ_{11} = γ_{21}. All multimodality is removed by constraining to M = 2 nodes. The data were generated from a model with two hidden nodes. In general, some multimodality might be left due to M_1.
Metropolis schemes, which could easily get trapped in a particular local mode. This is of particular concern since most commonly used convergence diagnostics are based on analyzing the simulation output and could falsely diagnose practical convergence. As a consequence, we are interested in removing ambiguities in the pa-
759
• •••• ••••• • • ••••••• • • • •• ••••••••• •• • • • ••• ••• ••• •• • ••• ••• • ••••••••••••••••••••••••••••••••••••••• ••••••••••••••••••••••••••••••••••••••••••••••• •••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••••• • •••••••• • • • • • • • • •• •• •••••• ••••••••••••••••••••••••••••••••••••• • ••••••••••• •• • ••• ••• • •••• •• •••••• ••• • • • ••• ••••••••• ••••••••• • ••• • •• • • • •••••••••• ••••••••••••••••••••••••••••••••••• ••• • ••••••••••••• •• •• •••••• • • • • •• •••••••• •••••• • • •• ••• ••••••••••••••••••••••••••••••••••••••••••••••••• • •• • ••••••••••••••••••••••••••••••••••••• • ••••••••••••••••••••••••••• • •• ••••• • •• •
0.0
-10
0.05
-5
0.15
C21 0
5
0.25
10
Issues in Bayesian Analysis of Neural Network Models
-5
0 Cj1
5
10
-15
-10
-5 C11
0
4.5
••• •• •• • •• • • • • ••••• •••• • ••••••••• •••••• •••••••••• ••••••••••• •••• ••••••••• •
4.0
0.6
C21 3.5
0.4
3.0
•• • •• • •• ••••• ••••••••• • ••• • •• •• •••••••••••••••• ••• • •• • •••••••••••••••••••••• ••••• •••••••••••••••••• •••••• •••••••• •••••••••••••••••••• •••• ••••••••••••••••••••••• •••••• ••••••••••••••••••••• • • • • • • • • • • • • • • • • • • • • • • • •• • • • • •• • • • • • • • • • • • • • • • • ••••• • •••••••••••••••• •••• ••• • •• •• ••••••••••••••••••••••••• •••••••••••••••••• ••••••••••••••••••••••••• •• •••• •••••••••••••••• ••••••••••••••••••••••••••••••••••••• ••• • •• ••••
2.0
2.5
0.2 0.0
5
• • ••• • •• ••
5.0
-10
-10
-5
0 Cj1
5
10
-15
-10
-5 C11
0
5
Figure 4: Example 1. Posterior distributions under model M3 (top row) and M2 (bottom). The features are very similar to the posterior distribution for the simulated data in the previous figure. In particular, note that p(γ21 |D) is a mixture of p(γ11 |D), p(γ31 |D), and the prior p(γ21 ). Also, the second mode of p(γ11 |D) duplicates p(γ31 |D), indicating positive posterior probability for model M1 using node 3 only. As in the simulated example, most of the multimodality from the posterior under M3 is lost under M2 . Only some minor mode due to model M1 remains. It can be seen as a secondary mode in p(γ11 |D), as well as a secondary mode close to the 45-degree line in p(γ11 , γ21 |D).
Often multimodality can be partly removed by picking a lower-order model; that is, when symptoms of node duplication are noticed in the posterior distribution, one could consider models with M < M* nodes, as shown in Figures 3 and 4. This discussion leads us to issues of model (architecture) selection. We believe that a conceptually clear and straightforward approach is to include
explicitly the number of hidden nodes M as a parameter in the model, that is, use variable architecture NN models. This is fairly natural, since, apart from simulated examples, we do not expect a "true" M for the NN model; hence we need to model uncertainty about it. There remains the problem of providing procedures for modeling it and a scheme for estimating the model. This is the topic of section 3.

3 Variable Architecture NN Models

Our considerations in section 2 lead us to contemplate M as another parameter, for two main reasons. First, a random M with a prior distribution favoring smaller values reduces posterior multimodality, as discussed in section 2.3. Although in principle posterior multimodality does not prevent valid Bayesian inference, we consider it good Bayesian modeling to avoid parameterizations leading to posterior multimodality. Second, the marginalization over βj (see Lemma 2.1) requires the inversion of matrices of dimension related to M. Avoiding unnecessarily large M is critical to reduce computational effort.

We provide here a scheme for modeling and estimating uncertainty about M, therefore dropping the assumption of a fixed, known architecture. We actually allow the model to "select" the size of the hidden layer by including indicators dj, with dj = 1 if a node is included and dj = 0 if a node is dropped. The extension of our algorithm to this case will allow the identification of architectures supported by the data. We generalize the fixed architecture model in yet another direction, by including a linear regression term x'λ to model level and linear effects efficiently. Typically this would tend to reduce the size of the network. We will always use d1 = 1, assuming only problems that cannot be described by a linear model alone would be analyzed by an NN model. This corresponds to a model-building strategy based on blocks, where the linear part models linear aspects of the process of interest and the NN part models nonlinear aspects. Of course, the constraint d1 = 1 is not required and could be removed if desired. We included it in our implementation for technical convenience, to avoid separate code for the special case M = 0. Neal (1996) suggests an alternative approach based on choosing a big enough number M* of hidden nodes and an appropriate prior. Our model comprises M = M* as a special case, but for the aforementioned reasons, we favor the approach with random M.

3.1 The Model. The model we use is:

\[
y_i = x_i'\lambda + \sum_{j=1}^{M^*} d_j \beta_j \psi(x_i'\gamma_j) + \epsilon_i, \qquad i = 1, \ldots, N,
\]
\[
\epsilon_i \sim N(0, \sigma^2), \qquad \psi(\eta) = \exp(\eta)/(1 + \exp(\eta)). \tag{3.1}
\]
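To make equation 3.1 concrete, here is a minimal sketch (our illustration, not the authors' code; all names are ours) that evaluates the mean function for one parameter draw, with the indicators masking out dropped nodes:

```python
import numpy as np

def psi(eta):
    # Logistic activation, as in equation 3.1.
    return np.exp(eta) / (1.0 + np.exp(eta))

def mean_function(X, lam, beta, gamma, d):
    """Mean of y_i under equation 3.1.

    X:     (N, p+1) design matrix, first column of ones for the intercept.
    lam:   (p+1,)   coefficients of the linear regression term x'lambda.
    beta:  (Mstar,) output weights of the hidden nodes.
    gamma: (Mstar, p+1) input weights of the hidden nodes.
    d:     (Mstar,) 0/1 indicators; d_j = 0 drops node j entirely.
    """
    hidden = psi(X @ gamma.T)             # (N, Mstar) node activations
    return X @ lam + hidden @ (d * beta)  # dropped nodes contribute nothing

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
mu = mean_function(X, lam=np.array([0.5, 1.0, -1.0]),
                   beta=np.array([2.0, -1.5, 3.0]),
                   gamma=rng.normal(size=(3, 3)),
                   d=np.array([1, 1, 0]))  # M* = 3 potential nodes, M = 2 active
```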
We include at most M* hidden nodes, allowing for architectures from one hidden node (when only d1 = 1) to M* hidden nodes (when di = 1 for all i). Again, we recommend including an order constraint to avoid trivial posterior multimodality due to permutations of indices, γ1p ≤ γ2p ≤ · · · ≤ γMp. The prior we introduce over the indicators is:

\[
\Pr(d_j = d \mid d_{j-1} = 1) = \begin{cases} 1 - \alpha & \text{for } d = 0 \\ \alpha & \text{for } d = 1 \end{cases} \qquad j = 1, \ldots, M^*, \tag{3.2}
\]

and dj = 0 if dj−1 = 0, with d1 = 1.
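The structure of this prior is easy to simulate; the following sketch (ours, purely illustrative) draws the indicator vector and recovers the truncated geometric distribution on M discussed below, using α = 0.2 and M* = 20, the values reported in the appendix for Examples 3 and 4:

```python
import numpy as np

def sample_indicators(alpha, Mstar, rng):
    # d_1 = 1 by construction; node j+1 can be included only if node j is,
    # so the first zero switches off all later indicators (equation 3.2).
    d = np.zeros(Mstar, dtype=int)
    d[0] = 1
    for j in range(1, Mstar):
        if rng.random() >= alpha:
            break
        d[j] = 1
    return d

rng = np.random.default_rng(1)
alpha, Mstar = 0.2, 20
M = np.array([sample_indicators(alpha, Mstar, rng).sum() for _ in range(100000)])
# Empirical p(M = m) matches alpha^(m-1) * (1 - alpha) for m < M*,
# with the remaining mass alpha^(M*-1) on m = M*.
print(np.bincount(M, minlength=Mstar + 1)[1:6] / len(M))
```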
Observe that the indicators dj are ordered decreasingly, so that dj = 0 implies dk = 0 for k > j. The prior distribution for the indicators dj allows a priori any architecture with M ≤ M* hidden nodes. It actually implies a geometric prior with parameter α, truncated at M*, on the size M of the hidden layer. This prior enables efficient implementation and favors parsimony, in the sense of supporting architectures with a smaller number of hidden nodes. The prior over the other network parameters is similar to that of the model in equations 2.2 and 2.3, with an additional prior for α:

\[
\beta_j \sim N(\mu_\beta, \sigma_\beta^2), \qquad \lambda \sim N(\mu_\beta, \sigma_\beta^2), \qquad \gamma_j \sim N(\mu_\gamma, S_\gamma), \qquad \alpha \sim \mathrm{Beta}(a_\alpha, b_\alpha). \tag{3.3}
\]
Finally, we complete the model with the same hyperpriors on (µβ, µγ, σβ, Sγ, σ²) as we did with the fixed architecture model.

3.2 An MCMC Method for the Variable Architecture Case. The computational scheme is a natural extension of that in section 2.2, yet another advantage of favoring an MCMC approach to NN modeling. The simulation scheme outlined for inference in the fixed architecture model requires only minor modifications to be used for the models in equations 3.1, 3.2, and 3.3. Conditional on currently imputed values for the indicators dj, the model reduces to the fixed-architecture one. Given other model parameters, the conditional posterior probabilities for dj = 0 and dj = 1 are easily computed. Denote with M = Σ_{j=1}^{M*} dj the number of hidden nodes currently included in the model. By definition of the indices, we always have dj = 1, j = 1, . . . , M, and dj = 0, j = M + 1, . . . , M*. Before discussing details of the algorithm, we outline the updating scheme, which goes through the following steps, until convergence is judged. The notation x|y, z indicates that x is being updated using the current values of y and z; γ−jk denotes γ without
γjk, and d−j = (d1, . . . , dj−1, dj+1, . . . , dM+1), and so forth. Updating details are discussed below.

1. γjk | γ−jk, M, ν, D, for j = 1, . . . , M + 1, k = 0, . . . , p.
2. dj | d−j, γ1, . . . , γM, γM+1, ν, D, for j = 1, . . . , M + 1.
3. β1, . . . , βM, λ | γ1, . . . , γM, M, ν, D.
4. ν | β1, γ1, . . . , βM, γM, λ, D.

In step 1, we marginalize over (λ, β) and include γM+1. Conditional on M and the hyperparameters, the conditional posterior on γM+1 is just the N(µγ, Sγ) prior. All other γj's (j = 1, . . . , M) are updated through Metropolis steps, similar to step 2 in section 2.2. Also, the comments in section 2.3 about resampling under the order constraint on the γj in the fixed architecture model apply equally for the variable architecture model. Randomly select j ∈ {1, . . . , M} and k ∈ {1, . . . , p}, and generate a proposal γ̃jk ∼ g(γjk), with g(γjk) described below. If γ̃jk violates the order constraint, permute the indices appropriately to get γ̃′; otherwise γ̃′ = γ̃, where γ̃ is the γ vector with γjk replaced by γ̃jk. Compute

\[
a(\gamma, \tilde{\gamma}') = \min\left[1,\; \frac{p(D \mid \tilde{\gamma}', M, \nu)\, p(\tilde{\gamma}' \mid M, \nu)}{p(D \mid \gamma, M, \nu)\, p(\gamma \mid M, \nu)}\right].
\]

Use Lemma 2.1 to evaluate p(D|γ, ν). With probability a(γ, γ̃′), replace γ by the new candidate γ̃′; otherwise, leave γ unchanged. For the probing distribution g(·), we use a normal N(γjk, c²Cγ,kk) distribution, where Cγ,kk is the kth element on the diagonal of Cγ, and c is a fixed scaling factor. The specific choices for c used in our examples are reported in the appendix. Alternatively, one could consider proposals changing all coordinates of γj jointly, as we did in equation 2.4.

Step 2 refers to updating the number of hidden nodes. Again we marginalize over (β, λ). Denote with γ the list of regression parameters, including the (M + 1)st term: γ = (γ1, . . . , γM+1). Similarly, d = (d1, . . . , dM+1). To update d, we use a Metropolis step with the following probing distribution g(γ̃, d̃ | γ, d) to generate a proposal (γ̃, d̃). We include γ in the notation because the proposal might include a permutation of indices, which would affect the γ and the d vector. First, randomly select an index j ∈ {1, . . . , M + 1}, with uniform probability 1/(M + 1) each. Then define d̃j by flipping dj, that is, d̃j = 1 − dj. Third, permute the indices to maintain the order constraint on γjp and the constraint on the dj. The above three steps implicitly define g(γ̃, d̃ | γ, d). Note that by definition of g, we have g(γ̃, d̃ | γ, d) = 1/(M + 1) for all possible (γ̃, d̃), where M = Σj dj is the number of terms before generating the proposals. Also, g(γ, d | γ̃, d̃) = 1/(M̃ + 1), where M̃ = Σj d̃j is the number
of terms in the proposal. Having generated the proposal, evaluate

\[
a(\gamma, d, \tilde{\gamma}, \tilde{d}) = \min\left[1,\; \frac{p(\tilde{\gamma}, \tilde{d} \mid \nu, D)}{p(\gamma, d \mid \nu, D)} \cdot \frac{g(\gamma, d \mid \tilde{\gamma}, \tilde{d})}{g(\tilde{\gamma}, \tilde{d} \mid \gamma, d)}\right]
= \min\left[1,\; \frac{p(D \mid \tilde{\gamma}, \tilde{d}, \nu)\, p(\tilde{\gamma} \mid \tilde{d}, \nu)\, p(\tilde{d} \mid \nu)}{p(D \mid \gamma, d, \nu)\, p(\gamma \mid d, \nu)\, p(d \mid \nu)} \cdot \frac{M + 1}{\tilde{M} + 1}\right].
\]

With probability a(γ, d, γ̃, d̃), accept the proposal (γ̃, d̃) as a new value for (γ, d); otherwise keep (γ, d) unchanged. Note that for a proposal with M̃ = M − 1, we get p(γ̃ | d̃, ν)/p(γ | d, ν) = 1/M when using an order constraint on γ1p, . . . , γMp. This is because under the proposal (γ̃, d̃) only γj, j = 1, . . . , M − 1, are subject to the order constraint, as opposed to γj, j = 1, . . . , M, under the current parameter vector. For the same reason, p(γ̃ | d̃, ν)/p(γ | d, ν) = M̃ if the proposal increases the number of hidden nodes by one.

Step 2 can be repeated several times. If this is done, it is important to note two more implementation details. First, if a dj is set to zero, then the corresponding γj becomes γM+1, after reordering the indices and changing M, and is used for the next iteration. Second, if step 2 is repeated, say, T times, then step 1 needs to be modified to generate γM+j for j = 1, . . . , T.

It is important to marginalize over β when updating the d. Conditioning on β would lead to a very slowly mixing Markov chain and render the scheme of little practical value. For example, when updating dM+1, a move to d̃M+1 = 1, that is, M̃ := M + 1, would only rarely be accepted: only when, by chance, the weight βM+1, which was previously sampled from the prior, happens to be "right." Compare with the discussion in Example 3 and Figure 5. A further marginalization is possible by analytically integrating over µγ. We used this in our implementation but consider it far less critical than the marginalization over β.

In step 3, we sample all βj and λ jointly, drawing from the complete conditional posterior. The complete conditional posterior p(λ, β1, . . . , βM | γ1, . . . , γM, M, ν, D) is a multivariate normal posterior distribution in a linear normal regression model, as described in Lemma 2.1. Step 4 is unchanged from the fixed architecture case. Convergence of the algorithm follows from the arguments in Tierney (1994).

We illustrate the variable architecture model with two examples. The first shows how our model may cope with a multivariate problem. The second is structurally complicated and suggests how our model may adapt to sharp edges. The flexibility of this model for coping with these features makes it very competitive with respect to other smoothing methods, including the model in equations 2.2 and 2.3.
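As a compact illustration of step 2, the sketch below (ours; log_marg_lik and log_prior are hypothetical placeholders for the Lemma 2.1 marginal likelihood and the prior terms, and the permutation of the γ indices that accompanies a flip is elided) implements the flip proposal and its acceptance probability, including the (M + 1)/(M̃ + 1) proposal correction:

```python
import numpy as np

def update_indicators(gamma, d, nu, data, log_marg_lik, log_prior, rng):
    # One Metropolis update of the indicators, marginalizing over (beta, lambda).
    M = int(d.sum())
    j = rng.integers(0, M + 1)           # uniform over the first M+1 indicators
    d_new = d.copy()
    d_new[j] = 1 - d_new[j]              # flip: birth or death of one node
    d_new = np.sort(d_new)[::-1]         # restore the decreasing-order constraint
    M_new = int(d_new.sum())

    log_a = (log_marg_lik(gamma, d_new, nu, data) + log_prior(gamma, d_new, nu)
             - log_marg_lik(gamma, d, nu, data) - log_prior(gamma, d, nu)
             + np.log(M + 1) - np.log(M_new + 1))  # ratio of the two g's
    if np.log(rng.random()) < min(0.0, log_a):
        return d_new                     # accept (gamma reindexing omitted here)
    return d                             # reject: keep the current indicators
```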
Figure 5: Example 3 (robot arm). Predictive mean squared error (MSE) for the test data set. The figure plots MSE averaged over batches of 10 iterations versus iteration. The solid curve corresponds to simulations using the MCMC scheme described in section 3.2. The dashed curve shows the MSE for the same MCMC scheme, but without marginalizing over (β, λ) when updating γ and M. The horizontal dashed line indicates the predictive MSE after 20,000 iterations (at MSE = 0.00545).
Example 3: Robot Arm. This test problem is analyzed in MacKay (1992) and reanalyzed in Neal (1993, 1996). We have to learn a mapping from two real-valued inputs representing joint angles for a robot arm to two real-valued outputs that predict the resulting arm position, defined by

\[
y_{i1} = 2.0\cos(x_{i1}) + 1.3\cos(x_{i1} + x_{i2}) + \epsilon_{i1}, \qquad i = 1, \ldots, N,
\]
\[
y_{i2} = 2.0\sin(x_{i1}) + 1.3\sin(x_{i1} + x_{i2}) + \epsilon_{i2}, \qquad \epsilon_{ik} \overset{iid}{\sim} N(0, 0.05).
\]

To accommodate the bivariate response, we generalize the model in equation 3.1 to

\[
y_{ik} = x_i'\lambda_k + \sum_{j=1}^{M^*} d_j \beta_{jk} \psi(x_i'\gamma_j) + \epsilon_{ik}, \qquad i = 1, \ldots, N, \; k = 1, \ldots, K,
\]
\[
\epsilon_{ik} \overset{iid}{\sim} N(0, \sigma^2), \qquad \psi(\eta) = \tanh(\eta). \tag{3.4}
\]
Following MacKay (1992), we replace the logistic activation function by tanh(·). To avoid nonidentifiability in the likelihood, we add the constraint γj1 > 0.
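For readers who want to reproduce the setup, synthetic data of this form can be generated directly from the mapping above; in this sketch (ours) the joint-angle inputs are drawn from arbitrary uniform ranges, whereas MacKay (1992) specifies particular input regions, and N(0, 0.05) is read as a standard deviation of 0.05, consistent with the noise floor 2σ² = 0.0050 quoted below:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x1 = rng.uniform(-2.0, 2.0, N)        # illustrative input ranges only
x2 = rng.uniform(0.0, 3.0, N)
eps = rng.normal(0.0, 0.05, (N, 2))   # sd 0.05, so MSE floor = 2 * 0.05**2 = 0.0050
y1 = 2.0 * np.cos(x1) + 1.3 * np.cos(x1 + x2) + eps[:, 0]
y2 = 2.0 * np.sin(x1) + 1.3 * np.sin(x1 + x2) + eps[:, 1]
```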
Table 1: Example 3. Geweke's (1992) Convergence Diagnostic and Lag 10 Autocorrelations.

            With Marginalizing (λ, β)       Without Marginalizing (λ, β)
Variable    Geweke    Autocorrelation       Geweke    Autocorrelation
M           −1.81      0.81                 NA         0.99
σ            1.75      0.05                 13.40      0.97
µβ          −0.46      0.03                  2.59     −0.00
µγ,0        −1.28      0.24                 −1.86      0.03
µγ,1        −0.75     −0.03                 −2.63      0.01
µγ,2         0.08     −0.05                  0.31     −0.00
λ10          0.19      0.34                −21.80      0.99
λ11         −1.77      0.58                −12.40      0.90
λ12          1.87      0.70                −19.70      0.97
λ20         −0.53      0.27                −15.90      0.97
λ21          0.35      0.48                  7.63      0.94
λ22          0.48      0.91                −11.10      0.97

Note: Estimates are based on 20,000 iterations, discarding the first 1000 as burn-in and thinning out to every 10th iteration thereafter. Without marginalization, the simulated chain did not change M over the last 50% of the iterations, making evaluation of Geweke's convergence diagnostic impossible for M.
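For reference, the diagnostic reported in the table can be sketched as follows (our simplified version: Geweke, 1992, compares the means of an early and a late chain segment, standardized by spectral-density-based standard errors, for which we substitute naive i.i.d. estimates here):

```python
import numpy as np

def geweke_z(trace, first=0.1, last=0.5):
    # z-score comparing the first 10% of the trace with the last 50%;
    # |z| well above 2 suggests the chain has not reached stationarity.
    n = len(trace)
    a, b = trace[: int(first * n)], trace[int((1.0 - last) * n):]
    se2 = a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    return (a.mean() - b.mean()) / np.sqrt(se2)
```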
Without this constraint, one could change (βj, γj) to (−βj, −γj) without changing the likelihood. If the prior is symmetric around 0, the posterior distribution would remain invariant under such transformations. The prior model for (λk, γj, βjk) remains the independent normal model (see equation 3.3), with hyperparameters as in the appendix.

We used the same data set as MacKay (1992). We split the data into a training data set (the first 200 observations) and a test data set (the last 200 observations). Figure 5 reports the mean squared predictive error for the test data set as a function of the number of iterations. After around 300 iterations, the mean squared error is already close to the asymptotic value 0.00545 (note the theoretical minimum 2σ² = 0.0050), indicating that short run lengths of several hundred iterations are sufficient for predictive purposes. However, to monitor convergence diagnostics on some selected parameters, we needed 20,000 iterations to achieve practical convergence. Details are reported in Table 1. The estimated marginal posterior probabilities p(M|D) for the number of hidden nodes are 0.22, 0.47, 0.27, and 0.04 for M = 6, 7, 8, and 9, respectively. For the Markov chain Monte Carlo scheme to be of practical use, the marginalization over (β, λ) is crucial. This is illustrated in Figure 5 and the right column of Table 1.
Figure 6: Example 4 (reservoir). Data and fitted surface using the NN model of section 3. The solid triangles indicate the data points.
Example 4: Reservoir Management. We apply the described methods to a case study coming from Rios Insua and Salewicz (1995). The example concerns a reservoir operation problem, complicated by the existence of multiple objectives, uncertainty about the inflows, and the effect of time. The decisions to be made each month were the volumes of water to be released through turbines and spilled, based on maximizing a predictive expected utility. There was no analytic expression for the expected utility, so we could appeal to an optimization method based on function evaluations only, such as the Nelder-Mead method. Alternatively, we could evaluate the expected utility at a grid of controls, fit a surface, and optimize it. We illustrate this last approach, fitting an NN model. Figure 6 shows the data and the fitted surface. Note how the NN model fitted the sharp edge in front, a case in which many commonly used smoothing methods might fail. Figure 7 shows the marginal posterior on M. Table 2 reports convergence diagnostics. In addition to the normal prior (see equation 3.3), we constrained the γjk by |γjk| < 10.0 to avoid numerical problems. Otherwise, proposals for the γ vector could lead to degenerate design matrices in the regression problems required for the evaluation of p(γ|ν) (see Lemma 2.1).
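In outline, the grid-based strategy described above looks as follows (a sketch under our own naming; expected_utility stands for the simulation-based utility evaluation and fit_nn_surface for the fitted NN posterior mean, neither of which is shown here; a single control coordinate is used for brevity, whereas the reservoir problem has two):

```python
import numpy as np

def optimize_on_grid(expected_utility, fit_nn_surface, grid):
    # Evaluate the expensive utility by simulation at a coarse grid only.
    u = np.array([expected_utility(c) for c in grid])
    # Fit a smooth surrogate surface to the (control, utility) pairs.
    surface = fit_nn_surface(grid, u)
    # Optimize the cheap surrogate on a fine grid instead of the utility itself.
    fine = np.linspace(grid.min(), grid.max(), 10 * len(grid))
    return fine[np.argmax(surface(fine))]
```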
Figure 7: Example 4 (reservoir). Posterior p(M|D) on the size of the hidden layer. (a) The estimated posterior distribution p(M|D). (b) The trajectory of M along the Markov chain.

Table 2: Example 4. Geweke's (1992) Convergence Diagnostic and Lag 10 Autocorrelations.

Variable    Geweke    Autocorrelation
M            0.29      0.67
σ           −0.07      0.27
µβ          −1.35      0.00
µγ,0        −0.87      0.25
µγ,1        −1.63      0.38
µγ,2        −1.44      0.19
λ0          −1.88      0.18
λ1           0.96      0.51
λ2           0.38      0.22

Note: Estimates are based on 40,000 iterations, discarding the first 1000 as burn-in and thinning out to every tenth iteration thereafter.
4 Discussion

Neural network models are used to model nonlinear features in problems like approximation, regression, smoothing, forecasting, and classification. Although they are typically presented as black box models, allowing the incorporation of prior knowledge in those models enhances their performance. This naturally calls for a Bayesian approach to NN models. Among other advantages, this allows for the coherent incorporation of all uncertainties, including those relating to the hidden layer size. This approach, however, leads to difficult computational problems.

Specifically, we have noted potential problems of normal approximation-based approaches due to multimodality. As an alternative, we have provided a powerful Markov chain Monte Carlo scheme that avoids those problems and permits routine Bayesian analysis of FFNN models. The scheme allows the consideration of variable architecture networks and consequently automates the choice of the network architecture. We have also shown that
the scheme allows the combination with more conventional models such as linear regression.

In summary, we have provided a general framework for the Bayesian analysis of FFNN models. Future work will deal with a somewhat inverse problem: how FFNN models enhance the Bayesian tool kit. In particular, from a statistical modeling point of view, NNs are very close to mixture models. Many issues about posterior multimodality and computational strategies in NN modeling are of relevance in the wider class of mixture models (see Escobar & West, 1995; West, Müller, & Escobar, 1994; West & Turner, 1994). For example, we could explore the potential of our framework when dealing with uncertainty in the number of components of a mixture model.

Appendix: Implementation, Initialization, and Convergence Diagnostic

In the examples, we have used the following initialization and hyperparameters. The covariates were standardized to have x̄i = 0 and var(xi) = 1.0 (except for the dummy intercept xi0 = 1). In Examples 1 and 2, we fixed the hyperparameters µβ, µγ, σβ, and Sγ at µβ = µγ,j = 0, σβ² = 10,000, and Sγ = diag(25, 25). In Examples 3 and 4, we used initial values µβ = µγ,j = 0, σβ² = 10, and Sγ = diag(4, 10, 10). The sample variance σ² was fixed to σ² = 100 and 1.0 in Examples 1 and 2, respectively, and initialized as σ² = 0.0025 and σ² = 0.5 in Examples 3 and 4. The remaining hyperparameters in Examples 3 and 4 were chosen as aβ = aγ,j = 0, Aβ = 1, Aγ = diag(1, 1, 1), cb = 11, cγ = 13, Cb = σ², and Cγ = Sγ. In Examples 3 and 4, we initialized α = 0.2 and used hyperparameters aα = bα = 1. The prior on M was truncated at M* = 20, and M was initialized with M = 3. For the scaling parameter c in the probing distribution for γ̃jk, we used c = 0.1 in Examples 3 and 4.

We simulated 10,000, 10,000, 20,000, and 40,000 iterations in Examples 1, 2, 3, and 4, respectively. The decision to terminate simulations was based on the convergence diagnostic proposed by Geweke (1992); compare with Tables 1 and 2. The relatively long simulation lengths in the simple examples (1 and 2) were required to obtain sufficiently large Monte Carlo posterior samples for the posterior scatter plots. In Examples 1 and 2, we discarded the initial 100 iterations as burn-in and saved every tenth iteration thereafter to collect an approximate posterior Monte Carlo sample, used for the ergodic averages. In Examples 3 and 4, we discarded the first 1000 as burn-in and saved every tenth.

Acknowledgments

Research was supported by grants from the National Science Foundation, CICYT, and the Iberdrola Foundation. It was completed while P.M. visited
the Department of Artificial Intelligence of Madrid Technical University and D.R.I. was visiting CNR-IAMI. We are very grateful for the numerous remarks from the referees and for discussions with Fabrizio Ruggeri.

References

Amari, S. (1993). Mathematical methods of neurocomputing. In O. E. Barndorff-Nielsen, J. L. Jensen, & W. S. Kendall (Eds.), Networks and chaos. London: Chapman and Hall.
Beale, R., & Jackson, T. (1990). Neural computing. Bristol: Hilger.
Berger, J. O. (1990). Robust Bayesian analysis: Sensitivity to the prior. Journal of Statistical Planning and Inference, 25, 303–328.
Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Besag, J., & Green, P. J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, Series B, 55, 25–37.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.
Buta, R. (1987). The structure and dynamics of ringed galaxies, III. Astrophysical Journal Supplement Series, 64, 1–37.
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective (with discussion). Statistical Science, 9, 2–54.
Cybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control Systems and Signals, 2, 303–314.
Escobar, M. D., & West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90, 577–588.
Gelman, A., Roberts, G. O., & Gilks, W. R. (1996). Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5. Oxford: Oxford University Press.
Geweke, J. (1992). Evaluating the accuracy of sampling based approaches to the calculation of posterior moments. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4. Oxford: Oxford University Press.
Lavine, M., & West, M. (1992). A Bayesian method for classification and discrimination. Canadian Journal of Statistics, 20, 451–461.
Lindley, D., & Smith, A. F. M. (1971). Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B, 34, 1–41.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448–472.
MacKay, D. J. C. (1995). Bayesian methods for neural networks: Theory and applications (Technical Rep.). Cambridge: Cavendish Laboratory, Cambridge University.
Müller, B., & Reinhardt, J. (1990). Neural networks. Berlin: Springer-Verlag.
Neal, R. M. (1993). Bayesian learning via stochastic dynamics. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems 5. San Francisco: Morgan Kaufmann.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Raftery, A. E., Madigan, D. M., & Volinsky, C. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 5. Oxford: Oxford University Press.
Rios Insua, D., & Salewicz, K. A. (1995). The operation of Kariba Lake: A multiobjective decision analysis. Journal of Multicriteria Decision Analysis, 4, 203–222.
Ripley, B. D. (1993). Statistical aspects of neural networks. In O. E. Barndorff-Nielsen, J. L. Jensen, & W. S. Kendall (Eds.), Networks and chaos. London: Chapman and Hall.
Robert, C. P. (1994). The Bayesian choice. New York: Springer-Verlag.
Spiegelhalter, D. J., Thomas, A., & Gilks, W. R. (1994). BUGS manual. Cambridge: MRC Biostatistics Unit, IPH.
Stern, H. S. (1996). Neural networks in applied statistics. Technometrics, 38, 205–220.
Thodberg, H. H. (1996). Review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56–72.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701–1762.
Wang, Y. (1995). Unpredictability of standard back propagation neural networks. Management Science, 41, 555–559.
West, M., Müller, P., & Escobar, M. D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In A. F. M. Smith & P. R. Freeman (Eds.), Aspects of uncertainty: A tribute to D. V. Lindley. New York: Wiley.
West, M., & Turner, M. D. (1994). Deconvolution of mixtures in analysis of synaptic transmission. The Statistician, 43, 31–43.
770
Peter Muller ¨ and David Rios Insua
Neal, R. M. (1996). Bayesian learning for neural networks, New York: SpringerVerlag. Raftery, A. E., Madigan, D. M., Volinsky, C. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistcs 5. Oxford: Oxford University Press. Rios Insua, D. & Salewicz, K. A. (1995). The operation of Kariba Lake: A multiobjective decision analysis. Journal of Multicriteria Decision Analysis, 4, 203–222. Ripley, B. D. (1993). Statistical aspects of neural networks. In O. E. BarndorfNielsen, J. L. Jensen, & W. S. Kendall (Eds.), Networks and chaos. London: Chapman and Hall. Robert, C.P. (1994). The Bayesian choice, New York: Springer-Verlag. Spiegelhalter, D.J., Thomas, A., & Gilks, W.R. (1994). BUGS Manual. Cambridge: MRC Biostatistics Unit, IPH. Stern, H.S. (1996). Neural networks in applied statistics. Technometrics, 38, 205– 220. Thodberg, H. H. (1996). Review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56–72. Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701–1762. Wang, Y. (1995). Unpredictability of standard back propagation neural networks. Managment Science, 41, 555–559. West, M., Muller, ¨ P., & Escobar, M. D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In A. F. M. Smith & P. R. Freeman (Eds.), Aspects of uncertainty: A tribute to D. V. Lindley. New York: Wiley. West, M., & Turner, M. D. (1994). Deconvolution of mixtures in analysis of synaptic transmission. The Statistician, 43, 31–43.
Received March 6, 1996; accepted August 5, 1997.
REVIEW
Communicated by P. Read Montague
Computational Models of Neuromodulation Jean-Marc Fellous Brandeis University, Volen Center for Complex Systems, Waltham, MA 02254-9110, U.S.A.
Christiane Linster Harvard University, Department of Psychology, Cambridge, MA 02138, U.S.A.
Computational modeling of neural substrates provides an excellent theoretical framework for the understanding of the computational roles of neuromodulation. In this review, we illustrate, with a large number of modeling studies, the specific computations performed by neuromodulation in the context of various neural models of invertebrate and vertebrate preparations. We base our characterization of neuromodulations on their computational and functional roles rather than on anatomical or chemical criteria. We review the main frameworks in which neuromodulation has been studied theoretically (central pattern generation and oscillations, sensory processing, memory and information integration). Finally, we present a detailed mathematical overview of how neuromodulation has been implemented at the single cell and network levels in modeling studies. Overall, neuromodulation is found to increase and control computational complexity.
1 Introduction

Organisms, from invertebrates to mammals, exhibit diverse behaviors when coping with their environments. Correspondingly, the nervous systems of these organisms can differ significantly in their organization and cellular components. Despite such cross-species variability, computational models of nervous systems have shown that complex computations can emerge from the interaction of relatively simple circuits of neurons. A typical connectionist model, for example, involves a transfer function computing the output of the neuron given the sum of its inputs and a synaptic learning rule determining how the strength of synaptic connections is updated. With this type of simple model, a variety of behavioral functions have been modeled, providing insights into how complex phenomena, such as perception, memory, and motor control, can be explained in terms of simple neural mechanisms. Simple models, however, often fail to capture important aspects of neural processing such as neuromodulation (Cooper, Bloom, &
Neural Computation 10, 771–805 (1998)
© 1998 Massachusetts Institute of Technology
Roth, 1991; Harris-Warrick & Marder, 1991; Hasselmo, 1995; Kaczmarek & Levitan, 1987). In addition to the classic excitatory and inhibitory neurotransmission, such as that mediated by glutamate or GABA, a large number of biophysical processes serve to modify the response of a neuron to a given input signal or to alter the input signals before their arrival. These modulatory effects often involve substances such as acetylcholine (ACh), norepinephrine (NE), histamine, serotonin (5-HT), dopamine (DA), and a variety of neuropeptides. Although these substances are known to act at different types of receptors, originate from different structures, and have different spatial distributions and time courses of action, they have at least one of the following three functional effects: modulation of intrinsic neural properties (such as input-output function or threshold), modulation of afferent properties (such as strengthening some neural inputs rather than others), or modulation of efferent properties (such as presynaptic modulation of release). At the behavioral level, such modulations can profoundly affect the function of the nervous tissue involved.

Much is now known about the detailed action of neuromodulatory substances and their agonists and antagonists at the level of small circuits, single neurons, single synapses, or single channels. On the other hand, psychopharmacologists have examined the effects of many drugs that affect various neuromodulatory systems on behaviors such as perception, learning and memory, and motor control. Because of the wider use of modeling techniques and growing interest in systems neuroscience, the computational role of neuromodulation in information processing is receiving increased attention in both the modeling and experimental communities. As we will suggest, the study of neuromodulation may help bridge the gap between elementary neural principles and behavior.

Computational models provide a formal framework in which the function of a neuron or a group of neurons can be expressed rigorously. In general, neural dynamics is represented as a set of equations with variables and parameters. Variables are determined by both the level of description of the model (concentration, membrane potential, firing rate, etc.) and the function under study. Parameters are potential neuromodulatory factors. They are diffuse (nonspecific to each neuron) and are assumed to change more slowly than variables, so that keeping them constant (or very slowly varying) will not perturb the function of the network. In this formalism, neuromodulation appears as a means of changing the way the function is achieved, without changing the function itself. However, not all parameters have a biological meaning. Some are abstract placeholders used to make up for the lack of knowledge about the details of a particular phenomenon (the learning rate, for example); others ensure the ad hoc goodness of fit ("tuning") of the model to certain experimental data that are not the primary targets of the model (the "time constants" of synaptic alpha functions, for example). Moreover, not all the parameters that have putative biological correlates have
identified neuromodulatory roles in situ. For example, some chemical time constants might characterize complex biophysical mechanisms that are normally severely constrained ("regulated") and consequently have no neuromodulatory function. Furthermore, not all neuromodulatory phenomena can be represented by simple parameter changes.

The purpose of this article is to highlight, through a targeted review of the modeling literature, some of the basic computational roles assigned to neuromodulation and present their possible neural implementation. Due to the diversity and ubiquity of neuromodulatory phenomena, we will not provide a comprehensive review of all neuromodulatory systems in terms of their anatomical loci, detailed biochemical pathways, and individual physiological effects. Nor will we attempt to define neuromodulation; rather, we will review it according to the computational framework provided by a chosen set of modeling studies. Our intent is not to be exhaustive. Many models not mentioned here have discussed how specific neuromodulations can be implemented and how they affect particular aspects of the neural system they consider. We include here a selection of studies that have dealt explicitly with neuromodulation and will help readers understand a specific computational role of neuromodulation.

In the first section, we characterize neuromodulation on the basis of its spatial origin (extrinsic or intrinsic), its functional coupling with neural computation (tuning versus regulation), and its time course. We then review in more detail the computational role of neuromodulation in three important classes of models that address issues pertaining to oscillations and synchrony in small and large networks, sensory processing, and memory function. Finally, in the appendix, we give a detailed mathematical account of the way neuromodulation has been implemented in the various modeling frameworks reviewed.

2 Characterizing Neuromodulation

Neuromodulations can be described by their spatial and temporal characteristics, and in the computational framework chosen here, they can also be characterized by their level of coupling with the specific neural computations under consideration.

2.1 Extrinsic and Intrinsic Neuromodulation. A first class of neuromodulatory signals may originate from an area extrinsic to the neural substrate whose computation is under study, so that lesioning the neuromodulatory center does not usually perturb the function itself, but only modifies its quality. The computational functions of such extrinsic neuromodulation are expected to be somewhat global, because they usually influence many functionally different sites simultaneously. A second class consists of neuromodulations that originate in the relevant substrate itself or in a distant site but are controlled locally within the substrate. In such systems, neuromodulation
is an integral part of the computation. Cotransmission (Brezina, Orekhova, & Weiss, 1996; Chan-Palay & Palay, 1984; Kupfermann, 1991; Marder, Christie, & Kilman, 1995), presynaptic receptors (Marder, 1996; Starke, Gothert, & Kilbinger, 1989), glial modulation (Hansson & Ronnback, 1994), and volume transmission (Fuxe & Agnati, 1991; Ridet, Rajaofetra, Teilhac, Geffard, & Privat, 1993) are examples of such phenomena. The functions of such intrinsic modulations are more specific to the substrate under consideration (Katz & Frost, 1996).

2.1.1 Extrinsic Neuromodulation. In many models, the origin of the modulation is known but does not depend in general on the computation of the substrate being modulated. Rather, it depends on the parallel activity of functionally distinct systems, extrinsic to the substrate. Such is the case of most neuromodulatory centers releasing specific neuroactive substances that modify the cellular and synaptic properties of their targets. Most of the actions of dopamine (Cooper, 1991) and norepinephrine (van Dongen, 1981) fall into this category. Here, we illustrate this point with a recent model of sequence learning in hippocampal region CA3 showing that computations may crucially depend on the extrinsic modulation by GABAergic and cholinergic inputs from the septum (Wallenstein & Hasselmo, 1997b). In this large multicompartmental model, CA3 interneurons receive external periodic (4–10 Hz) inhibitory GABAergic signals from the septum (itself not modeled), while pyramidal cell and interneuron excitability is increased by steadily lowering their leak potassium conductance, simulating the cholinergic influences of the septum. In this modulatory regime, interneurons spontaneously fire gamma (30–100 Hz) bursts of action potentials at the theta (4–10 Hz) frequency externally imposed by the septum (Wallenstein & Hasselmo, 1997a). This pattern of firing in turn entrains the pyramidal cell network at theta frequency, yielding an overall network behavior compatible with much in vivo and in vitro experimental data.

The emerging theta-gamma pattern of interneuronal GABAergic activation results in a periodic activation of GABAB receptors on pyramidal cells: GABAB activation is greatest at the start of each theta cycle and decreases smoothly until the end of each cycle. Because GABAB receptors primarily control synaptic activation at intrinsic (CA3 recurrent collaterals) rather than extrinsic (sensory) pyramidal inputs, their net effect is to modify periodically the balance between internal and external information processing. Sensory inputs dominate at early phases of the theta cycle; intrinsic inputs dominate at later phases. This pattern of modulation is shown to be crucial to the computations of CA3 in that it allows for the development of place fields and for the learning and recall of sequence information, modeled as a path learned by a rat running on a linear track. Without GABAB modulation, the network still functions, but it is qualitatively impaired: place fields do not develop, and significant errors are made during the recall of learned sequences.
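A toy rendering of this periodic balance (our own caricature, with an arbitrary exponential decay; not code or constants from Wallenstein & Hasselmo, 1997b) makes the effect easy to see: early in the theta cycle, GABAB suppression of the recurrent input is maximal and sensory input dominates, while late in the cycle the recurrent input recovers:

```python
import numpy as np

def effective_drive(phase, sensory, recurrent, g_max=1.0, tau=0.3):
    # phase in [0, 1): position within one theta cycle.
    g_b = g_max * np.exp(-phase / tau)        # GABA-B activation decays over the cycle
    return sensory + (1.0 - g_b) * recurrent  # GABA-B gates only the recurrent term

for phase in (0.0, 0.25, 0.5, 0.75):
    print(phase, effective_drive(phase, sensory=1.0, recurrent=1.0))
```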
In this example, extrinsic neuromodulation acts as a separate clocking device whose net effect is to improve the nature of information processing in the CA3 region of the hippocampus. Both the timing of the modulatory signal (theta frequency) and its pharmacological consequences (GABAB receptor activation) are important and generate testable predictions as to what might happen if either is modified. Other models have viewed extrinsic modulation as a signal influencing synaptic mechanisms. Such is the case of the reward signal entering the weight modification rule between the ventral tegmental area (VTA) and cortex (Montague, Dayan, & Sejnowski, 1996), discussed in the following sections, or the direct change of synaptic efficacy triggered by an external center (Linster & Gervais, 1996; Linster & Masson, 1996; Raymond, Baxter, Buonomano, & Byrne, 1992).

2.1.2 Intrinsic Neuromodulation. In some interesting instances, it is not possible to isolate the neuromodulatory phenomenon from the system it modulates. In such cases, neuromodulation is intrinsic to the network whose computation is under study. Unfortunately, to our knowledge, there are no direct modeling studies of such phenomena. Some experimental evidence for intrinsic neuromodulation is reviewed elsewhere (Katz & Frost, 1996). We briefly mention two examples. In the stomatogastric ganglion (STG) of the lobster, an afferent axon (SNAX1) has been characterized both as a participant in the rhythmicity of the gastric mill network and as a conveyor of modulatory information (Nusbaum, Weimann, Golowasch, & Marder, 1992). SNAX1 receives (inhibitory) synaptic inputs from the STG and is capable of initiating action potentials (intrinsically, within the STG and not near the cell body, a few centimeters away), which generate excitatory postsynaptic potentials (EPSPs) on the STG elements, therefore participating in the generation of the rhythm. However, because SNAX1 is also electrically coupled with key neurons of the central pattern generator, its level of depolarization (whether or not action potentials are present) modulates the activity of the network. Similarly, in Tritonia, the dorsal swim interneuron DSI (a serotonergic central pattern generator (CPG) neuron) is known to enhance synaptic transmission presynaptically at synapses made by a key CPG neuron (Katz & Frost, 1995b, 1996; Katz, Getting, & Frost, 1994). DSI elicits both a fast, neurotransmitter-like EPSP and a slow, neuromodulatory-like EPSP (Katz & Frost, 1995a), both pharmacologically separable. DSI therefore modulates the oscillatory pattern it is contributing to.

It is, of course, possible to envision dual extrinsic and intrinsic neuromodulations, whereby the former would express state or stimulus dependency and the latter would be activity dependent. In the computational framework of modeling studies, extrinsic neuromodulations can be easily implemented by choosing appropriate sets of parameters (tuning), whereas intrinsic neuromodulations require that the neuromodulatory mechanisms be regulated by the computations under consideration.
2.2 Regulation and Tuning. Choosing a computational framework to study neuromodulation inherently places it within a larger continuum.

2.2.1 Regulation. At one extreme, when neuromodulation is tightly coupled with neural computations, it becomes regulatory, an integral part of the computations. Such is the case of second messenger systems described in a Markovian kinetics formalism (Destexhe, Mainen, & Sejnowski, 1994b) or of activity-dependent regulation of maximal conductances (LeMasson, Marder, & Abbott, 1993), which we briefly discuss next. Using a single-compartment model of the lateral pyloric neuron of the stomatogastric ganglion of the crab (Buchholtz, Golowasch, Epstein, & Marder, 1992), LeMasson et al. (1993) elegantly illustrate how neurons can maintain a given firing behavior in the face of perturbations such as changes in extracellular K+ concentrations or sudden shifts in certain membrane current maximal conductances. This is achieved by making the intrinsic properties of the neuron (maximal conductances) dependent on the intracellular calcium concentration, and hence indirectly on previous activity. This feedback regulation ensures that conductances are stable and that the firing pattern of the cell (silent, bursting, or tonically firing) is preserved. The authors propose that this regulation, because it happens on a relatively slow timescale, could correspond physiologically to calcium regulation of channel synthesis, insertion, or degradation. Interestingly, in this particular model, the same mechanism that regulates the firing pattern in the face of certain perturbations may also change it in the face of other perturbations, such as external patterns of stimulation, therefore increasing the complexity of the input-output relationship of the cell.
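In the spirit of this scheme, a minimal sketch (ours; the sigmoid, time constant, and target are illustrative stand-ins, not the fitted values of LeMasson et al., 1993) lets each maximal conductance relax slowly toward a calcium-dependent target:

```python
import numpy as np

def regulate_conductance(g_bar, ca, ca_half=0.1, slope=0.02, tau=1000.0,
                         inward=True, dt=1.0):
    # Sigmoidal target: falls with rising calcium for inward (depolarizing)
    # currents and rises for outward ones, implementing negative feedback.
    s = 1.0 / (1.0 + np.exp((ca - ca_half) / slope))
    target = s if inward else 1.0 - s
    # Slow first-order relaxation: regulation is much slower than the voltage
    # dynamics, as the channel synthesis/insertion interpretation requires.
    return g_bar + dt * (target - g_bar) / tau
```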
2.2.2 Tuning. At the other extreme, when neuromodulation is entirely decoupled from the network under study, its actual implementation becomes a matter of parameter tuning. Such is the case of the choice of particular parameter sets that yield different bursting modes in invertebrate pattern generators (Epstein & Marder, 1990) or different cell frequency adaptation characteristics in piriform cortex (Barkai & Hasselmo, 1994), as we discuss next. In slices, piriform cortex pyramidal cells can generally be classified into strongly adapting or weakly adapting cells, depending on their response to long constant depolarizing current pulses (Barkai & Hasselmo, 1994). This difference in firing frequency adaptation may influence the computations at hand. Carbachol, a muscarinic receptor agonist, has been found to decrease the spike frequency adaptation of pyramidal cells and, in effect, switches strongly adapting cells into weakly adapting ones. On the basis of the experimental finding that carbachol primarily modulates two membrane potassium currents, IK(AHP) and IK(M) (Madison, Lancaster, & Nicoll, 1987), Barkai and Hasselmo used a compartmental model and found that different values of the maximal conductances of these two
currents, as well as different values for membrane resistance, would reproduce the range of spiking adaptation behaviors identified experimentally. These authors therefore found two distinct parameter tunings that characterize strongly and weakly adapting cells, and they used the weakly adapting tuning to model cells undergoing cholinergic modulation. The computational role of cholinergic neuromodulation is then illustrated in the context of associative memory (Barkai, Bergman, Horwitz, & Hasselmo, 1994) and will be discussed in more detail in a later section.

Regardless of whether neuromodulation is extrinsic or intrinsic, it is possible, if not desirable, to build a model that implements the effects of neuromodulation using external parameter tuning first. As more data about the neuromodulatory processes are made available, the model can be modified to include mechanisms that could trigger the parameter changes in an internal manner (e.g., as a function of network activity). Such is the case, for example, of the regulation of acetylcholine modulation by overall network activity (Hasselmo & Schnell, 1994) rather than by parameter set switching (Barkai et al., 1994).

2.3 Time Course of Neuromodulation. Depending on the function implemented, neural computation may follow different time courses, from a few milliseconds to several minutes or hours. For the neuromodulation of such circuits to be relevant to the computation, it must be adapted to its timing.

2.3.1 Fast Computations, Slow Modulation. Most computational studies that feature neuromodulation implement it as a slow and diffuse process, tonically changing some aspects of membrane or synaptic properties. Such is the case for acetylcholine and its steady depolarizing effects on pyramidal cells and interneurons in CA3 (Wallenstein & Hasselmo, 1997a) or for its slow activity-dependent modulation of CA1 between learning and recall modes (Hasselmo & Schnell, 1994). We discuss these models further in a later section.

2.3.2 Slow Computations, Fast Modulation. Some types of slow computations may be influenced by fast modulations. In a series of experiments in monkeys, Schultz, Apicella, and Ljungberg (1993) showed that during learning, VTA dopaminergic neurons (A9–A10) exhibit transient increases in activity lasting less than 150 ms in response to behaviorally relevant signals, such as reward delivery or conditioned stimulus presentation. Monkeys were tested in several behavioral paradigms. Here, we focus on the delayed response task, in which monkeys are taught to memorize one of two spatially distant targets for a variable (2.5–3.5 sec) amount of time and to respond by arm movement at the onset of a trigger signal. Correct responses are rewarded. During learning, VTA activity increases transiently after the presentation of the target signal and shortly after the delivery of the reward, regardless
of the delay introduced. This firing behavior is in marked contrast to some of their neural targets, such as the striatum or prefrontal cortex, whose neural activity may be tonically increased during the whole delay period (Goldman-Rakic, Lidow, Smiley, & Williams, 1992). Moreover, the increase of VTA activity after reward appears to be present only during learning, and not once the animal has acquired the task. These data suggest that the transient actions of DA after reward delivery may be specifically involved in learning. The precise duration of the postsynaptic effects of the release of dopamine in the prefrontal cortex during such a task is not known, but it might be as short as 100–200 ms (Jay, Glowinski, & Thierry, 1995). Insofar as one considers that the performance of the delayed response task is a slow process (lasting up to 4 sec), a 150-ms phasic involvement of the dopaminergic system appears as a fast modulatory process influencing a slow sequence of neural computations.

In a model of VTA activity, Montague et al. (1996) propose a way in which DA neurons may transiently affect learning. Their model suggests that DA signals δ(t) carry composite information about the external reward r(t) and about internal fluctuations between present, V(t), and immediately past, V(t − 1), sensory cortical signals. This DA signal is used to modulate the rate of change of the synaptic weights, which link cortical signals x(t) to dopaminergic neurons. Mathematically, this modulation is expressed as a transient change of learning rate, which tends to reduce the amount of excitation forwarded to the dopaminergic neurons as learning develops, compatible with experimental data (Schultz et al., 1993), and following the general idea of temporal difference learning (Sutton & Barto, 1990):

\[
\Delta w_{\tau i} = \begin{cases} \eta\, x_i\, \delta(t) & \text{if } t = \tau \\ 0 & \text{if not,} \end{cases}
\]
with
\[
\delta(t) = r(t) + V(t) - V(t - 1) \quad \text{and} \quad V_i(t) = x_i\, w_{\tau i}(t), \qquad V(t) = \sum_i V_i(t).
\]
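A direct simulation of this scheme (our sketch; following the standard temporal-difference convention of Sutton & Barto, 1990, the prediction error at time t credits the weight of the preceding step, a timing detail the compressed notation above leaves implicit) reproduces the behavior described next, with the error migrating from the reward time to the cue time over trials:

```python
import numpy as np

T, t_cue, t_rew = 20, 5, 15       # trial length, cue onset, fixed reward time
eta, trials = 0.3, 500            # arbitrary learning rate and trial count
r = np.zeros(T); r[t_rew] = 1.0   # reward signal r(t)
x = np.zeros(T); x[t_cue:] = 1.0  # cortical signal, on from the cue onward
w = np.zeros(T)                   # one weight per within-trial time step

for _ in range(trials):
    V = x * w                     # predictions V(t)
    delta = np.zeros(T)
    for t in range(1, T):
        delta[t] = r[t] + V[t] - V[t - 1]
        w[t - 1] += eta * x[t - 1] * delta[t]  # credit the preceding step
# After training, delta peaks at the cue (t = 5), not at the reward (t = 15).
```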
Interestingly, this model chooses to label weights explicitly with space (i, the origin of cortical activity) and time (τ, relative to the start of each trial). In this paradigm, each weight codes for the occurrence of a particular cortical signal at a particular time within the experiment. The spatial diffusion of the DA signal is expressed by the fact that the same δ(t) affects all synaptic weights equally (it is not indexed by i) and by the fact that it is built on the basis of the sum of all cortical inputs, rather than specialized cortical inputs only (V rather than Vi). During the initial stages of learning, when weights are uniformly distributed, DA activity closely follows the temporal patterns of reward. During learning, if the time of reward is fixed (such as in the instructed spatial task; Schultz et al., 1993), the weights that code for the particular time (τr) of the reward will be strengthened, so that δ(τr) eventually vanishes. If the
time of reward is variable, as in a delayed spatial task, the activity of DA neurons will become small, but nonzero, around the mean of the times when reward was delivered. After learning, in both cases, DA activity becomes particularly significant when the initial target sensory cue is presented. DA neurons therefore learn to respond to the target sensory cue predictive of the reward rather than to the reward itself.

In this model, the timing characteristics of the modulation are crucial. Its short duration is directly related to the precision with which the prediction of reward is made. Phasic modulation is also important in other modeling studies involving GABAB receptors (Wallenstein & Hasselmo, 1997b) and in other experimental systems involving norepinephrine and the locus coeruleus response to attentional signals and novelty (Aston-Jones, Rajkowski, Kubiak, & Alexinsky, 1994; Rajkowski, Kubiak, & Aston-Jones, 1994; Sara, Vankov, & Herve, 1994).

3 Computational Aspects of Neuromodulation

3.1 Modulation of Oscillation and Synchrony. Neural computation is dynamic and modular and requires that functionally distinct structures communicate in a coordinated fashion. Experimental and theoretical evidence suggests that the generation and synchronization of oscillatory activity may be used to this effect (Gray, 1994). Invertebrate studies have been crucial in furthering our understanding of how both intrinsic membrane properties and synaptic interactions may contribute to the creation and modulation of rhythmic firing (Calabrese & De Schutter, 1992; Harris-Warrick & Marder, 1991). Vertebrate studies of the cortex have built on these results and have proposed ways in which oscillations may synchronize across functionally distinct structures (Gray, 1994). In this context, the neuromodulation of the generation and synchronization of oscillations is bound to play an important computational role.

3.1.1 Central Pattern Generators: Creating and Modulating Rhythmicity. A long tradition of experimental work in invertebrates has led to a detailed knowledge of the effects of various substances on the behavior of individual neurons and small networks of neurons (see, for reviews, Calabrese & De Schutter, 1992; Marder, 1996; Marder & Selverston, 1992). Most of these effects can be modeled by changes in the maximal conductance of one or more membrane currents. In these systems, attention is given to neurons whose putative function is to provide, through their rhythmic firing, timing signals necessary for one or several rhythmic motor behaviors (Pearson, 1993), such as chewing in crustaceans or hormone release during egg-laying behavior in Aplysia. These cells are often referred to as conditional bursters because of their ability to fire rhythmically, either intrinsically or under the influence of a small network of connected cells. Two examples can be found in the pyloric network of the crustacean stomatogastric ganglion and in the Aplysia bursting neuron R15.
In the STG of the lobster, various modulatory substances such as dopamine, pilocarpine, serotonin, or proctolin can elicit rhythmic burst firing. The mechanisms involved, even though they result in similar bursting behaviors, are by no means simple and depend on the particular substance applied. For example, tetrodotoxin (TTX) may block the effects of serotonin and octopamine but have no effect on the bursting evoked by dopamine and pilocarpine. One possibility is that each of these neuromodulatory effects is mediated by a particular change in the mix of membrane conductances of the cells (Harris-Warrick & Flamm, 1987), which may be studied theoretically using the Hodgkin-Huxley (HH) formalism (Hodgkin & Huxley, 1952; Rinzel & Lee, 1987). Epstein and Marder (1990) provide a model for the conditional bursting of the anterior burster (AB) neuron of the lobster STG and investigate the effects of the change of a selected set of maximal conductances on the oscillatory properties of the model. They are able to show that two different mixes of fast sodium, leakage, and voltage-dependent calcium maximal conductances were able to model the bursting behaviors of the AB neuron under various neuromodulatory conditions, and show why TTX has a different effect on two of these oscillatory modes. Kepler, Marder, and Abbott (1990) showed that, in addition to being intrinsically modulated, the frequency of the modeled AB cell might also depend on the state of follower neurons, provided that both neurons are coupled via gap junctions.

Unfortunately, the effects of isolated membrane conductances are often not accessible experimentally. To study the putative effect of pharmacological agents (expressed as continuous maximal conductance changes) on the oscillatory properties of this cell, researchers may then use different modeling techniques, such as exhaustive parameter searches (Bhalla & Bower, 1993; Foster, Ungar, & Schwaber, 1993) or dynamical systems theory (Guckenheimer, Gueron, & Harris-Warrick, 1993; Guckenheimer, Harris-Warrick, Peck, & Willms, 1997). Further experimental and theoretical studies focused on other STG neurons (Golowasch, Buchholtz, Epstein, & Marder, 1992). These models essentially consider neuromodulation to be extrinsic to the oscillatory circuit and implement it using parameter tuning. Interestingly, further work has attempted to show how maximal conductances may also be changed by intrinsic phenomena. LeMasson et al. (1993), for example, show how intracellular calcium concentrations can be used to implement the activity-dependent modulation of certain maximal conductances (Turrigiano, Abbott, & Marder, 1994). Their model shows that depending on the nature of the perturbations imposed onto the cells, this modulation can be regulatory (maintaining the behavior of the cells when extracellular [K+] is modified) or truly modulatory, by enriching the behavioral repertoire of the cell in response to external patterns of stimulation.

Most modeling studies of the extrinsic effects of neuromodulatory substances have addressed the problem at the level of maximal conductances by tuning them to different values.
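Schematically, this tuning approach amounts to indexing one fixed cell model by named conductance sets, as in the sketch below (ours; the numbers are placeholders rather than values from Epstein & Marder, 1990, and the cell is reduced to a leaky membrane with a crude inward drive instead of the full HH current equations):

```python
import numpy as np

# Neuromodulation as parameter tuning: each modulatory condition is a named
# set of maximal conductances handed to one fixed model cell.
CONDUCTANCE_SETS = {
    "control":   {"g_Ca": 0.6, "g_leak": 0.05},
    "dopamine":  {"g_Ca": 1.2, "g_leak": 0.03},
    "serotonin": {"g_Ca": 0.9, "g_leak": 0.05},
}

def membrane_trace(g, I_ext=0.5, C=1.0, E_leak=-65.0, dt=0.1, steps=2000):
    V = np.empty(steps); V[0] = E_leak
    for t in range(1, steps):
        # Stand-in dynamics: leak plus a crude calcium-like inward drive.
        I_ca = g["g_Ca"] * max(0.0, V[t - 1] + 40.0) * 0.01
        dV = (-g["g_leak"] * (V[t - 1] - E_leak) + I_ca + I_ext) / C
        V[t] = V[t - 1] + dt * dV
    return V

traces = {name: membrane_trace(g) for name, g in CONDUCTANCE_SETS.items()}
```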
Very few have actually modeled the
explicit effect of these substances on the conductances (Brezina et al., 1996; Butera, Clark, Canavier, Baxter, & Byrne, 1995). A different line of research in Aplysia, however, has achieved this. Burster neuron R15 in Aplysia has been studied in much detail, and its electrophysiological and biochemical properties have been investigated intensively (reviewed in Adams & Benson, 1985; Lechner, Baxter, Clark, & Byrne, 1996). Numerous mathematical models have been developed to explain the cellular basis of single-cell oscillatory activity and bursting. Some models went further and studied the extrinsic modulation of oscillatory dynamics by substances such as DA and 5-HT (Bertram, 1993, 1994; Butera et al., 1995), while others focused on the role of intrinsic modulation by calcium-dependent processes in conditioning (Gingrich & Byrne, 1987; Raymond et al., 1992).

A first series of studies used a simplified HH framework to model the effect of 5-HT as modifications in the conductance of a subthreshold K+ current (Bertram, 1993, 1994). As for the models of the AB neuron, these models show that changes in maximal conductance can modify the firing properties of R15 from silent to bursting and beating and that the sensitivity of the cell to synaptic inputs is increased. In a separate study, Butera et al. (1995) show that even though the apparent effects of DA and 5-HT on the firing properties of R15 are similar, its subsequent responses to depolarizing inputs differ. Effects of 5-HT and DA were implemented as a change in the conductance of two opposing currents: an anomalous delayed rectifier current and a slow inward Ca2+ current. Unlike the models mentioned above, this change is directly related to the concentration of extrinsic neuromodulators (see the appendix). Their dynamics are such that both 5-HT and DA can hyperpolarize the cell into silence. However, the subsequent response to a brief depolarizing current pulse elicits a burst of spikes if the cell was silenced with 5-HT and occasional single spikes if it was inhibited by DA, as observed experimentally. Because they make the concentration of these neuromodulatory substances explicit, the authors are able to show that although the effects of 5-HT and DA can be modeled as changes in maximal conductances, they cannot be understood without taking into account the indirect effects of other currents and second messenger systems (such as Ca2+ or cAMP). In turn, these indirect effects lead to further modeling that shows their functional importance.

Indeed, the roles of intracellular cAMP and Ca2+ are known to be important in activity-dependent neuromodulation in the context of associative classical conditioning in Aplysia (reviewed in Abrams & Kandel, 1988; Byrne, 1987). In a study using detailed representations of membrane parameters, Gingrich and Byrne have shown that intrinsic regulation of cAMP by Ca2+ in a single Aplysia sensory neuron can simulate the neural analogues of nonassociative learning and classical conditioning (Gingrich & Byrne, 1987). A subsequent study showed that a circuit of six neuron-like elements (including central pattern generators), some of which have synapses
modifiable according to an activity-dependent neuromodulation learning rule, can account for simple features of operant conditioning as well (Raymond et al., 1992).

3.1.2 Modulation of Rhythmicity in the Cortex: Toward Information Processing. Current research in the vertebrate cortex has indicated the functional importance of oscillation and synchronization (Gray, 1994; Singer, 1993). Experimental and theoretical evidence suggests their role in odor coding in the olfactory bulb, in feature integration in the visual cortex, in synaptic plasticity in the hippocampus, in attentive behaviors in somatomotor cortex (Gray, 1994), and in the gating of sensory information during awake and sleep states in the thalamocortical circuit (McCormick, 1992). Unfortunately, the computational role of neuromodulation in the generation and synchronization of these rhythms has rarely been studied from a modeling point of view. However, an interesting line of research in the thalamocortical loop is setting the stage for modeling work in other systems. In the past decade, tremendous breakthroughs have been achieved in the understanding of synchronized oscillations in the thalamocortical circuit (see de Carvalho, 1994; McCormick, 1992, for reviews). Their neuromodulation has been studied in vitro and in vivo, and their cellular mechanisms explored both experimentally and theoretically through computer simulations. The functional significance of the neuromodulation of this system is summarized next. In slow-wave sleep, with low cholinergic, serotonergic, noradrenergic, and histaminergic modulation, the thalamocortical system presents slow, spontaneous basal intrinsic and circuit oscillations (delta waves and spindle waves). During this state, cholinergic inhibition of thalamic interneurons is absent, resulting in massive inhibition of incoming sensory information, which is consequently only poorly transmitted to the cortex. The increase of cholinergic activation (but decrease of noradrenergic, serotonergic, and histaminergic activation) characteristic of rapid eye movement (REM) sleep results in an abolition of oscillatory activity and an increase of endogenous (without sensory inputs) phasic activity (ponto-geniculo-occipital [PGO] waves), thought to be at the origin of the pseudosensorial perceptions experienced during dream states. Finally, the tonic activation of all neuromodulatory systems (including cholinergic, noradrenergic, serotonergic, and histaminergic) results in complex patterns of activity and sets the stage for awake attentive cognitive processing. Understanding of the precise nature of the sensory processing in the thalamus, and of its modulation by neuromodulatory centers, is limited by the lack of understanding of the nature of the sensory codes themselves. However, understanding how oscillations are generated and how they propagate in a synchronized manner across the thalamic networks might help shed some light on the computations achieved by this structure. A line of experimental and theoretical work shows that the behavioral state-dependent rhythmic firing patterns of thalamocortical (TC) relay cells
depend on only a small number of membrane currents (McCormick & Huguenard, 1992) and a functionally intact group of inhibitory thalamic reticular (RE) cells. RE cells are capable of oscillating on their own in vivo, and a crucial role for their neuromodulation by NE or 5-HT has been proposed on experimental (McCormick & Wang, 1991) and theoretical (Destexhe, Contreras, Sejnowski, & Steriade, 1994a) grounds. By deactivating a potassium leak current, this extrinsic neuromodulation is able to depolarize RE cells so that GABAergic inhibitory postsynaptic potentials (IPSPs) received from other RE cells deinactivate the low-threshold Ca membrane current IT. This current triggers a rebound burst at the single-cell level, which generates network oscillations in the frequency range of spindle waves (Destexhe et al., 1994a). Through their influence on intracellular levels of G-protein (a second messenger), NE or 5-HT has the potential of switching a network of RE cells between quiescent and oscillatory states. Interestingly, the inclusion of TC cells in this network has prompted the study of a form of intrinsic activity-dependent neuromodulation (Destexhe, Bal, McCormick, & Sejnowski, 1996). In a model of synchronized oscillations and propagating waves in thalamic slices, Destexhe et al. (1996) show how the activity-dependent modulation (which they term upregulation) of a mixed cationic current Ih in TC cells contributes to the waning phase of the characteristic waning and waxing pattern of spindle oscillations. Whereas neuromodulation is often expressed as a change in maximal conductances, previous work on the STG has indicated how serotonin-mediated shifts in the voltage dependence of the activation curve of Ih could also contribute to the pattern of oscillations of an intrinsically oscillating cell (Golowasch et al., 1992; Harris-Warrick, Coniglio, Levini, Gueron, & Guckenheimer, 1995). In the STG model, shifts were artificially introduced and their effects studied. In this model, however, a different formalism is proposed, introducing an activity-dependent shift of the activation of the Ih current:

$$I_h = \bar{G}\,([O] + K[O_L])\,(V - E_{rev})$$

with the kinetic scheme

$$C \underset{\beta(V)}{\overset{\alpha(V)}{\rightleftharpoons}} O, \qquad P_u + 2\,Ca^{2+} \rightleftharpoons P_b, \qquad O + P_b \rightleftharpoons O_L,$$

where C, O, and OL are closed and opened forms of the h channel and Pu and Pb are unbound and bound forms of a slow intracellular regulating factor, which could be cAMP. The kinetics are such that the transition from OL to a closed state is very improbable, leading effectively to a locking of the OL fraction of the channels into the open state. This effect is responsible for a bounded shift of the activation curve of Ih toward depolarized values, as the intracellular calcium concentration is increased during bursting activity. Moreover, because K is chosen greater than 1, the binding of calcium
also triggers an increase in conductance. Both effects have been observed experimentally (Hagiwara & Irisawa, 1989). Because of the dependence of Ih kinetics on Ca, deactivation of Ih occurs only during low-frequency firing, when Ca does not accumulate. During bursts, the accumulation of calcium shifts the activation curve of Ih toward more depolarized states and keeps Ih active. During a burst, therefore, OL (and consequently Ih) increases, leading to a progressive afterdepolarization (ADP). The ADP eventually counteracts the IT-mediated rebound bursts, and the spindle oscillatory episodes are terminated. The subsequent slow return of Ih to its basal value results in an 8–10 sec refractory period during which further oscillations cannot be initiated. Evoked or spontaneous activity may ultimately restart the spindle episode, after a total waning phase of 15–25 sec, including the refractory period. In addition to contributing to the waning phase of spindle oscillation, the modulation of Ih also enables the synchronization of several independent colliding spindle waves into a single propagating wave (but see Contreras, Destexhe, Sejnowski, & Steriade, 1997, for in vivo data). Other forms of Ih modulation have been proposed elsewhere in the thalamus (Wallenstein, 1996) and in the STG of the lobster (Golowasch et al., 1992; Harris-Warrick et al., 1995). In the piriform cortex and olfactory bulb, oscillatory dynamics are modulated by noradrenergic and cholinergic afferents (Biedenbach, 1966; Bressler & Freeman, 1980). Liljenstrom and Hasselmo (1995) investigate the effects of cholinergic modulation on piriform cortex oscillatory dynamics. These include cholinergic suppression of neuronal adaptation, cholinergic suppression of intrinsic fiber synaptic transmission, and cholinergic enhancement of interneuron activity. Their model provides a basis for understanding the involvement of acetylcholine modulation in cortical EEG oscillations (Wilson & Bower, 1992). They demonstrate that the suppression of neuronal adaptation could explain the appearance of gamma oscillations in evoked potentials. They also find that such suppression of adaptation, when coupled with the other cholinergic effects mentioned above, was particularly effective in switching the network into spontaneous theta oscillations. These results are related to others in the hippocampus (Traub, Miles, & Buzsaki, 1992; Traub, Whittington, Colling, Buzsaki, & Jefferys, 1996; Wang & Buzsaki, 1996) that do not involve neuromodulation explicitly. The putative functional significance of neuronal adaptation, and its consequences for rhythmicity, is made apparent in later studies on learning and memory in the hippocampus and will be discussed separately.
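To make the waning mechanism concrete, the following sketch integrates the locking idea numerically. It is a minimal illustration under assumed parameters: the burst drive, rate constants, and the conductance boost K are illustrative choices, not values from Destexhe et al. (1996). Calcium entering during bursting epochs binds a slow regulating factor, a fraction of channels becomes locked open, the effective Ih conductance transiently grows, and the boost decays only slowly once bursting stops.

```python
import numpy as np

# Minimal sketch of the activity-dependent Ih upregulation described above:
# Ca2+ entering during bursts binds a slow factor (Pb), which locks a
# fraction of h channels open (OL) and boosts the effective conductance.
# All rate constants and the burst drive are illustrative assumptions.
dt = 1.0                                       # ms
t = np.arange(0, 30000, dt)                    # 30 s of simulated time
ca = np.where((t % 10000) < 2000, 1.0, 0.0)    # hypothetical bursting epochs
K = 2.0        # extra conductance of locked channels (K > 1, as in the text)

o = 0.05       # ordinary open fraction (held fixed for simplicity)
pb, ol, g = 0.0, 0.0, []
for ca_t in ca:
    pb += dt * (0.002 * ca_t * (1 - pb) - 0.0005 * pb)  # Ca-dependent binding
    ol += dt * (0.001 * pb * o - 0.0001 * ol)           # slow locking, slower unlocking
    g.append(o + K * ol)                                # effective Ih conductance factor

print(f"conductance factor: rest {g[0]:.3f} -> peak during bursting {max(g):.3f}")
```

The slow decay of the locked fraction after a bursting epoch plays the role of the refractory period described above.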
3.2 Modulation of the Processing of Sensory Signals: Filtering and Signal-to-Noise Ratio. Processing of sensory information often relies on preprocessing functions like filtering, contrast enhancement, and noise reduction. Many of these functions can be modulated, enabling the sensory
system to respond differently to various components of complex incoming sensory streams. In the visual domain, one example of such a function is the temporal transformation that some lateral geniculate nucleus (LGN) cells perform on their retinal input (Mukherjee & Kaplan, 1995). The experimental data show that the temporal response of these cells is variable and is related to their ability to burst. Such cells can behave either as relays, responding at the same frequency as their retinal inputs by firing tonically (in the alert/awake state), or as bandpass filters, responding optimally at frequencies of 2–8 Hz by firing in a bursting mode (in sleep states), presumably failing to transmit sensory information accurately. In a biophysical model, Mukherjee and Kaplan (1995) show that LGN cell responses can vary from low-pass, with no apparent bursting properties, to bandpass, with frequent burst discharges, depending on the value of their resting membrane potential, and provided that the low-threshold calcium T current is kept active. The authors propose that the LGN acts as a temporal filter, which can be dynamically tuned by attentional signals from the brainstem and the visual cortex, through their modulatory effects on LGN cells' resting membrane potential. In a separate connectionist model, Jackson, Marrocco, and Posner (1994) model such modulatory signals by the putative effects of NE release, expressed as a combination of self-feedback excitation and lateral inhibition. The computational role of such modulation is to achieve contrast enhancement, such that small initial differences in the incoming signal are amplified and consequently direct attention. In their model of the olfactory bulb, Linster and Gervais (1996) showed that the modulation of two families of interneurons might appreciably improve odorant signal detection. On the one hand, the modulation (increase) of lateral inhibition mediated by the periglomerular interneurons may result in the sparsification of the mitral activation patterns of complex odors, which otherwise would involve a large, undifferentiated population of mitral cells. On the other hand, under conditions when mitral cell responses are close to noise levels, a global modulation (decrease) of the inhibition mediated by glomerular interneurons may result in an enhancement of their responses. In an extension of this model, Linster and Hasselmo (1997) show that such modulation of inhibition could depend on the global activity of the mitral cells. They introduce a modulator neuron (a putative NE or ACh cell) that receives inputs from all mitral cells and that feeds back on periglomerular cells while simultaneously modulating the connection strength between granule cells and mitral cells. The modulation of inhibition in the glomerular layer ensures a constant average number of active mitral cells, irrespective of the complexity of the input patterns, while modulation of granule cell inhibition ensures a constant average mitral cell spiking probability. Together, these modulations decrease the overlap between pairs of output patterns, making discrimination between overlapping input patterns easier and more reliable.
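A minimal sketch of this normalization idea follows, assuming a single shared inhibitory gain adjusted by a modulator unit; the thresholds, learning rate, and random input patterns are illustrative choices, not elements of the published model:

```python
import numpy as np

# Sketch of activity-dependent modulation of inhibition: a modulator unit
# senses how many "mitral" units are active and adjusts a shared inhibitory
# threshold until roughly a target number stays active, regardless of input
# complexity. All constants and inputs are illustrative assumptions.
rng = np.random.default_rng(0)

def active_count(inputs, target=5, steps=200, lr=0.01):
    inh = 0.0
    act = np.zeros_like(inputs)
    for _ in range(steps):
        act = (inputs > 0.3 + inh).astype(float)   # shared inhibitory threshold
        inh += lr * (act.sum() - target)           # modulator feedback
    return int(act.sum())

sparse_odor = rng.random(50) * np.array([1.0] * 10 + [0.2] * 40)
dense_odor = rng.random(50)
print(active_count(sparse_odor), active_count(dense_odor))  # both near target
```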
Addressing similar questions in a model of olfactory processing in the honeybee, Linster and Masson (1996) showed that modulation of inhibition in the antennal lobe may serve for feature extraction of complex and fluctuating chemical signals. This modulation is expressed through the synaptic strength of inhibitory synapses, the biological basis of which has yet to be investigated experimentally. Changes of the balance between excitation and inhibition during the presentation of a stimulus allow the network to act as a short-term memory, displaying the neural activity patterns elicited by the stimulus even after its offset, compatible with experimental data (Sun, Fonta, & Masson, 1993). Expanding on this idea, Linster and Smith (1997) constructed a model of reinforcement learning in the honeybee olfactory system. In this model, modulation of lateral inhibition is introduced via an external modulatory neuron that receives reinforcement signals. This neuron makes plastic synapses onto the circuit under consideration. The authors show that such extrinsic modulation accounts for various behavioral phenomena, such as blocking, unblocking, and overshadowing. Sensory processing may also involve computations aimed at separating a sensory signal from the background noise. When seen at a system level, the modulation of the signal-to-noise ratio appears as a powerful computational tool, selectively enhancing a signal in a specific pathway while leaving it undifferentiated from noise in others. A line of modeling work has shown that the cellular mechanisms involved in the known effects of catecholamines on signal detection performance (Clark, Geffen, & Geffen, 1987a, 1987b) may be modeled by a modulation of the slope (gain) of the sigmoid function of a network of leaky-integrator neurons (Servan-Schreiber, Printz, & Cohen, 1990). Changes of this gain at the level of an individual neuron do not affect its signal-detecting capabilities, while increases of this gain in a feedforward chain of neurons augment the signal-to-noise ratio of the whole chain. The model accounts for experimental observations pertaining to the cellular effects of norepinephrine, which show that NE-mediated blockade of Iahp may result in the selective diminution of weak EPSPs and the increase of the depolarization associated with trains of EPSPs, thereby increasing the signal-to-noise ratio (Madison & Nicoll, 1986). The model is then used in a backpropagation network to model the improvement in signal detection measured experimentally in human subjects performing a continuous performance task. In this task, subjects are submitted to pharmacological challenges that release catecholamines from synaptic terminals or prevent their reuptake. In an extension of this model, Cohen and Servan-Schreiber (1992) simulate several schizophrenic deficits in selective attention and language processing assessed by tasks such as the Stroop task, the continuous performance test, and a lexical disambiguation task. They successfully show that even though these tasks are seemingly different, the deficits exhibited by schizophrenics can be understood as a general disturbance of the internal representation of contextual information. Such disturbances are implemented as a decrease in the gain of the sigmoid
function of modeled prefrontal cortex units, simulating the possible functional effects of the loss of dopaminergic modulation observed in schizophrenic patients. This theoretical work has been followed by experimental work that confirmed and refined the hypothesis advanced (Cohen, Braver, & O'Reilly, 1996). In a separate experimental and theoretical study in piriform cortex, Hasselmo and coworkers show that noradrenergic enhancement of the signal-to-noise ratio may also be due to a modulation of synaptic transmission rather than a modulation of the input-output function, as was first proposed by Servan-Schreiber et al. (1990). They found that NE, like ACh, may suppress excitatory neurotransmission at intrinsic (collateral) fibers and may also depress feedback inhibition. In a model of piriform cortex, they show that these two effects can act synergistically to increase the signal-to-noise ratio (Hasselmo, Linster, Patil, Ma, & Cekic, 1997). Finally, another interesting body of research has pointed to the role of noise itself as a means of modifying the signal-to-noise ratio (Bulsara, Jacobs, Zhou, Moss, & Kiss, 1991; Levin & Miller, 1996; Longtin, 1993; Longtin, Bulsara, Pierson, & Moss, 1994; McNamara & Wiesenfeld, 1989). To our knowledge, no explicit links to neuromodulation have yet been made.

3.3 Modulation of Memory Function. A large class of memory models is based on the assumption that memories are stored as patterns of synaptic strengths mediating the spread of activation in a network. Learning is achieved according to a synaptic modification rule (or equation) that relates synaptic strength and presynaptic and postsynaptic activities (Brown, Kairiss, & Keenan, 1990; Hasselmo, 1995; Zador, Koch, & Brown, 1990). In this framework, memory function is defined by the synaptic learning rule and the dynamics of individual neurons.

3.3.1 Modulation of the Synaptic Learning Rule. In their model of the response of dopamine neurons to reward and conditioned stimuli (Schultz et al., 1993), Montague, Dayan, and Sejnowski (1996) propose a learning rule in which the postsynaptic activity is augmented by an external reward signal of neuromodulatory origin. In addition, plasticity is made sensitive to temporal differences (Sutton & Barto, 1990) in the postsynaptic activity, rather than to the postsynaptic activity itself. This formulation of the Hebbian learning rule makes time explicit in that some synapses represent early events and others represent later ones. The authors show that after learning a delayed matching-to-sample task, dopaminergic neurons act as a temporal predictor of reward, compatible with experimental data. In this context, dopamine centers have the role of computing and sending diffuse modulatory error signals to the cortex, and hence influence its computation of action in the time domain. The same approach has been used elsewhere (Montague, Dayan, Person, & Sejnowski, 1995) to show how an identified interneuron in the honeybee brain, VUMmx1, could predict reward values
of spatial location during foraging. In this model, VUMmx1 cells influence flight in a manner that accounts for the previous learning of the landscape and its rewarding regions. A similar implementation of VUMmx1 modulation can be found elsewhere (Linster & Smith, 1997). One of the problems with most learning-rule-based neural models of memory function is the fact that learning and recall may interfere in undesirable ways. Unless care is taken to prevent this, the presentation of a new pattern during learning may elicit an erroneous response from the network. This spurious activity perturbs (if not prevents) learning. In a series of experimental and theoretical studies, Hasselmo and coworkers have shown how selective cholinergic modulation of some synapses, but not others, might provide an elegant solution to this problem. Experimental data from field recordings in the piriform cortex suggest that cholinergic, noradrenergic, and GABAergic modulation might selectively suppress intrinsic but not afferent excitatory synaptic transmission in the piriform cortex (Hasselmo & Bower, 1992; Hasselmo et al., 1997; Tang & Hasselmo, 1994). In a mathematical model of associative memory, Hasselmo (1993) shows that this selective suppression may prevent previously learned patterns from interfering with the storage of new patterns, especially when previous and new patterns are coded by overlapping populations of neurons (Hasselmo, 1993). This modulation is expressed as a decrease in glutamate release in the activation rule, coupled with a rescaling of the learning rate in the learning rule. In further experimental and theoretical studies, Barkai and Hasselmo (1994) present a detailed biophysical model of a single pyramidal cell in piriform cortex. They show that in addition to its effects on synaptic transmission observed with field potentials, intracellular recordings show that cholinergic modulation of single cells also results in the suppression of neuronal adaptation and in marked depolarization from resting potential. Their single-cell model shows these effects as changes in the maximal conductance of two potassium currents. These results lead to a detailed model of autoassociative memory in the piriform cortex, including 240 pyramidal cells as well as feedforward and feedback interneurons (Barkai et al., 1994). Results from intracellular recordings (suppression of neuronal adaptation and depolarization) and field recordings (suppression of intrinsic synaptic transmission) are included in the model. During learning, the overall effects of cholinergic modulation are to enhance pyramidal cell activity, increasing learning performance. After learning, cholinergic modulation is suppressed and sets the stage for recall. ACh therefore ensures that learning and recall do not interfere and controls the computations of the network. 3.3.2 Modulation of Neural Dynamics. In a large associative network of Fitzhugh-Nagumo-like cells, Abbott (1990) shows that a simple modulation (of putative neuromodulatory origin) of the dynamics of the slow variable (see the appendix) may switch the network from implementing a nonselective short-term latching memory to behaving as a long-term associative
memory. This change in mode of operation of the network increases its computational capabilities without changing its learning rule or architecture. Repetitive firing can also be the result of intrinsic cellular properties such as cholinergically or serotonergically induced afterdepolarization. Models of associative memory based on this phenomenon have shown that repetitive firing can be temporally organized into nested theta and gamma oscillations in order to learn and maintain several memory items active in a short-term memory buffer (Jensen, Idiart, & Lisman, 1996; Lisman & Idiart, 1995). Building on their work in the piriform cortex, Hasselmo and Schnell (1994) show that the dynamics of learning and recall in the hippocampus can also be regulated by overall network activity. In their model of hippocampal layers CA1 and CA3, the total activity of CA1 pyramidal cells feeds back to the cholinergic system (presumably in the septum) and regulates cholinergic neuromodulation. This model involves a closed and autonomous system that has a clear function and in which neuromodulation is regulated by its target. The septum modulates the function of the hippocampus, which in return regulates the septum in a diffuse, activity-dependent manner. These ideas have been incorporated in a model of corticohippocampal classical eye-blink conditioning (Gluck & Myers, 1993) as a septally driven modification of the learning rate of the hippocampus autoassociative module (Myers et al., 1996). In this model, septal neuromodulation controls the relative amount of time spent by the hippocampus in learning new stimuli and the time necessary to transfer information to neocortical regions. Finally, in a model of hierarchical associative memory, Cartling (1996) shows that different levels of coupling between activity and excitability may change the dynamics of memory recall. In a Hopfield-like architecture, activity may be chaotic (memories fail to be retrieved), oscillatory (memories are retrieved cyclically, one after the other), or tonic (only one memory item is eventually retrieved) as the coupling is decreased. Neuromodulation is implemented as a change in the shape of the sigmoid transfer function linking membrane potential to firing rate. This change is regulated by overall network activity and depends on intracellular calcium concentrations. However, while some experimental and theoretical work shows that a decrease of cholinergic modulation is associated with stable network dynamics (Hasselmo & Schnell, 1994), this model assumes that an increase of cholinergic modulation yields stable states. 3.4 Neuromodulation for Input Selection and Information Integration. In complex neural networks, information flows along many divergent routes. Much experimental and theoretical work has assigned to neuromodulation the role of selecting the input to particular neural systems, thereby controlling the flow of information. Neuromodulation can act as a routing mechanism and control whether synaptic inputs will activate a particular circuit. The general flow of information between functionally distinct circuits is therefore determined by their modulatory state. Neuromodulation
can also act within a circuit to control what subsets of the available information will be processed. At the single-cell level, the combined actions of different neuromodulatory systems on cellular or synaptic mechanisms may determine whether the cell will be responsive to a given pattern of synaptic stimulation, therefore enabling or disabling processing. In Aplysia, for example, while both DA and 5-HT silence the bursting neuron R15, only the serotonergic modulation will allow brief depolarization to elicit a sustained bursting response. A modeling study of this system has proposed that the underlying mechanism is rooted in the modulation by DA and 5-HT of two distinct currents (Butera et al., 1995). The authors show that DA prevents input signals from eliciting R15 firing, while 5-HT enhances its response, effectively amplifying synaptic inputs. Together these two neuromodulatory systems control when input signals to R15 may be forwarded to later processing stages. Similarly, at the network level in the vertebrate, experimental and theoretical evidence suggests that ACh levels, together with other neuromodulatory systems, may control the flow of sensory information through the thalamus to the cortex (see sections 3.1.2 and 3.2). Modeling studies in piriform cortex and hippocampus show that neuromodulation within a circuit may control the nature of the information processed. In a series of experimental and modeling studies (see section 3.2), it was shown that selective cholinergic (Hasselmo & Bower, 1992), noradrenergic (Hasselmo et al., 1997), or GABAergic (Tang & Hasselmo, 1994) suppression of intrinsic (recurrent) but not extrinsic (sensory) inputs promotes learning, while the absence of such suppression allows for memory recall. In this system, the selection of the information that is processed therefore depends on a rich class of neuromodulatory conditions, itself related to the behavioral state of the animal. Finally, modulation of the signal-to-noise ratio (see section 3.2) can also be considered a form of input selection. By selectively enhancing certain neural inputs (the signal) and decreasing others (the noise), the system makes a de facto selection, which may change with neuromodulatory and behavioral conditions. This observation is at the basis of several models of selective attention involving the noradrenergic locus coeruleus (Aston-Jones et al., 1994; Rajkowski et al., 1994; Usher, Cohen, Servan-Schreiber, Rajkowski, & Aston-Jones, 1995) and of the DA-mediated control of cognitive processing in the prefrontal cortex and its relation to schizophrenia (Cohen et al., 1996).

4 Conclusion: Neuromodulation Increases and Controls Complexity

Our review has shown that neuromodulation may play a significant computational role in a large spectrum of systems, from invertebrate central pattern generators to vertebrate cortical memory networks. In all cases, neuromodulation appears to be a powerful tool for increasing and/or controlling the
computational complexity of a given network, without necessarily increasing the structural or dynamical complexity of the network itself. Spatially diffuse and slow neuromodulation of current conductances may trigger drastic changes of rhythmic patterns in central pattern generators, as well as in the thalamus, probably changing the nature of the downstream computations and increasing the complexity of the computations achieved by the whole circuit. Spatially selective and phasic neuromodulatory controls of specific neuronal input pathways help complex recurrent memory networks function properly. Our review has also revealed two major limitations to the study of neuromodulation. Overcoming them requires the design of new theoretical and experimental tools, which undoubtedly will be beneficial. The first stems from the observation that most modeling studies reviewed consider neuromodulation as an enhancing addition to a basic model. Often it is reduced to ad hoc parameter variations. We believe, however, that such an approach will no longer suffice as efforts are made to make computational models more biologically plausible in both their design and their function. Neuromodulation should be an integral part of the models. Only then will comprehensive theories of neuromodulation emerge and new neural computational principles be discovered. Second, in actual biological systems, neuromodulation has multiple simultaneous or sequential (cascade) effects on neural information processing. However, their experimental study almost always consists of individual modulations, keeping others constant. Moreover, in most cases, neuromodulation is present or absent and is rarely studied as a continuous phenomenon. It is not generally known whether the effects of different kinds of neuromodulation are truly independent, and if not, how they interact, nor is it known whether various levels of a single neuromodulation may yield drastically different neural behaviors. While some models propose interesting ways in which various modulatory phenomena might coexist, most of the models reviewed here still assume that multiple neuromodulatory effects are independent. As first proposed elsewhere (Harris-Warrick & Marder, 1991; Marder, Hooper, & Eisen, 1987), it is likely that accounting for the simultaneous effects of several neuroactive substances on a single network may increase its computational complexity in relevant and interesting ways, giving further insight into its function in the larger context of behavior. Overall, computational and experimental models of neuromodulation appear to be powerful tools for the understanding of the computation of single cells as well as large neural networks.

Appendix: The Mathematical Tools

Uppercase letters are constants unless otherwise noted; lowercase letters are variables. The appendix is organized by levels of modeling, from more detailed to more abstract.
A.1 Markovian Chemical Kinetics Models.

$$i = \bar{G}\,s_{o_k}\,(v - E),$$

with channel states interconverting according to

$$S_i \underset{R_{ji}}{\overset{R_{ij}}{\rightleftharpoons}} S_j \qquad\text{and}\qquad \frac{ds_i}{dt} = \sum_j R_{ji}\,s_j - s_i \sum_j R_{ij}.$$
R are rate constants, and s are concentrations (fractions of channels in state S); $s_{o_k}$ is an open state. At the most elementary level of modeling, neuronal processes can be described as chemical reactions, provided that their kinetics are quantitatively determined. In this framework, neuromodulatory phenomena are not distinguishable from others. Destexhe et al. (1994a) express intracellular phenomena, membrane mechanisms, synaptic transmission, and neuromodulation with a single set of kinetic equations. In the model proposed, the neuromodulation by second messenger G-protein-gated K+ channels (GABAB, 5HT, M2 (ACh), α2 (NE), D2 (DA), histamine, opioid, and somatostatin receptors) is expressed by the appropriate formulation of rate constants, of the type $R_{ij} = R_{ij}(v) = A_{ij}\,e^{-v/B_{ij}}$ for voltage-dependent gating and $R_{ij} = [L]\,\bar{R}_{ij}$ for ligand-activated gating. Using a simplified formulation of this model, Destexhe et al. (1994a) model the putative role of NE and 5-HT in modulating rhythmic activity in thalamic reticular cells. G-protein activation is taken as a consequence of both NE and 5-HT neuromodulation. It is implemented as a modulating factor to the activation dynamics of a leak potassium current according to $g_{Kleak} = \bar{G}_{Kleak}\cdot m$, with $\frac{dm}{dt} = K[S]\,m - K'(1 - m)$, where [S] represents the concentration of second messenger present in the cell.
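As a concrete illustration of this kinetic formalism, the sketch below computes the steady-state open fraction of a two-state channel whose closing rate is ligand scaled. The rate-constant forms follow the expressions above, but all parameter values and ligand concentrations are illustrative assumptions:

```python
import numpy as np

# Two-state channel C <-> O in the kinetic formalism above: a
# voltage-dependent opening rate R(v) = A e^(-v/B) and a ligand-activated
# closing rate [L]*Rbar standing in for neuromodulation. Parameters are
# illustrative assumptions.
def rates(v, L, A=1.0, B=20.0, Rbar=0.5):
    r_open = A * np.exp(-v / B)    # voltage-dependent gating
    r_close = L * Rbar             # ligand-activated gating
    return r_open, r_close

def steady_open_fraction(v, L):
    # ds_O/dt = r_open*s_C - r_close*s_O with s_C + s_O = 1, set to zero
    r_open, r_close = rates(v, L)
    return r_open / (r_open + r_close)

for L in (0.1, 1.0, 10.0):         # increasing modulator concentration
    print(L, round(steady_open_fraction(-60.0, L), 3))
```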
A.2 Hodgkin-Huxley Models. For a multicompartment model (x indexes compartments):

$$C\,\frac{dv}{dt} = \sum i + \frac{E_{leak} - v}{R} + \sum_x \frac{v_x - v}{R_a} + i_{syn} + I_{inject},$$

$$i = \bar{G}\,m^A n^B\,(v - E)$$

with

$$\frac{dm}{dt} = \frac{L_m^{\infty}(v) - m}{\tau_m(v)} \qquad\text{and}\qquad \frac{dn}{dt} = \frac{L_n^{\infty}(v) - n}{\tau_n(v)}.$$
m and n are activation and inactivation variables, respectively. Synaptic potentials, when present, are modeled by

$$i_{syn} = g_{syn}\,(v - E_{syn}) \qquad\text{with}\qquad g_{syn} = W\,\bar{G}_{syn}\,\frac{\tau_1}{\tau_2 - \tau_1}\left(e^{-t/\tau_2} - e^{-t/\tau_1}\right),$$

where W is the synaptic weight.
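For illustration, the following sketch evaluates this dual-exponential conductance with a multiplicative factor standing in for neuromodulatory scaling of the maximal conductance; all parameter values are illustrative assumptions:

```python
import numpy as np

# Dual-exponential synaptic conductance from the expression above, with a
# factor `mod` scaling the maximal conductance Gbar_syn -- one common way
# the reviewed models implement modulation of synaptic transmission.
# Parameter values are illustrative assumptions.
def g_syn(t, W=1.0, Gbar=1.0, tau1=2.0, tau2=10.0, mod=1.0):
    return mod * W * Gbar * (tau1 / (tau2 - tau1)) * (np.exp(-t / tau2) - np.exp(-t / tau1))

t = np.linspace(0.0, 60.0, 601)            # ms
for mod in (1.0, 0.5):                     # e.g., suppression of transmission
    print(f"mod={mod}: peak g_syn = {g_syn(t, mod=mod).max():.3f}")
```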
In the Hodgkin-Huxley formalism, neuromodulation is often expressed as a change in the maximal conductance of some particular membrane currents. At this level of modeling, it also may be implemented as a variation of the dynamics of some currents, variations in intracellular concentrations of some substances, and variation in synaptic transmission. When the actual pharmacology of the channels is known, it is possible to express the conductances as functions of other intracellular quantities, including concentrations of neuromodulatory substances. For example, Butera et al. (1995) propose a scheme of interaction between dopamine and serotonin that yields expressions for conductances of the type: ! ¶ µ µ ¶ Ã KDA K K0 ¯ ¯ × × 1+ G = G1 [cAMP]−K00 [Ca]i + K [DA] + KDA 1 + e− D00 Ã ! Ã ! 0 v − E K × 1+ G¯ = G¯ 1 ZF(v−E0 ) [cAMP]−K00 1 + e RT 1 + e− D0 µ ¶ [5HT] [cAMP] d[cAMP] 0 =K 1+K . +C dt [5HT] + K00 [cAMP] + K000 In some cases it is possible to obtain only an experimental curve quantitatively measuring the influence of a modulatory substance on given conductances. Bertram (1993) models two serotonin-sensitive conductances using a fit to their experimental values. The fit chosen takes the form: ¯ G¯ = G(s) = A+
B 1+
e−C(Ds−F)
with s ∈ [0, 1],
where s represents the concentration of serotonin applied. A similar formulation is used to describe the influence of two neuromodulatory substances (small cardioactive peptides and myomodulins) on several currents in invertebrate neuromuscular circuits (Brezina et al., 1996). In other cases, maximal conductances can be made dynamically dependent on intracellular quantities such as calcium (LeMasson et al., 1993), with

$$\tau\,\frac{d\bar{G}}{dt} = f([Ca]) - \bar{G} \qquad\text{and}\qquad f([Ca]) = \frac{G_{max}}{1 + e^{\pm([Ca] - C_T)}}.$$
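The sketch below integrates this regulation rule (fixing the sign in the exponent to +); the time constant, set point, and imposed calcium trace are illustrative assumptions:

```python
import numpy as np

# Activity-dependent regulation of a maximal conductance (after the
# equations above): gbar relaxes toward a sigmoidal target set by
# intracellular calcium, so a switch to sustained activity (high [Ca])
# slowly reshapes the conductance. All constants are illustrative.
def f_ca(ca, gmax=10.0, c_t=0.5):
    return gmax / (1.0 + np.exp(ca - c_t))

tau, dt = 1000.0, 1.0        # slow regulation (ms)
gbar = 5.0
for step in range(20000):
    ca = 0.2 if step < 10000 else 0.9    # hypothetical quiet -> active switch
    gbar += dt / tau * (f_ca(ca) - gbar)
print(f"gbar after activity: {gbar:.2f} (quiet target was {f_ca(0.2):.2f})")
```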
Barkai et al. (1994) and Barkai and Hasselmo (1994) have access only to a qualitative experimental description of the effects of two potassium conductances on the firing adaptation of cortical cells. They therefore model these effects by choosing two parameter sets that yield adapting or weakly adapting model cells:

$$(\bar{G}, R) \in \{(\bar{G}_1, R_1), (\bar{G}_2, R_2)\}.$$
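The two parameterizations just described can be contrasted in a few lines; the constants below are illustrative assumptions, not values from the original studies:

```python
import numpy as np

# A continuous sigmoidal dose-conductance fit Gbar(s) (Bertram, 1993) next
# to a discrete choice between parameter sets yielding adapting or weakly
# adapting cells (Barkai et al., 1994). All numbers are illustrative.
def gbar_fit(s, A=0.2, B=1.0, C=8.0, D=1.0, F=0.5):
    return A + B / (1.0 + np.exp(-C * (D * s - F)))

PARAM_SETS = {                    # hypothetical (Gbar, R) pairs
    "adapting": (2.0, 0.5),
    "weakly_adapting": (0.5, 0.9),
}

for s in (0.0, 0.5, 1.0):         # normalized serotonin concentration
    print(f"s={s:.1f}: Gbar = {gbar_fit(s):.3f}")
print("adapting cell parameters:", PARAM_SETS["adapting"])
```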
In a different system, Epstein and Marder (1990) consider intermediate values, extrapolated linearly from the experimental ones following

$$\bar{G} = \bar{G}(\alpha) = \alpha\,\bar{G}_1 + (1 - \alpha)\,\bar{G}_2,$$

where α is a dimensionless parameter. When values for maximal conductances are not accessible experimentally, a theoretical search might be fruitful. In some cases, the set of conductances under neuromodulatory influence is known or hypothesized, and the dynamics of the network are under investigation. Dynamical systems theory (Guckenheimer et al., 1993) maps conductance values to possible network dynamics. The study of their stability leads to experimental predictions about conductance values and their effects. In other cases, the dynamics of the network are known, but the set of conductances under neuromodulatory influence is unknown. Exhaustive search (Bhalla & Bower, 1993; Foster et al., 1993) allows for a systematic exploration of the parameter space constituted by all the maximal conductances hypothesized to be functional. Some regions of this space yield the dynamics under study. The location and shape of these regions predict what conductances are likely to be important (i.e., under neuromodulatory control) and what their possible values are. Neuromodulation can also be expressed as a change in the dynamics (rather than the maximal conductance) of some particular membrane currents. Such is the case of a variation in an inactivation time constant (Mukherjee & Kaplan, 1995), such as

$$\tau_n(v) = \bar{T}_n\,\bar{\tau}_n(v) \qquad\text{with } \bar{T}_n \in \left[\bar{T}_n^{min}, \bar{T}_n^{max}\right],$$

or of a variation in the voltage dependence of the steady-state activation curve $L_m^{\infty}(v)$, as for Ih (Destexhe et al., 1996; Golowasch et al., 1992). Neuromodulation can also be expressed through changes in the intracellular concentration of some substances such as cAMP (Raymond et al., 1992) rather than as changes in the maximal conductance of some membrane current. Finally, neuromodulation can be expressed at the level of synaptic transmission. Such is the case for the presynaptic modulation of synaptic transmission by the activation of GABAB receptors (Wallenstein & Hasselmo, 1997b). In this model, the concentration of [GABA]o in the synaptic cleft is first calculated as a function of the number of local active inhibitory synapses (npre) and a local diffusion term, leading to

$$\frac{d[GABA]_o}{dt} = C\,n_{pre} - D\,[GABA]_o,$$

where C and D are constants. At any point in time, [GABA]o is then used to decrease synaptic currents, with $i_{syn} \leftarrow i_{syn} - A\,[GABA]_o$, where A is a constant.
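A minimal numerical sketch of this presynaptic modulation follows; the constants and the activity trace are illustrative assumptions:

```python
# Presynaptic GABA_B modulation as above: extracellular [GABA]_o tracks the
# number of active inhibitory synapses and is then used to reduce synaptic
# currents. Constants and the activity trace are illustrative assumptions.
C, D, A, dt = 0.5, 0.1, 0.3, 1.0
gaba_o, trace = 0.0, []
for step in range(200):
    n_pre = 4 if 50 <= step < 120 else 0      # hypothetical inhibitory volley
    gaba_o += dt * (C * n_pre - D * gaba_o)   # d[GABA]_o/dt = C*n_pre - D*[GABA]_o
    i_syn = max(10.0 - A * gaba_o, 0.0)       # suppressed synaptic current
    trace.append(i_syn)
print(f"i_syn: baseline {trace[0]:.1f}, most suppressed {min(trace):.1f}")
```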
Other models have viewed modulation as a signal influencing synaptic mechanisms. Such is the case of the reward signal entering the weight modification rule between VTA and cortex (Montague et al., 1996), or the direct change of synaptic efficacy triggered by an external center (Linster & Hasselmo, 1997; Linster & Smith, 1997; Raymond et al., 1992).

A.3 Fitzhugh-Nagumo Models.

$$C\,\frac{dv}{dt} = f(v) - w + I_{inject}$$

$$\tau\,\frac{dw}{dt} = v - Dw$$
v is the fast (voltage-like, C small) variable; w is the slow (recovery-like) variable. In this simplified framework (as for Bonhoeffer-van der Pol or Morris-Lecar systems), individual concentrations and current conductances are not accessible, and fast Hodgkin-Huxley-type timescales are relaxed to pseudo-steady-state values. Neuronal behavior is assessed macroscopically through overall activity. In a model of associative learning, Abbott (1990) proposes that neuromodulation may serve as a mechanism for initiating and terminating learning. Using the following formulation for the slow variable,

$$\frac{dw}{dt} = a\,v - (1 - a)\,w,$$

he shows that depending on the value of a and the strength of the external inputs (Iinject), single cells may settle in regions of hyperpolarization, depolarization, oscillations, or bistability. At the network level, for values of a yielding oscillation, the network behaves like an associative memory (phase-locked oscillations, patterned according to synaptic coupling). For values of a yielding bistability, a putative consequence of neuromodulation, the network behaves like a nonselective latching short-term memory, maintaining the activity elicited initially by the input pattern and allowing Hebbian plasticity to take place. Interesting approaches to neuromodulation have also focused on the role of noise. Longtin (1993) uses stochastic resonance theory to show that the introduction of noise can have modulatory effects on the signal-to-noise ratio of a neuronal system, measured on the basis of the transfer of the oscillatory inputs to the output. The formulation used to illustrate this point introduces noise in v and a periodic forcing on w:

$$C\,\frac{dv}{dt} = v(v - A)(1 - v) - w + I_{inject} + \xi(t)$$

$$\tau\,\frac{dw}{dt} = v - Dw - [B + R\sin(\omega t)],$$
where ξ(t) is a white noise (gaussian distributed) function, and B + R sin(ωt) is a subthreshold oscillatory forcing. Experimental evidence has recently been found in support of the role of noise in improving information processing (Levin & Miller, 1996). In this framework, neuromodulation can also be expressed as a change of electrical coupling between two cells, as in the STG (Kepler et al., 1990). It can be expressed as a shared current following

$$I_{inject} = W(v_f - v) \qquad\text{and}\qquad f(v) = Gv - (G + \alpha)\bar{V},$$

$$C_f\,\frac{dv_f}{dt} = -G_f\,(v_f - \bar{V}_f) + W(v - v_f).$$

We should also mention attempts at modeling NE-mediated decreases in K+ current and their effects on the oscillatory behavior of a small thalamocortical model (Wallenstein, 1993). The effects were modeled by current injection in a Bonhoeffer-van der Pol modeling framework.
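The role of the parameter a can be illustrated with a short simulation; the cubic nonlinearity, capacitance, input, and the particular values of a are illustrative assumptions rather than the original model's:

```python
import numpy as np

# Fitzhugh-Nagumo-style cell whose slow variable follows
# dw/dt = a*v - (1-a)*w, with `a` standing in for a neuromodulator level
# (after the Abbott formulation above). Parameters are illustrative.
def late_v_range(a, steps=20000, dt=0.002, C=0.05, I=0.3):
    v, w, vs = 0.0, 0.0, []
    for _ in range(steps):
        v += dt / C * (v * (v - 0.2) * (1.0 - v) - w + I)  # fast variable
        w += dt * (a * v - (1.0 - a) * w)                  # modulated slow variable
        vs.append(v)
    tail = np.array(vs[steps // 2:])
    return tail.max() - tail.min()   # ~0 when the cell settles, large if oscillating

for a in (0.1, 0.5):
    print(f"a={a}: late v range = {late_v_range(a):.3f}")
```

With these illustrative numbers, one value of a lets the cell settle to a depolarized plateau while the other yields sustained oscillations, echoing the switch in operating mode described above.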
A.4 Leaky Integrator Models.

$$\tau\,\frac{dv}{dt} = -v + \sum_i W_i\,S(v_i) + I_{inject},$$

where S( ) is usually nonlinear (the sigmoid function), and v is the average membrane potential. This representation allows for qualitative descriptions of the overall effects of average pools of neurons on behavior. Neuromodulation can be expressed by a change in firing threshold or a significant modification of synaptic weights (Linster & Masson, 1996), with

$$v \leq \theta_{min} \Rightarrow S(v) = 0, \qquad v \geq \theta_{max} \Rightarrow S(v) = 1, \qquad \theta_{min} \leq v \leq \theta_{max} \Rightarrow S(v) = \alpha v.$$

Neuromodulation can also be expressed by introducing a multiplicative factor to the upper and lower bounds of the sigmoid function or by decreasing weights by a fraction (Liljenstrom & Hasselmo, 1995):

$$(\theta_{min}, \theta_{max}) \Rightarrow (\beta\theta_{min}, \beta\theta_{max}) \qquad\text{or}\qquad W \Rightarrow \frac{W}{\gamma}.$$
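A short sketch of a single leaky-integrator unit with both modulation styles, threshold rescaling by β and weight division by γ, follows; the weights, inputs, and constants are illustrative assumptions:

```python
import numpy as np

# Leaky-integrator unit with a piecewise-linear transfer function whose
# thresholds are rescaled by beta, and with weights divided by gamma (the
# two modulation styles above). All numbers are illustrative assumptions.
def S(v, theta_min=0.2, theta_max=1.2, beta=1.0):
    lo, hi = beta * theta_min, beta * theta_max
    return np.clip((v - lo) / (hi - lo), 0.0, 1.0)

def settle(beta=1.0, gamma=1.0, tau=10.0, dt=0.1, steps=2000):
    W = np.array([0.6, 0.9, -0.4]) / gamma      # modulated weights
    pre = np.array([1.0, 0.8, 0.5])             # fixed presynaptic activities
    v = 0.0
    for _ in range(steps):
        v += dt / tau * (-v + W @ S(pre, beta=beta) + 0.3)   # I_inject = 0.3
    return v

print(f"baseline {settle():.2f}, beta=0.5 {settle(beta=0.5):.2f}, gamma=2 {settle(gamma=2.0):.2f}")
```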
It can also be expressed as a dependence of the sigmoid function on other quantities such as the intracellular calcium concentration. Cartling (1996) models neuromodulation as a change in neuronal adaptability (coupling between activity and excitability). It is expressed as a multiplicative factor (a) applied to the intracellular Ca concentration (c), with activity-dependent second-order dynamics:

$$S(v) = S(v, c) = \max(\tanh(Av - ac - \theta), 0), \qquad\text{with}\qquad \frac{dc}{dt} = \frac{K}{K' + c}\,S(v, c) + \frac{K'' - c}{T},$$
where c is the intracellular calcium concentration and a is the adaptability. Neuromodulation is measured by a, which depends on the total activity of the network:

$$a = A_{max}(1 - n) \qquad\text{with}\qquad \frac{dn}{dt} = C(1 - n)\sum_i \alpha_i v_i.$$
αi is the size of the population having vi as its state variable.

A.5 Connectionist Models.

$$o = \sum_i W_i\,S(o_i).$$

In the case of modifiable synapses,

$$\frac{dw}{dt} = \eta\,x(t)\,y(t),$$

where x(t) represents the presynaptic activity and y(t) represents the postsynaptic activity. S( ) is analogous to the sigmoid function of leaky integrator models. Neuromodulation in connectionist models has been expressed in two general ways. The first expresses neuromodulation in the sigmoid function, the second in the dynamics of the synaptic weights. Neuromodulation can be implemented as a modification of the slope (gain) of the sigmoid function (Cohen & Servan-Schreiber, 1992; Servan-Schreiber et al., 1990) in a small network (chain) of connectionist elements, following

$$S(o) = \frac{1}{1 + e^{-(Go + B)}} \qquad\text{with } G \in [G_{min}, G_{max}].$$
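The chain effect can be illustrated numerically: the sketch below propagates two noisy signal levels through a chain of sigmoid units sharing a gain G and reports a discriminability index. Chain length, noise level, and gain values are illustrative assumptions, not the original model's parameters:

```python
import numpy as np

# Gain modulation in a feedforward chain: the same two signal levels are
# passed through layered sigmoid units with fresh noise at each layer; the
# shared gain G stands in for catecholamine level. All numbers are
# illustrative assumptions.
rng = np.random.default_rng(1)

def chain_output(signal, G, layers=5, noise=0.3, trials=4000):
    out = np.full(trials, float(signal))
    for _ in range(layers):
        out = 1.0 / (1.0 + np.exp(-G * (out - 0.5 + rng.normal(0.0, noise, trials))))
    return out

for G in (0.5, 2.0, 4.0):
    hi, lo = chain_output(1.0, G), chain_output(0.0, G)
    d = (hi.mean() - lo.mean()) / np.sqrt(0.5 * (hi.var() + lo.var()))
    print(f"G={G}: discriminability ~ {d:.2f}")
```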
Other modifications to the sigmoid function can be made to express other neuromodulatory properties, such as suppression of neuronal adaptation (Liljenstrom & Hasselmo, 1995). Using a nonmodulated sigmoid function,

$$S(o_i) = C\,Q_i\left(1 - e^{-(o_i + 1)/Q_i}\right),$$

activity-dependent neuromodulation is expressed as

$$S(o_i) = C\,Q_i\left(1 - e^{-(o_i + 1)/Q_i}\right)\times e^{-\alpha\langle S(o_i)\rangle^2_{t-T}},$$

where T is a fixed time window and t is time. A second modeling approach consists of introducing neuromodulatory effects into the learning rule. Montague et al. (1996) modeled the activity of
dopamine cells in the VTA by augmenting the postsynaptic activity with an external reward signal:

$$S(o) = o, \qquad \Delta w = \eta\,x(t)\,(r(t) + y(t)),$$

where r(t) is the external reward signal. Hasselmo (1993) selectively modifies the learning rate of certain synapses to include the effects of ACh with

$$\frac{dw}{dt} = \eta\,(1 - c)\,x(t)\,y(t) \qquad\text{with } c \in [0, 1],$$

where c measures the amount of cholinergic suppression. Myers et al. (1996) adopt a similar approach in their model of cholinergic influence on corticohippocampal interaction during eye-blink conditioning. Both sigmoid and synaptic modulation can coexist and have been modeled by Hasselmo and Schnell (1994). The synaptic modulation is expressed using

$$W_i = (1 - c\,C_{W_i})\,w_i \qquad\text{and}\qquad \frac{dw_i}{dt} = \eta\,(1 + c\,C_\eta - C_\eta)\,x(t)\,y(t)$$

and replacing θ with $(1 - c\,T_{max})\,\theta$ in the normal ramplike sigmoid function:

$$o \leq \theta \Rightarrow S(o) = 0, \qquad o \geq \theta \Rightarrow S(o) = o - \theta.$$

Finally, we should mention the modeling of morphological changes in Alzheimer's disease (Horn, Ruppin, Usher, & Herrmann, 1993), expressing modulation of activity by random synaptic deletion and appropriate compensation with

$$o = c\sum_{i\in\Delta} W_i\,S(o_i) \qquad\text{with } |\Delta| = (1 - d)N \quad\text{and}\quad S(o) = \mathrm{Step}(o - T).$$
c is the compensation factor, d is the deletion factor, and

$$c = 1 + k\,\frac{d}{1 - d} \qquad\text{with } k \in [0, 1].$$
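As a closing illustration, the two learning-rule modulations above can be contrasted directly; activities, learning rate, and modulation levels are illustrative assumptions:

```python
# Two learning-rule modulations from the text: a reward term added to the
# postsynaptic activity (Montague et al., 1996) and a cholinergic factor
# (1 - c) scaling the learning rate (Hasselmo, 1993). Values are
# illustrative assumptions.
eta = 0.1
x, y = 1.0, 0.5            # pre- and postsynaptic activities

for r in (0.0, 1.0):       # reward-modulated Hebbian update
    print(f"reward r={r}: dw = {eta * x * (r + y):.3f}")

for c in (0.0, 0.5, 1.0):  # cholinergic suppression of learning
    print(f"suppression c={c}: dw = {eta * (1 - c) * x * y:.3f}")
```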
Acknowledgments

We thank Larry Abbott, Mike Hasselmo, John Lisman, Eve Marder, and Akaysha Tang for their critical reading of the manuscript for this article.

References

Abbott, L. F. (1990). Modulation of function and gated learning in a network memory. Proc Natl Acad Sci USA, 87(23), 9241–9245.
Abrams, T. W., & Kandel, E. R. (1988). Is contiguity detection in classical conditioning a system or a cellular property? Learning in Aplysia suggests a possible molecular site. Trends Neurosci, 11(4), 128–135. Adams, W. B., & Benson, J. A. (1985). The generation and modulation of endogenous rhythmicity in the Aplysia bursting pacemaker neurone R15. Prog Biophys Mol Biol, 46(1), 1–49. Aston-Jones, G., Rajkowski, J., Kubiak, P., & Alexinsky, T. (1994). Locus coeruleus neurons in monkey are selectively activated by attended cues in a vigilance task. J Neurosci, 14(7), 4467–4480. Barkai, E., Bergman, R. E., Horwitz, G., & Hasselmo, M. E. (1994). Modulation of associative memory function in a biophysical simulation of rat piriform cortex. J Neurophysiol, 72(2), 659–677. Barkai, E., & Hasselmo, M. E. (1994). Modulation of the input/output function of rat piriform cortex pyramidal cells. J Neurophysiol, 72(2), 644–658. Bertram, R. (1993). A computational study of the effects of serotonin on a molluscan burster neuron. Biological Cybernetics, 69, 257–267. Bertram, R. (1994). Reduced-system analysis of the effects of serotonin on a molluscan burster neuron. Biol Cybern, 70(4), 359–368. Bhalla, U. S., & Bower, J. M. (1993). Exploring parameter space in detailed single neuron models: Simulations of the mitral and granule cells of the olfactory bulb. J Neurophysiol, 69(6), 1948–1965. Biedenbach, M. A. (1966). Effects of anesthetics and cholinergic drugs on prepyriform electrical activity in cats. Exp Neurol, 16(4), 464–479. Bressler, S. L., & Freeman, W. J. (1980). Frequency analysis of olfactory system EEG in cat, rabbit, and rat. Electroencephalogr Clin Neurophysiol, 50(1–2), 19–24. Brezina, V., Orekhova, I. V., & Weiss, K. R. (1996). Functional uncoupling of linked neurotransmitter effects by combinatorial convergence. Science, 273(5276), 806–810. Brown, T. H., Kairiss, E. W., & Keenan, C. L. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annu Rev Neurosci, 13, 475–511. Buchholtz, F., Golowasch, J., Epstein, I. R., & Marder, E. (1992). Mathematical model of an identified stomatogastric ganglion neuron. J Neurophysiol, 67(2), 332–340. Bulsara, A., Jacobs, E. W., Zhou, T., Moss, F., & Kiss, L. (1991). Stochastic resonance in a single neuron model: Theory and analog simulation. J Theor Biol, 152(4), 531–555. Butera, R. J., Jr., Clark, J. W., Jr., Canavier, C. C., Baxter, D. A., & Byrne, J. H. (1995). Analysis of the effects of modulatory agents on a modeled bursting neuron: Dynamic interactions between voltage and calcium dependent systems [published erratum appears in J Comput Neurosci 1996 Sep;3(3):265]. J Comput Neurosci, 2(1), 19–44. Byrne, J. H. (1987). Cellular analysis of associative learning. Physiol Rev, 67(2), 329–439. Calabrese, R. L., & De Schutter, E. (1992). Motor-pattern-generating networks in invertebrates: Modeling our way toward understanding. Trends Neurosci, 15(11), 439–445.
Cartling, B. (1996). Dynamics control of semantic processes in a hierarchical associative memory. Biol Cybern, 74(1), 63–71. Chan-Palay, V., & Palay, S. L. (Eds.). (1984). Coexistence of neuroactive substances in neurons. New York: Wiley. Clark, C. R., Geffen, G. M., & Geffen, L. B. (1987a). Catecholamines and attention. I: Animal and clinical studies. Neurosci Biobehav Rev, 11(4), 341–352. Clark, C. R., Geffen, G. M., & Geffen, L. B. (1987b). Catecholamines and attention. II: Pharmacological studies in normal humans. Neurosci Biobehav Rev, 11(4), 353–364. Cohen, J. D., Braver, T. S., & O’Reilly, R. C. (1996). A computational approach to prefrontal cortex, cognitive control and schizophrenia: Recent developments and current challenges. Philos Trans R Soc Lond B Biol Sci, 351(1346), 1515–1527. Cohen, J. D., & Servan-Schreiber, D. (1992). Context, cortex, and dopamine: A connectionist approach to behavior and biology in schizophrenia. Psychol Rev, 99(1), 45–77. Contreras, D., Destexhe, A., Sejnowski, T. J., & Steriade, M. (1997). Spatiotemporal patterns of spindle oscillations in cortex and thalamus. J Neurosci, 17(3), 1179–1196. Cooper, J. R., Bloom, F. E., & Roth, R. H. (1991). The biochemical basis of neuropharmacology (6th ed.). New York: Oxford University Press. Cooper, S. J. (1991). Interactions between endogenous opioids and dopamine: Implications for reward and aversion. In P. Willner & J. Scheel-Kruger (Eds.), The mesolimbic dopamine system: From motivation to action (pp. 331–366). New York: Wiley. de Carvalho, L. A. (1994). Modeling the thalamocortical loop. Int J Biomed Comput, 35(4), 267–296. Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J Neurophysiol, 76(3), 2049–2070. Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994a). Modeling the control of reticular thalamic oscillations by neuromodulators. Neuroreport, 5(17), 2217–2220. Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J Comput Neurosci, 1(3), 195–230. Epstein, I. R., & Marder, E. (1990). Multiple modes of a conditional neural oscillator. Biol Cybern, 63(1), 25–34. Foster, W. R., Ungar, L. H., & Schwaber, J. S. (1993). Significance of conductances in Hodgkin-Huxley models. J Neurophysiol, 70(6), 2502–2518. Fuxe, K., & Agnati, L. F. (1991). Volume transmission in the brain: Novel mechanism for neural transmission. Advances in Neuroscience, 1, 1–9. Gingrich, K. J., & Byrne, J. H. (1987). Single-cell neuronal model for associative learning. J Neurophysiol, 57(6), 1705–1715. Gluck, M. A., & Myers, C. E. (1993). Hippocampal mediation of stimulus representation: A computational theory. Hippocampus, 3(4), 491–516. Goldman-Rakic, P. S., Lidow, M. S., Smiley, J. F., & Williams, M. S. (1992). The anatomy of dopamine in monkey and human prefrontal cortex. J Neural Transm Suppl, 36, 163–177.
Golowasch, J., Buchholtz, F., Epstein, I. R., & Marder, E. (1992). Contribution of individual ionic currents to activity of a model stomatogastric ganglion neuron. J Neurophysiol, 67(2), 341–349. Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J Comput Neurosci, 1(1–2), 11–38. Guckenheimer, J., Gueron, S., & Harris-Warrick, R. M. (1993). Mapping the dynamics of a bursting neuron. Philos Trans R Soc Lond B Biol Sci, 341(1298), 345–359. Guckenheimer, J., Harris-Warrick, R., Peck, J., & Willms, A. (1997). Bifurcation, bursting, and spike frequency adaptation. J Comput Neurosci, 4(3), 257–277. Hagiwara, N., & Irisawa, H. (1989). Modulation by intracellular Ca2+ of the hyperpolarization-activated inward current in rabbit single sino-atrial node cells. J Physiol (Lond), 409, 121–141. Hansson, E., & Ronnback, L. (1994). Astroglial modulation of synaptic transmission. Perspect Dev Neurobiol, 2(3), 217–223. Harris-Warrick, R. M., Coniglio, L. M., Levini, R. M., Gueron, S., & Guckenheimer, J. (1995). Dopamine modulation of two subthreshold currents produces phase shifts in activity of an identified motoneuron. J Neurophysiol, 74(4), 1404–1420. Harris-Warrick, R. M., & Flamm, R. E. (1987). Multiple mechanisms of bursting in a conditional bursting neuron. J Neurosci, 7(7), 2113–2128. Harris-Warrick, R. M., & Marder, E. (1991). Modulation of neural networks for behavior. Annual Reviews of Neuroscience, 14, 39–57. Hasselmo, M. E. (1993). Acetylcholine and learning in a cortical associative memory. Neural Computation, 5, 32–44. Hasselmo, M. E. (1995). Neuromodulation and cortical function: Modeling the physiological basis of behavior. Behav Brain Res, 67(1), 1–27. Hasselmo, M. E., & Bower, J. M. (1992). Cholinergic suppression specific to intrinsic not afferent fiber synapses in rat piriform (olfactory) cortex. J Neurophysiol, 67(5), 1222–1229. Hasselmo, M. E., Linster, C., Patil, M., Ma, D., & Cekic, M. (1997). Noradrenergic suppression of synaptic transmission may influence cortical signal-to-noise ratio. J Neurophysiol, 77(6), 3326–3339. Hasselmo, M. E., & Schnell, E. (1994). Laminar selectivity of the cholinergic suppression of synaptic transmission in rat hippocampal region CA1: Computational modeling and brain slice physiology. J Neurosci, 14(6), 3898–3914. Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology, 117, 500–544. Horn, D., Ruppin, E., Usher, M., & Herrmann, M. (1993). Neuronal network modeling of memory deterioration in Alzheimer’s disease. Neural Computation, 5(5), 736–749. Jackson, S. R., Marrocco, R., & Posner, M. I. (1994). Networks of anatomical areas controlling visuospatial attention. Neural Networks, 7(6/7), 925–944. Jay, T. M., Glowinski, J., & Thierry, A. M. (1995). Inhibition of hippocampoprefrontal cortex excitatory responses by the mesocortical DA system. Neuroreport, 6(14), 1845–1848.
Jensen, O., Idiart, M. A. P., & Lisman, J. E. (1996). Physiologically realistic formation of autoassociative memory in networks with theta/gamma oscillations: Role of fast NMDA channels. Learning and Memory, 3, 243–256. Kaczmarek, L. K., & Levitan, I. B. (1987). Neuromodulation: The biochemical control of neuronal excitability. New York: Oxford University Press. Katz, P. S., & Frost, W. N. (1995a). Intrinsic neuromodulation in the Tritonia swim CPG: Serotonin mediates both neuromodulation and neurotransmission by the dorsal swim interneurons. J Neurophysiol, 74(6), 2281–2294. Katz, P. S., & Frost, W. N. (1995b). Intrinsic neuromodulation in the Tritonia swim CPG: The serotonergic dorsal swim interneurons act presynaptically to enhance transmitter release from interneuron C2. J Neurosci, 15(9), 6035– 6045. Katz, P. S., & Frost, W. N. (1996). Intrinsic neuromodulation: Altering neuronal circuits from within. Trends Neurosci, 19(2), 54–61. Katz, P. S., Getting, P. A., & Frost, W. N. (1994). Dynamic neuromodulation of synaptic strength intrinsic to a central pattern generator circuit. Nature, 367(6465), 729–731. Kepler, T. B., Marder, E., & Abbott, L. F. (1990). The effect of electrical coupling on the frequency of model neuronal oscillators. Science, 248(4951), 83–85. Kupfermann, I. (1991). Functional studies of cotransmission. Physiol Rev, 71(3), 683–732. Lechner, H. A., Baxter, D. A., Clark, J. W., & Byrne, J. H. (1996). Bistability and its regulation by serotonin in the endogenously bursting neuron R15 in Aplysia. J Neurophysiol, 75(2), 957–962. LeMasson, G., Marder, E., & Abbott, L. F. (1993). Activity-dependent regulation of conductances in model neurons. Science, 259(5103), 1915–1917. Levin, J. E., & Miller, J. P. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380(6570), 165–168. Liljenstrom, H., & Hasselmo, M. E. (1995). Cholinergic modulation of cortical oscillatory dynamics. J Neurophysiol, 74(1), 288–297. Linster, C., & Gervais, R. (1996). Investigation of the role of interneurons and their modulation by centrifugal fibers in a neural model of the olfactory bulb. J Comput Neurosci, 3(3), 225–246. Linster, C., & Hasselmo, M. (1997). Modulation of inhibition in a model of olfactory bulb reduces overlap in the neural representation of olfactory stimuli. Behav Brain Res, 84(1–2), 117–127. Linster, C., & Masson, C. (1996). A neural model of olfactory sensory memory in the honeybee’s antennal lobe. Neural Computation, 8, 94–114. Linster, C., & Smith, B. H. (1997). A computational model of the response of honey bee antennal lobe circuitry to odor mixtures: Overshadowing, blocking and unblocking can arise from lateral inhibition. Behavioural Brain Research, 87, 1–14. Lisman, J. E., & Idiart, M. A. (1995). Storage of 7 +/− 2 short-term memories in oscillatory subcycles. Science, 267(5203), 1512–1515. Longtin, A. (1993). Stochastic resonance in neuron models. Journal of Statistical Physics, 70(1/2), 309–327. Longtin, A., Bulsara, A., Pierson, D., & Moss, F. (1994). Bistability and the dy-
Computational Models of Neuromodulation
803
namics of periodically forced sensory neurons. Biol Cybern, 70(6), 569–578. Madison, D. V., Lancaster, B., & Nicoll, R. A. (1987). Voltage clamp analysis of cholinergic action in the hippocampus. J Neurosci, 7(3), 733–741. Madison, D. V., & Nicoll, R. A. (1986). Actions of noradrenaline recorded intracellularly in rat hippocampal CA1 pyramidal neurones, in vitro. J Physiol (Lond), 372, 221–244. Marder, E. (1996). Neural modulation: Following your own rhythm. Curr Biol, 6(2), 119–121. Marder, E., Christie, A. E., & Kilman, V. L. (1995). Functional organization of cotransmission systems: Lessons from small nervous systems. Invertebrate Neuroscience, 1, 105–112. Marder, E., Hooper, S. L., & Eisen, J. S. (1987). Multiple neurotransmitters provide a mechanism for the production of multiple outputs from a single neuronal circuit. In G. M. Edelman, E. W. Gall, & W. M. Cowan (Eds.), Synaptic function (pp. 305–327). New York: Wiley. Marder, E., & Selverston, A. I. (1992). Modeling the stomatogastric nervous system. In R. M. Harris-Warrick, E. Marder, A. I. Selverston, & M. Moulins (Eds.), Dynamic biological networks: The stomatogastric nervous system. Cambridge, MA: MIT Press. McCormick, D. A. (1992). Neurotransmitter actions in the thalamus and cerebral cortex and their role in neuromodulation of thalamocortical activity. Prog Neurobiol, 39(4), 337–388. McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. J Neurophysiol, 68(4), 1384–1400. McCormick, D. A., & Wang, Z. (1991). Serotonin and noradrenaline excite GABAergic neurones of the guinea-pig and cat nucleus reticularis thalami. J Physiol (Lond), 442, 235–255. McNamara, B., & Wiesenfeld, K. (1989). Theory of stochastic resonance. Physical Review A, 39(9), 4854–4869. Montague, P. R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377(6551), 725–728. Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16(5), 1936–1947. Mukherjee, P., & Kaplan, E. (1995). Dynamics of neurons in the cat lateral geniculate nucleus: In vivo electrophysiology and computational modeling. J Neurophysiol, 74(3), 1222–1243. Myers, C. E., Ermita, B. R., Harris, K., Hasselmo, M., Solomon, P., & Gluck, M. A. (1996). A computational model of cholinergic disruption of septohippocampal activity in classical eyeblink conditioning. Neurobiol Learn Mem, 66(1), 51–66. Nusbaum, M. P., Weimann, J. M., Golowasch, J., & Marder, E. (1992). Presynaptic control of modulatory fibers by their neural network targets. J Neurosci, 12(7), 2706–2714. Pearson, K. G. (1993). Common principles of motor control in vertebrates and
804
Jean-Marc Fellous and Christiane Linster
invertebrates. Annu Rev Neurosci, 16, 265–297. Rajkowski, J., Kubiak, P., & Aston-Jones, G. (1994). Locus coeruleus activity in monkey: Phasic and tonic changes are associated with altered vigilance. Brain Res Bull, 35(5–6), 607–616. Raymond, J. L., Baxter, D. A., Buonomano, D. V., & Byrne, J. H. (1992). A learning rule based on empirically-derived activity-dependent neuromodulation supports operant conditioning in small network. Neural Networks, 5, 789–803. Ridet, J. L., Rajaofetra, N., Teilhac, J. R., Geffard, M., & Privat, A. (1993). Evidence for nonsynaptic serotonergic and noradrenergic innervation of the rat dorsal horn and possible involvement of neuron-glia interactions. Neuroscience, 52(1), 143–157. Rinzel, J., & Lee, Y. S. (1987). Dissection of a model for neuronal parabolic bursting. J Math Biol, 25(6), 653–675. Sara, S. J., Vankov, A., & Herve, A. (1994). Locus coeruleus-evoked responses in behaving rats: A clue to the role of noradrenaline in memory. Brain Res Bull, 35(5–6), 457–465. Schultz, W., Apicella, P., & Ljungberg, T. (1993). Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. J Neurosci, 13(3), 900–913. Servan-Schreiber, D., Printz, H., & Cohen, J. D. (1990). A network model of catecholamine effects: Gain, signal-to-noise ratio, and behavior. Science, 249(4971), 892–895. Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu Rev Physiol, 55, 349–374. Starke, K., Gothert, M., & Kilbinger, H. (1989). Modulation of neurotransmitter release by presynaptic autoreceptors. Physiol Rev, 69(3), 864–989. Sun, X., Fonta, C., & Masson, C. (1993). Odour quality processing by bee antennal lobe neurons. Chemical Senses, 18, 355–377. Sutton, R. S., & Barto, A. G. (1990). Time-derivative models of Pavlovian reinforcement. In M. Gabriel & J. Moore (Eds.), Learning and computational neuroscience. Cambridge, MA: MIT Press. Tang, A. C., & Hasselmo, M. E. (1994). Selective suppression of intrinsic but not afferent fiber synaptic transmission by baclofen in the piriform (olfactory) cortex. Brain Res, 659(1-2), 75–81. Traub, R. D., Miles, R., & Buzsaki, G. (1992). Computer simulation of carbacholdriven Rhythmic population oscillations in the CA3 region of the in vitro rat hippocampus. J Physiol (Lond), 451, 653–672. Traub, R. D., Whittington, M. A., Colling, S. B., Buzsaki, G., & Jefferys, J. G. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J Physiol (Lond), 493(Pt 2)), 471–484. Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264(5161), 974–977. Usher, M., Cohen, J., Servan-Schreiber, D., Rajkowski, J., Kubiak, P., & AstonJones, G. (1995). A computational model of locus coeruleus function and its influence on behavioral performance in primate (Tech. rep. No. PDP.CNS 95.1). Pittsburgh: Carnegie Mellon University, University of Pittsburgh. van Dongen, P. A. (1981). The central noradrenergic transmission and the locus
Computational Models of Neuromodulation
805
coeruleus: A review of the data, and their implications for neurotransmission and neuromodulation. Prog Neurobiol, 16(2), 117–143. Wallenstein, G. V. (1993). Spatial mode dynamics of a thalamo-cortical network. Paper presented at the SPIE, Chaos in Biology and Medicine. Wallenstein, G. V. (1996). Adenosinic modulation of 7-14 Hz spindle rhythms in interconnected thalamic relay and nucleus reticularis neurons. Neuroscience, 73(1), 93–98. Wallenstein, G. V., & Hasselmo, M. E. (1997a). Functional transitions between epileptiform-like activity and associative memory in hippocampal region CA3. Brain Res Bull, 43(5), 485–493. Wallenstein, G. V., & Hasselmo, M. E. (1997b). GABAergic modulation of hippocampal population activity: Sequence learning, place field development, and the phase precession effect. J Neurophysiol, 78(1), 393–408. Wang, X. J., & Buzsaki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J Neurosci, 16(20), 6402–6413. Wilson, M., & Bower, J. M. (1992). Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. J Neurophysiol, 67(4), 981–995. Zador, A., Koch, C., & Brown, T. H. (1990). Biophysical model of a Hebbian synapse. Proc Natl Acad Sci U S A, 87(17), 6718–6722. Received January 2, 1997; accepted October 23, 1997.
NOTE
Communicated by Helge Ritter
Weight-Value Convergence of the SOM Algorithm for Discrete Input Siming Lin Jennie Si Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287, U.S.A.
Some insights on the convergence of the weight values of the self-organizing map (SOM) to a stationary state in the case of discrete input are provided. The convergence result is obtained by applying the Robbins-Monro algorithm and is applicable to input-output maps of any dimension.

Neural Computation 10, 807–814 (1998). © 1998 Massachusetts Institute of Technology.

1 Introduction

The self-organizing map (SOM) has found wide applications in speech processing (Togneri, Alder, & Attrikiouzel, 1992), vector quantization (Mcauliffe, Atlas, & Rivera, 1990; Nasrabadi & Feng, 1988), image coding (Carrato, 1994), and biological signal analysis (Lin, Si, & Schwartz, 1997). In comparison to the applications of the SOM, a thorough theoretical study of the SOM in terms of topology preserving, convergence analysis, performance analysis, and, in particular, multidimensional input seems lacking. In this article, we focus on one specific aspect of the analysis to provide some insight into the SOM learning process for discrete input cases that are related to such applications as image data compression and image coding.

Several researchers have identified many interesting properties of the SOM in one-dimensional input cases. Kohonen (1982) presented a proof of ordering properties if the input is generated from a one-dimensional uniform distribution and the output map is a chain (referred to as a one-dimensional case) and showed that the expectation of weight values converges to a unique limit. Kohonen proposed analyzing the SOM for the cases of continuous input. But the minimization of the objective function leads to a different learning algorithm from the SOM (Kohonen, 1991). By viewing the weight values of the SOM as states of a Markov process, Erwin, Obermayer, and Schulten (1992), and Bouton and Pagès (1993) showed that the output map is ordered almost surely for a one-dimensional case with nonuniformly distributed input. Using the ordinary differential equation (ODE) approach, Horowitz and Alvarez (1995) showed that the SOM converges to the best locally affine feature map for a one-dimensional case with nonuniformly distributed input. On the other hand, results on the properties of the
multidimensional SOM (multidimensional input, one-dimensional or multidimensional output map) are not fully developed. Yin and Allinson (1995) analyzed the statistical properties of the weight values of the SOM algorithm by considering the weight values at any iteration step as a summation of a contribution from the initial state and a contribution from the inputs at and before the iteration step. However, it is assumed that there is no "dead" node in the output map starting from any time step t (t = 0, 1, . . . , ∞). In other words, it is assumed that the probability for each output node to be updated is always a positive number starting from any time step t (t = 0, 1, . . . , ∞). This seems to be an unreasonable assumption since it actually assumes that each weight vector of the SOM is updated sequentially by an infinite subsequence of the input with the neighborhood function equal to one in the updating equation of the SOM algorithm, which will directly result in the convergence of each weight vector to the mass center of the subsequence in a mean-square sense (the conclusion proved in Yin & Allinson, 1995). Likhovidov (1997) applied a variational approach to prove the convergence of the weight values of a class of unsupervised learning algorithms. Since the boundary condition of the output map of the SOM algorithm could not satisfy the conditions imposed in Likhovidov (1997), the convergence proof is suitable only to a modified SOM algorithm proposed in Li, Gasteiger, & Zupan (1993). Ritter and Schulten (1988a) investigated the weight-value convergence to stationary states for a multidimensional case by a Markov process argument. Their result holds when the weights of the SOM network have reached a vicinity of the stationary states. Ritter and Schulten (1988b) also introduced an objective function to be associated with the SOM algorithm in the case of multidimensional and discrete inputs. However, no further systematic analyses are provided based on this objective function (Ritter & Schulten, 1988b).

We start with the same objective function that Ritter and Schulten (1988b) proposed and show (in section 4) that the weight values of the SOM network converge to a stationary state. Preliminary results are given in sections 2 and 3, and concluding remarks are made in section 5.

2 Preliminaries

We first give some preliminaries on the Robbins-Monro algorithm, which will be used in obtaining our main results. A function g(w) with the form g(w) = E[f(w)] is called a regression function of f(w); conversely, f(w) is called a sample function of g(w). The following conditions are needed to obtain the Robbins-Monro algorithm:

C1: g(w) has a single root w∗, g(w∗) = 0, and g(w) < 0 if w < w∗, g(w) > 0 if w > w∗.
This is assumed with little loss of generality since most functions of a single root not satisfying this condition can be made to do so by multiplying the function by −1.

C2: The variance of f(w) from g(w) is finite,

σ²(w) = E[g(w) − f(w)]² < ∞.   (2.1)

C3:

|g(w)| < A|w − w∗| + B < ∞.   (2.2)

C3 is a very mild condition as discussed in Tou and Gonzalez (1974). The values of A and B need not be known to prove the validity of the algorithm. As long as the root lies in some finite interval, the existence of A and B can always be assumed.

If the conditions C1 through C3 are satisfied, the algorithm due to Robbins and Monro (1951) can be used iteratively to find the root w∗ of the function g(w),

w(k + 1) = w(k) − α(k) f[w(k)],   (2.3)

where α(k) is a sequence of positive numbers that satisfy the following conditions:

lim_{k→∞} α(k) = 0,   Σ_{k=0}^{∞} α(k) = ∞,   Σ_{k=0}^{∞} α²(k) < ∞.   (2.4)

Furthermore, w(k) will converge toward w∗ in the mean square error sense (Robbins & Monro, 1951) and with probability one (Blum, 1954), that is,

lim_{k→∞} E[‖w(k) − w∗‖²] = 0   (2.5)

Prob{ lim_{k→∞} w(k) = w∗ } = 1.   (2.6)

The convergence with probability one in equation 2.6 is also called almost sure convergence.
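To make the iteration concrete, the following fragment applies it to a simple regression function. This is an illustrative Python sketch of our own; the noisy sample function and the step-size schedule are assumptions chosen to satisfy equations 2.3 and 2.4, not anything specified in the note itself.

    import random

    def robbins_monro(sample_f, w0, num_steps=10000):
        # Iterate w(k+1) = w(k) - alpha(k) f[w(k)] (equation 2.3).
        # alpha(k) = 1/(k+1) satisfies the three conditions of equation 2.4.
        w = w0
        for k in range(num_steps):
            w -= (1.0 / (k + 1)) * sample_f(w)
        return w

    # g(w) = E[f(w)] = w - 3 has the single root w* = 3 (condition C1),
    # and the Gaussian noise gives f(w) a finite variance (condition C2).
    f = lambda w: (w - 3.0) + random.gauss(0.0, 1.0)
    print(robbins_monro(f, w0=0.0))  # converges toward 3.0 almost surely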
In this article, the Robbins-Monro algorithm is applied to an optimization (minimization) problem. In that setting, g(w) = ∂J/∂w, where J is an objective function to be minimized and will be specified later. If J has a local minimum at w∗, g(w) will satisfy condition C1 locally at w∗. If J has a quadratic form, g(w) will satisfy condition C1 globally. Replacing w with the vector notation W in earlier discussions, we can apply all the conclusions to multidimensional cases.

Figure 1: Feature mapping architecture.

3 The SOM Algorithm

The self-organizing feature map (see Figure 1) learns the topological mapping f : G ⊂ Rm → X ⊂ Rn by adapting the weight vectors Wi of the output units,

Wi(k + 1) = Wi(k) + α(k)Λ(i, c)[X(k) − Wi(k)],   (3.1)

where k = 0, 1, 2, . . . , is the discrete-time coordinate, α(k) is the learning rate factor, and Λ(i, c) is the neighborhood function depending on the distance ‖rc − ri‖ between the unit i and the winner c in the output layer, where rc and ri denote the coordinate positions of the winner c and the unit i in the output layer, respectively. The winner node c is defined to be the node whose weight vector has the smallest Euclidean distance from the input X(k):

‖Wc(k) − X(k)‖ ≤ ‖Wi(k) − X(k)‖   ∀ unit i ∈ G.   (3.2)
As an example, the following was proposed by Kohonen (1995):

Λ(i, c) = 1 for ‖ri − rc‖ ≤ Nc(k), and Λ(i, c) = 0 otherwise,   (3.3)
where Nc(k) is some decreasing function of time. The value of Nc(k) is usually large in the beginning and then shrinks during the learning process. Note that our following result does not depend on a prespecified form of a neighborhood function—neither its value nor its coverage during the learning process.
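For concreteness, one learning step of equations 3.1 through 3.3 can be sketched as follows. This is an illustrative Python fragment under our own naming conventions; the note itself prescribes no implementation.

    import numpy as np

    def som_step(W, R, x, alpha, Nc):
        # W: (L, n) weight vectors; R: (L, m) output-layer coordinates;
        # x: input vector X(k); alpha: learning rate; Nc: neighborhood radius.
        c = np.argmin(np.linalg.norm(W - x, axis=1))     # winner, equation 3.2
        Lam = (np.linalg.norm(R - R[c], axis=1) <= Nc)   # Lambda(i, c), equation 3.3
        return W + alpha * Lam[:, None] * (x - W)        # update, equation 3.1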
4 Convergence Analysis of the SOM Algorithm with Discrete Input

Ritter and Schulten (1988b) proposed an objective function for the case of discrete input,

Jdis(W) = (1/2) Σ_{c,i} Λ(i, c) Σ_{Xj ∈ Xc} pj ‖Xj − Wi‖²,   (4.1)

where Xc = {Xj : ‖Xj − Wc‖ ≤ ‖Xj − Wi‖, Xj ∈ X, ∀i}, and pj is the discrete probability density. No further systematic analyses are provided based on this objective function (Ritter & Schulten, 1988b). We use this objective function to provide analyses on the convergence of the weight values to a stationary state in the SOM networks by applying the Robbins-Monro algorithm. First though, we need some assumptions:

A.1: The input Xj ∈ X ⊂ Rn has discrete probability density p(X) = Σ_{j=1}^{N} pj δ(X − Xj).

A.2: The learning rate α(k) satisfies the condition in equation 2.4.

Theorem 4.1. Assume that A.1 and A.2 hold. Then the SOM algorithm (locally or globally) minimizes the objective function in equation 4.1, and the weights converge almost surely to a stationary solution Wi∗, i = 1, 2, . . . , L in equation 4.7 if the stationary solution exists.

Proof. Jdis(W) is continuous, but only piecewise differentiable. Taking derivatives of both sides of equation 4.1 and rearranging the terms in the summation, we obtain,

∂Jdis/∂Wi = −Σ_c Λ(i, c) Σ_{Xj ∈ Xc} pj (Xj − Wi),   (4.2)

since

X = ∪_c Xc,   (4.3)
then

∂Jdis/∂Wi = −Σ_{Xj ∈ X} pj Λ(i, c)(Xj − Wi) = −E[Λ(i, c)(X − Wi)].   (4.4)

Since X is drawn from a certain distribution in the input space, the mean and variance of X are finite. Therefore, conditions C2 and C3 in the Robbins-Monro algorithm are satisfied. The root of ∂Jdis/∂Wi = 0 can be estimated by invoking the Robbins-Monro algorithm with f(Wi) = −Λ(i, c)(X − Wi). Using equation 2.3, we then obtain the updating equation, 3.1. Letting Wi∗, i = 1, 2, . . . , L be a stationary solution, we have

∂Jdis/∂Wi∗ = 0.   (4.5)

Then,

E[Λ(i, c)(X − Wi∗)] = 0.   (4.6)

It follows that

Wi∗ = Σ_{Xj ∈ X} Λ(i, c) Xj pj / Σ_{Xj ∈ X} Λ(i, c) pj.   (4.7)
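Equation 4.7 can be checked numerically by evaluating its right-hand side once the winner of every input is known; iterating this evaluation gives a batch-style fixed-point search. The sketch below is our own Python illustration using the step neighborhood of equation 3.3; it is not part of the original note.

    import numpy as np

    def stationary_weights(W, R, X, p, Nc):
        # One evaluation of the right-hand side of equation 4.7.
        # X: (N, n) discrete inputs with probabilities p; W: (L, n); R: (L, m).
        c = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
        # Lam[i, j] = Lambda(i, c(X_j)) for the neighborhood of equation 3.3.
        Lam = np.linalg.norm(R[:, None, :] - R[None, c, :], axis=-1) <= Nc
        num = (Lam[:, :, None] * p[None, :, None] * X[None, :, :]).sum(axis=1)
        den = (Lam * p[None, :]).sum(axis=1)[:, None]
        return num / den   # weighted centroids; a unit with empty support yields 0/0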
Suppose the neighborhood function in equation 3.3 is used with a constant neighborhood range Nc(k) = Nc in the final learning phase; then

Wi∗ = Σ_{Xj ∈ Ωi} Xj pj / Σ_{Xj ∈ Ωi} pj,   (4.8)
where Ωi = {X : ‖ri − rc‖ ≤ Nc}. Two extreme cases are useful to demonstrate the importance of the neighborhood function. First, if the range of the neighborhood function covers the entire output map, that is, Nc(k) = ∞, then Ωi = X. According to equation 4.8, each weight vector converges to the same stationary state, which is the mass center of the training data set. This implies that if we want to eliminate the effect of the initial conditions, we should use a neighborhood function covering a large range of the output map. Second, if the range of the neighborhood function equals 0, that is, Nc(k) = 0, then Ωi = Vi, where Vi = {X : ‖X − Wi∗‖ ≤ ‖X − Wj∗‖, for all j}.
Therefore, Wi∗, i = 1, 2, . . . , L become the centroids of the cells of a Voronoi partition of the inputs, and the final iterations of the SOM algorithm become a sequential updating process of vector quantization. This implies that the SOM could be used for vector quantization by shrinking the range of the neighborhood function to zero during the learning process.

The Robbins-Monro algorithm ensures that the weights tuned by the SOM converge to the root of ∂J/∂Wi = 0 almost surely if the root exists, where J is an objective function associated with the training process. In practice, J usually exhibits many (local) minima. Therefore, it is inevitable that the weights of SOM so obtained would converge only to solutions corresponding to local minima. Nevertheless, one always has the chance of comparing and choosing a solution with a lower value in J. It has also been observed that by introducing the neighborhood function, the SOM algorithm is capable to some extent of escaping from local minima when the SOM is used for vector quantization (Mcauliffe et al., 1990). However, the topology ordering information is not explicitly represented in the objective function J. Therefore, this objective function–based proof of the convergence makes no statement about the topology ordering of the weight vectors, which now remains as a well-observed practice in many applications.

5 Discussion

The convergence of the weight values of the SOM algorithm to a stationary state for the case of discrete inputs is analyzed using the Robbins-Monro algorithm. The result holds for arbitrary dimensionality of the input vector and output map without any restriction on the form of the neighborhood function. In many applications, such as image data compression (Mcauliffe et al., 1990) and image coding (Carrato, 1994), the inputs to the SOM are from discrete distributions. Our analysis provides some insights on the convergence of the weight values (code words) of the SOM algorithm in these applications.

Acknowledgments

This research was supported in part by NSF under grant ECS-9553202, by EPRI under grant RP8015-03, by NIH, and by Motorola.

References

Blum, J. R. (1954). Approximation methods which converge with probability one. Ann. Math. Stat., 25, 382–386.
Bouton, C., & Pagès, G. (1993). Self-organization and a.s. convergence of the one-dimensional Kohonen algorithm with non-uniformly distributed stimuli. Stochastic Process Appl., 47, 249–274.
Carrato, S. (1994). Image vector quantization using ordered codebooks: Properties and applications. Signal Processing, 40(1), 87–103.
Erwin, E., Obermayer, K., & Schulten, K. (1992). Self-organizing maps: Ordering, convergence properties and energy functions. Biol. Cybern., 67(1), 35–45.
Horowitz, R., & Alvarez, L. (1995). Convergence properties of self-organizing neural networks. Proceedings of the 1995 American Control Conference (Vol. 2, pp. 1334–1339).
Kohonen, T. (1982). Analysis of a simple self-organizing process. Biol. Cybern., 44(2), 135–140.
Kohonen, T. (1991). Self-organizing maps: Optimization approaches. In T. Kohonen, K. Makisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (Vol. 1, pp. 981–990). Amsterdam: North-Holland.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Li, X., Gasteiger, J., & Zupan, J. (1993). On the topology distortion in self-organizing feature maps. Biol. Cybern., 70, 189–198.
Likhovidov, V. (1997). Variational approach to unsupervised learning algorithm of neural networks. Neural Networks, 10(2), 273–289.
Lin, S., Si, J., & Schwartz, A. B. (1997). Self-organization of firing activities in monkey's motor cortex: Trajectory computation from spike signals. Neural Comp., 9(3), 607–621.
Mcauliffe, J. D., Atlas, L. E., & Rivera, C. (1990). A comparison of the LBG algorithm and Kohonen neural network paradigm for image vector quantization. In Proc. ICASSP-90, Int. Conf. on Acoustics, Speech and Signal Processing (Vol. 4, pp. 2293–2296).
Nasrabadi, N. M., & Feng, Y. (1988). Vector quantization of images based upon the Kohonen self-organizing feature maps. IEEE International Conference on Neural Networks, San Diego (pp. 1101–1108).
Ritter, H., & Schulten, K. (1988a). Convergence properties of Kohonen's topology conserving maps: Fluctuation, stability, and dimension selection. Biol. Cybern., 60, 59–71.
Ritter, H., & Schulten, K. (1988b). Kohonen's self-organizing maps: Exploring their computational capabilities. In Proceedings of the 1988 IEEE International Conference on Neural Networks, San Diego, CA (pp. 109–116).
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat., 22, 400–407.
Togneri, R., Alder, M. D., & Attrikiouzel, Y. (1992). Dimension and structure of the speech space. IEE Proceedings—I, Commun. Speech & Vision, 139(2), 123–127.
Tou, J. T., & Gonzalez, R. C. (1974). Pattern recognition principles. Reading, MA: Addison-Wesley.
Yin, H., & Allinson, N. M. (1995). On the distribution and convergence of feature space in self-organizing maps. Neural Comp., 7(6), 1178–1187.

Received July 31, 1997; accepted September 26, 1997.
NOTE
Communicated by Laurence Abbott
Reading Neuronal Synchrony with Depressing Synapses W. Senn Department of Neurobiology, Hebrew University, Jerusalem 91904, Israel, Department of Physiology, University of Bern, Switzerland
I. Segev Department of Neurobiology and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
M. Tsodyks Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel
A recent experiment showed that neurons in the primary auditory cortex of the monkey do not change their mean firing rate during an ongoing tone stimulus. The only change was an enhanced correlation among the individual spike trains during the tone. We show that there is an easy way to extract this coherence information in the cortical cell population by projecting the spike trains through depressing synapses onto a postsynaptic neuron.

Neural Computation 10, 815–819 (1998). © 1998 Massachusetts Institute of Technology.

Understanding how information about the world is represented and read out from large neuronal populations is one of the most challenging tasks of neuroscience. Recent experiments indicate that the timing of the individual spikes might be used to represent external or internal stimuli while the mean firing rate could even be constant (Vaadia et al., 1995; Mainen & Sejnowski, 1995; Meister, Lagnado, & Baylor, 1995; Alonso, Usrey, & Reid, 1996). At the same time, theoretical analysis of spike trains showed that individual spike times are much more reliable than those for random trains with the same mean and variance (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997). This raises the question of how the information encoded in a neural population is read out by a postsynaptic cell if this cell receives the same number of spikes during its integration time.

The list of experimental evidence that spike timings and correlation among spike trains may carry important information was recently enlarged by the study of deCharms and Merzenich (1996) on anesthetized monkeys. They recorded individual cells and local field potentials in the primary auditory cortex (AI) of the monkey while stimulating with a pure tone. Apart from transient activity increase at the tone onset and offset, the activity averaged over different presentations and different locations turned out to be the same during and before the tone. However, during the tone stimulus, the correlations among the individual spike times of two simultaneous recordings were significantly higher than before or after the stimulus.

It was recently demonstrated that fast synaptic depression may facilitate transmitting synchronous activity of neuronal ensembles (Tsodyks & Markram, 1996, 1997; Abbott, Varela, Sen, & Nelson, 1997). We therefore propose that the information about the presence of the ongoing tone stimulus that is distributed over the AI neurons could be read out through frequency-dependent depressing synapses.

To test our hypothesis, we simulated the output of 500 AI cells by random spike trains (see Figure 1c) exhibiting the same statistical properties as reported in deCharms and Merzenich (1996). There is experimental evidence to assume that before and during the tone, the neurons fire in short bursts (R. C. deCharms, personal communication, 1997), say, with bursts of three to four spikes within 40–50 ms, repeated every 200–250 ms. During the tone, the burst onsets are assumed to be synchronized within groups of 100 neurons that are randomly assembled anew for each burst. Such a scenario is similar to the activity in the monkey frontal cortex during a reaching task, where synchronization among rapidly associated subgroups occurs in the presence of a constant mean firing rate (Vaadia et al., 1995). Since in our simulation the bursting times of the groups alternate during the ongoing tone, the overall firing rate of the population remains constant, apart from the short onset and offset of the tone when most cells burst together (see Figure 1d). Spike trains generated by this scheme produced cross-correlations (see Figure 1a), which match those calculated from the actual recordings (cf. deCharms & Merzenich, 1996).

The synaptic depression was modeled by assuming a limited amount of synaptic neurotransmitter, which recovers with a slow time constant of 800 ms (Tsodyks & Markram, 1997). Whenever a presynaptic spike arrives, a fixed fraction of 0.8 of the available transmitter is released. (More generally, this fraction could be transiently raised by each spike, introducing a fast facilitating component, although this would not change the main results.) The synaptic conductance rises instantaneously to an amplitude proportional to the released transmitter and decays with a time constant of 3 ms.¹ During a burst, the response of such a depressing synapse rapidly decreases for successive spikes due to the depletion of the transmitter and its slow recovery. But during a nonbursting period, the transmitter has time to recuperate, and this results in a strong postsynaptic response at the onset of the next burst. If we compare this dynamic response with that for a nondepressed synapse evoking on average the same postsynaptic potential, the depressed synapse will have a larger response at the burst onset and a smaller response toward the end of the about 50-ms-long bursts.

¹ The maximal synaptic conductance was ḡ = 0.15 for the depressing synapses and ḡ = 0.0073 for the nondepressing synapses. The synaptic reversal potential was 0 mV in both cases.
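The mechanism is simple enough to sketch in a few lines. The fragment below is an illustrative Python reconstruction of our own, not the authors' simulation code; it uses the synaptic parameters just quoted and, for the readout, the integrate-and-fire constants given later in the text (20 ms membrane time constant, −70 mV rest, −53.6 mV threshold).

    import numpy as np

    def depressing_readout(spikes, dt=1.0):
        # spikes: (T, N) binary array of presynaptic spike trains; dt in ms.
        T, N = spikes.shape
        res = np.ones(N)              # available transmitter per synapse
        g = np.zeros(N)               # synaptic conductances
        v = -70.0                     # membrane potential (mV)
        vs = np.empty(T)
        for t in range(T):
            rel = 0.8 * res * spikes[t]            # fraction 0.8 released per spike
            res += dt * (1.0 - res) / 800.0 - rel  # slow recovery, tau = 800 ms
            g += 0.15 * rel - dt * g / 3.0         # g_bar = 0.15, 3 ms decay
            v += dt * ((-70.0 - v) / 20.0 + np.sum(g) * (0.0 - v))  # reversal 0 mV
            if v >= -53.6:                         # threshold: fire and reset
                v = -70.0
            vs[t] = v
        return vs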
Figure 1: Applying a tone stimulus (b, amplitude envelope), the model auditory cells respond at the onset and offset, but during the stimulus they correlate their bursts only among randomly assembled subgroups (c, spike raster; a, cross-correlations (CC) among spike trains). Since the mean firing rate is on the background level during the tone (d, spikes per second per neuron), a postsynaptic neuron gathering the spike trains through synapses of constant weight would respond only at the stimulus onset and offset (e). With depressing synapses, however, the postsynaptic neuron detects the correlated bursts and fires during the tone as well (f).
Feeding the synthetically generated spike trains into a leaky integrate-and-fire neuron² showed that the synaptic depression is indeed able to detect the partial synchrony in the burst times. With nondepressing synapses, the postsynaptic membrane potential follows the presynaptic mean firing rate (see Figure 1d) and is continuously below threshold apart from the tone onset and offset (see Figure 1e). With depressing synapses, however, the partially synchronized bursts pushed the postsynaptic membrane potential across threshold repeatedly during the stimulus (see Figure 1f). The spikes are triggered at burst onsets when a group of recovered AI neurons starts to fire. During such a burst, the postsynaptic membrane potential decreases because of the synaptic depression and eventually falls below the potential average. Since the bursts are not synchronized in the absence of the stimulus, high responses at burst onsets are canceled by depressed responses at burst ends before and after the tone (see Figure 1f).

² The membrane time constant was chosen to be 20 ms, the resting potential was −70 mV, and the threshold was set to −53.6 mV.

One could ask whether it would be possible for a postsynaptic neuron with nondepressing synapses to detect the synchrony by selecting any particular subpopulation of AI cells. However, this is difficult since the composition of the synchronized subgroups changes randomly.

The example shows that rapidly depressing synapses enable the brain to extract coincidence information that otherwise would be hidden or would require additional circuits. Since speed and strength of the depression are known to be regulated by the timing between pre- and postsynaptic spikes (Markram & Tsodyks, 1996), such a timing also determines the capacity for extracting the coincidence information in the presynaptic population. The message and its decoding mechanism appear to be dynamically interwoven, and this generates the power of cortical information processing but also makes it difficult to break the neural code.

Acknowledgments

W. S. was supported by the Swiss National Science Foundation, grant 500203793 (Schwerpunktprogramm Biotechnologie). M. T. was supported by the Minerva Foundation. We thank Christopher deCharms for helpful discussions.

References

Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Alonso, J.-M., Usrey, W., & Reid, R. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action-potential timing. Nature, 381, 610–613.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Meister, M., Lagnado, L., & Baylor, D. (1995). Concerted signaling by retinal ganglion cells. Science, 270, 1207–1210.
Tsodyks, M., & Markram, H. (1996). Plasticity of neocortical synapses enables transitions between rate and temporal coding. In C. von der Malsburg (Ed.), Proceedings of the ICANN'96 (pp. 445–450). Berlin: Springer-Verlag.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518.

Received August 26, 1997; accepted October 28, 1997.
LETTER
Communicated by Laurence Abbott
Neural Networks with Dynamic Synapses Misha Tsodyks Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel
Klaus Pawelzik Max-Planck-Institut für Strömungsforschung, D-37073 Göttingen, Germany
Henry Markram Department of Neurobiology, Weizmann Institute of Science, Rehovot 76100, Israel
Transmission across neocortical synapses depends on the frequency of presynaptic activity (Thomson & Deuchars, 1994). Interpyramidal synapses in layer V exhibit fast depression of synaptic transmission, while other types of synapses exhibit facilitation of transmission. To study the role of dynamic synapses in network computation, we propose a unified phenomenological model that allows computation of the postsynaptic current generated by both types of synapses when driven by an arbitrary pattern of action potential (AP) activity in a presynaptic population. Using this formalism, we analyze different regimes of synaptic transmission and demonstrate that dynamic synapses transmit different aspects of the presynaptic activity depending on the average presynaptic frequency. The model also allows for derivation of mean-field equations, which govern the activity of large, interconnected networks. We show that the dynamics of synaptic transmission results in complex sets of regular and irregular regimes of network activity.

Neural Computation 10, 821–835 (1998). © 1998 Massachusetts Institute of Technology.

1 Introduction

A marked feature of synaptic transmission between neocortical neurons is a pronounced frequency dependence of synaptic responses to trains of presynaptic spikes (Thomson & Deuchars, 1994). The nature of this dynamic transmission varies among different classes of neurons. In our recent article (Tsodyks & Markram, 1996; see also Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997) we studied synaptic depression between neocortical pyramidal neurons with the aid of a phenomenological model. We found that the rate of depression is a primary factor in determining which features of the action potential (AP) activity in the presynaptic population are most effective in driving the postsynaptic neuron.

The phenomenological formulation of Tsodyks and Markram (1996) and Abbott et al. (1997) can be generalized to describe facilitating synapses between pyramidal cells and inhibitory interneurons (Thomson & Deuchars, 1994; Markram, Tsodyks, and Wang, in press). This formulation has two major goals. First, it allows the quantification of the features of the AP activity of the presynaptic neurons and populations transmitted by these different types of synapses. Second, it can be used in deriving a novel mean-field dynamics of neocortical networks aimed at understanding the dynamic behavior of large neuronal populations without having to solve an equally large number of equations. Mean-field descriptions were extensively used in order to understand the possible computations of cortical neural networks (see, e.g., Wilson & Cowan, 1972; Amit & Tsodyks, 1991; Ginsburg & Sompolinsky, 1994; Tsodyks, Skaggs, Sejnowski, & McNaughton, 1997). The novel formulation that uses the generalized phenomenological model of dynamic properties of synaptic connections between different types of neocortical neurons allows the study of the effects of synaptic dynamics and synaptic plasticity on information processing in large neural networks.

2 Phenomenological Model of Neocortical Synapses

In order to derive a coarse-grained description of neuronal dynamics, we have to compute the postsynaptic current generated by a population of neurons with a particular firing rate. This can be done with the phenomenological model of neocortical synapses used in Tsodyks and Markram (1997) and Abbott et al. (1997), which was shown to reproduce well the synaptic responses between pyramidal neurons. The model assumes that a synapse is characterized by a finite amount of resources. Each presynaptic spike (arriving at time tsp) activates a fraction (USE, utilization of synaptic efficacy) of resources, which then quickly inactivate with a time constant of a few milliseconds (τin) and recover with a time constant of about 1 second (τrec). The corresponding kinetic equations read:

dx/dt = z/τrec − USE x(tsp − 0) δ(t − tsp)
dy/dt = −y/τin + USE x(tsp − 0) δ(t − tsp)
dz/dt = y/τin − z/τrec,   (2.1)
where x, y, and z are the fractions of resources in the recovered, active, and inactive states, respectively. The postsynaptic current is taken to be proportional to the fraction of resources in the active state, Is(t) = ASE y(t). The two major parameters of the model are ASE, the absolute synaptic strength, which can be exhibited only by activating all of the resources, and USE, which determines the dynamics of the synaptic response. For an individual synapse, the model reproduces the postsynaptic responses generated
by any presynaptic spike train tsp for interpyramidal synapses in layer V (Tsodyks & Markram, 1997).

2.1 Modeling Facilitating Synapses. The formulation of equation 2.1 does not include a facilitating mechanism, which is not evident between pyramidal neurons. It is, however, prominent in synapses between pyramidal neurons and inhibitory interneurons (Thomson & Deuchars, 1994). A standard way of modeling short-term facilitation is by introducing a facilitation factor, which is elevated by each spike by a certain amount and decays between spikes, possibly at several rates (see, e.g., Mallart & Martin, 1967; Zengel & Magleby, 1982). To add facilitation into our synaptic model, we therefore assume that the value of USE is not fixed but is increased by a certain amount due to each presynaptic spike. The running value of USE is referred to as U¹SE. The resulting model includes both facilitating and depressing mechanisms. Increase in USE could reflect, for example, the accumulation of calcium ions caused by spikes arriving in the presynaptic terminal, which is responsible for the release of neurotransmitter (Bertram, Sherman, & Stanely, 1996). For a simple kinetic scheme, assume that an AP causes a fraction of USE calcium channels to open, which subsequently close with a time constant of τfacil. The fraction of opened calcium channels determines the current value of U¹SE. The corresponding kinetic equation therefore reads:

dU¹SE/dt = −U¹SE/τfacil + USE (1 − U¹SE) δ(t − tsp).   (2.2)

USE determines the increase in the value of U¹SE due to each spike and coincides with the value of U¹SE reached upon the arrival of the first spike (in other words, at a very low frequency of stimulation). This equation can be transformed into an iterative expression for the value of U¹SE reached upon the arrival of the nth spike in a train, which determines the postsynaptic response according to equation 2.1,

U¹SE(n+1) = U¹SE(n) (1 − USE) exp(−δt/τfacil) + USE,   (2.3)

where δt is the time interval between the nth and (n + 1)th spikes. If the presynaptic neuron emits a regular spike train at the frequency r, U¹SE reaches a steady value of

USE / [1 − (1 − USE) exp(−1/(r τfacil))].

Thus in this formulation, U¹SE becomes a frequency-dependent variable, and USE is a kinetic parameter characterizing an activity-dependent transmission in a given synapse.¹
One could introduce two independent parameters describing initial value and degree
824
Misha V. Tsodyks, Klaus Pawelzik, and Henry Markram
Facilitating and depressing mechanisms are intricately interconnected 1 values, which in turn leads since stronger facilitation leads to higher USE to stronger depression. The value of USE therefore determines the contribution of facilitation in generating subsequent synaptic responses. Facilitation is marked for small values of USE and is not observed for higher USE . We found that the main features of synaptic transmission between pyramidal neurons and inhibitory interneurons are well captured by this model with USE ∼ 0.01 → 0.05, and τrec is typically several times faster than τ f acil (Markram et al., in press; see also Figure 1D). Figures 1A and 1B show responses from facilitating and depressing synapses with the same absolute strength to a regular spike train of 20 Hz (but with input resistance of the facilitatory synapse’s target 10 times higher). Figure 1C illustrates the buildup of depression in facilitating synapses when they are stimulated at high frequencies. As a result, the stationary level of response exhibits a tuning curve dependence on the frequency, in agreement with experimental results (see Figure 1D). 3 Population Signal We now return to our original problem of signaling from a large population of presynaptic neurons. There is an infinite number of ways the neurons of a population can fire relative to each other. Analysis of neurophysiological data revealed that individual neurons in vivo fire irregularly at all rates (Softky & Koch, 1993), reminiscent of the so-called Poisson process. Mathematically, the Poisson assumption means that at each moment, the probability that a neuron will fire is given by the value of the instantaneous firing rate and is independent of the timing of previous spikes. This assumption allows averaging equations 2.1 and 2.2 over different realizations of Poisson trains with a given rate, to obtain a new dynamics for the corresponding mean quantities (Amit & Tsodyks, 1991): 1 − hxi dhxi 1 = − hUSE ihxir(t) dt τrec − hU− i i dhUSE − = − SE + USE (1 − hUSE i)r(t) dt τ f acil − 1 i = hUSE i(1 − USE ) + USE , hUSE
(3.1)
− i where r(t) denotes the rate of a Poisson train for the neuron at time t. hUSE 1 immediately before the spike. Depressing denotes the average value of USE synapses are described by the first of these equations with the fixed value 1 (see also Grossberg, 1969, for the earlier analysis of these equations). of USE 1 . More data are required to determine whether this is needed to model of facilitation of USE facilitating synapses in neocortex accurately.
Neural Networks with Dynamic Synapses
825
1
mV
A
D − 20Hz
0.5
mV
0 0 10
200 B
600
800
1000
1200
400
600
800
1000
1200
F − 20Hz
5 0 0 20
200 C
mV
400
F − 70Hz
10
mV
0 0
Time (ms) 50
100
150
200
250
300
350
400
450
500
10
20
30
40 50 60 Frequency (Hz)
70
80
90
100
5
0 0
Figure 1: Phenomenological synaptic model. (A) Simulated postsynaptic potential generated by a regular spike train at a frequency of 20 Hz transmitted through a depressing synapse. (B) Same as A for a facilitating synapse. (C) Same as B but for a presynaptic frequency of 70 Hz. (D) Stationary level of excitatory postsynaptic potentials versus presynaptic frequency for facilitating synapses. Open circles: Experimental results for one of the recorded synaptic connections between pyramidal neuron and inhibitory interneuron (details of the experiments will be reported in Markram et al., in press). Solid line: Model results. The postsynaptic potential is computed using a passive membrane mechanism (τmem dV/dt = −V + Rin Isyn(t)) with an input resistance of Rin = 100 MΩ for the pyramidal target and 1 GΩ for the interneuron. Parameters: (A) τmem = 40 msec; τinact = 3 msec; ASE = 250 pA; τrec = 800 msec; USE = 0.5. (B, C, D) τmem = 60 msec; τinact = 1.5 msec; ASE = 1540 pA; τrec = 130 msec; τfacil = 530 msec; USE = 0.03.
In deriving equation 3.1, we made a further simplification by assuming that the inactivation time constant τin is much faster than the recovery one τrec . This assumption is valid for interpyramidal synapses studied in Markram and Tsodyks (1996) and for pyramidal interneuron synapses (Markram et al., in press). The evolution of postsynaptic current can be obtained from
the remaining equation for y and recalling that Is(t) = ASE y(t):

d⟨y⟩/dt = −⟨y⟩/τin + ⟨U¹SE⟩⟨x⟩ r(t),   (3.2)

which can be simplified to y = r τin U¹SE ⟨x⟩ if one is interested only in the timescale slower than τin. While averaging equation 2.1 over different realizations of Poisson spike trains, we assumed that there is no statistical dependence between the variables x(t) and U¹SE(t) and the probability of spike emission at time t. This is strictly valid only if there is no facilitation since in this case U¹SE is a fixed parameter of the model, and x(t), which is a function of the spike arrival times prior to the current time, is independent of the probability of a spike at time t due to the Poisson assumption. However, if facilitation is included, both x(t) and U¹SE(t) are a function of previous spikes and are not statistically independent. We thus performed simulations of equations 2.1 and 2.2 for populations of presynaptic neurons firing Poisson spike trains with various modulations of their firing rate and compared the resulting postsynaptic current with the solution of the mean-field equation 3.1. We found that in all cases considered, mean-field solutions were good approximations (see, e.g., Figure 2). More detailed analysis, outlined in the appendix, showed that the mean-field approximation works because for all frequencies, either U¹SE or x have small coefficients of variation (CV), and thus the effect of the statistical correlations between them is small.

Equations 3.1 and 3.2 can be solved analytically for an arbitrary modulation of the firing rates of the presynaptic population. In the case of depressing synapses, the solution takes a particularly simple form:

⟨y(t)⟩ = USE r(t) ∫_{−∞}^{t} dt′ exp( −(t − t′)/τrec − USE ∫_{t′}^{t} dt″ r(t″) ).   (3.3)
We use this equation to determine which features of the presynaptic AP train are transmitted by depressing synapses to their targets. Assuming that the presynaptic frequency changes gradually, one can write down the expansion over the derivatives of the frequency. The first two terms of this expansion are,

r/(1 + r USE τrec) + r′/(1 + r USE τrec)³ + · · ·   (3.4)
This expression describes the relative contribution of rate and temporal signaling in generating the postsynaptic response. The first term depends on the current rate, which is dominant for frequencies that are small compared to the limiting frequency λ ∼ 1/(USE τrec ). As the frequency increases, this term saturates, and thus progressively less rate signaling is possible.
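Numerically, the saturation is drastic for the interpyramidal parameters of Figure 1A (USE = 0.5, τrec = 800 ms), for which the limiting frequency is only 1/(0.5 × 0.8 s) = 2.5 Hz. The small Python computation below is our own illustration of the rate term of equation 3.4, not part of the original letter:

    U_SE, tau_rec = 0.5, 0.8               # release fraction; recovery time (s)
    print(1.0 / (U_SE * tau_rec))          # limiting frequency: 2.5 Hz
    for r in (1.0, 5.0, 25.0, 125.0):      # presynaptic rates (Hz)
        print(r, r / (1.0 + r * U_SE * tau_rec))   # rate term of equation 3.4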
Figure 2: Postsynaptic current, generated by Poisson spike trains of a population of 1000 neurons with synchronous transitions from 0 Hz to 15 Hz to 30 Hz and then to 80 Hz, transmitted through depressing (A) and facilitating (B) synapses. Thick line: Solution of mean-field equations 3.1 and 3.2. Thin line: Simulations of 1000 spike trains with the use of the full model in equation 2.1. Parameters are the same as in Figure 1, with ASE = 250 pA.
The main contribution at higher frequencies therefore comes from a transient term reflecting the changes in frequencies. In the context of population signaling, this means that only synchronous transitions in the population activity can be signaled to the postsynaptic neuron (Tsodyks & Markram, 1997).

The solution of the full set of equations 3.1 for facilitating synapse has the same form as equation 3.4, with the single complication due to the fact that U¹SE is now a functional of the frequency

U¹SE = USE ∫_{−∞}^{t} dt′ r(t′) exp( −(t − t′)/τfacil − USE ∫_{t′}^{t} dt″ r(t″) ),   (3.5)
which has to be substituted in equation 3.3. One could still analyze the qualitative features of this solution by noting that at very high frequencies, U¹SE → 1, and thus facilitating synapses behave in the same way as depressing ones, transmitting the information about the rate transitions. As the frequency decreases toward the peak frequency (see Figure 1D),

θ = 1/τfacil + √( 2/τfacil² + (1 + USE)/(USE τrec τfacil) ) ≈ 1/√(USE τrec τfacil),   (3.6)
the presynaptic rate dominates in the postsynaptic response. The reason is that at this frequency, facilitating and depressing effects compensate each other, and the average amplitude of excitatory postsynaptic potential (EPSP), which is ∼ x U¹SE, is approximately constant. At even smaller frequencies, depressing effects become less relevant since x recovers almost to unity between the subsequent spikes. In this regime, the postsynaptic signal mainly reflects the current value of rate amplified by the value of U¹SE:

Is ∼ r(t) ∫_{−∞}^{t} dt′ r(t′) exp(−(t − t′)/τfacil).   (3.7)
The integral in this equation is roughly equal to the number of spikes emitted by the presynaptic neuron in the preceding time window of τfacil. In this regime, postsynaptic response is a delayed and amplified transformation of the presynaptic frequency. As an example, we show in Figure 2 the postsynaptic current resulting from a series of transitions in the firing rate for both depressing and facilitating synapses. All three regimes of transmission via facilitating synapses are illustrated in Figure 2B.

4 Mean-Field Network Dynamics

The analysis of the previous section allows the formulation of a closed system of equations for the dynamics of a large network consisting of subpopulations of neurons with uniform connections. Each population could describe a cortical column, which consists of neurons with similar receptive field properties. At this stage, we assume that at each cortical location, there are only two subpopulations of cortical neurons: pyramidal excitatory neurons and inhibitory interneurons. The coarse-grained equations, describing the firing rates of these populations, have the same form as in Wilson and Cowan (1972) and Amit and Tsodyks (1991),

τe dEr/dt = −Er + g( Σ_{r′} (J^ee_{rr′} y^ee_{r′} − J^ei_{rr′} y^ei_{rr′}) + I^e_r )
τi dIr/dt = −Ir + g( Σ_{r′} (J^ie_{rr′} y^ie_{r′} − J^ii_{rr′} y^ii_{rr′}) + I^i_r ),   (4.1)
where Er (Ir) is the firing rate of excitatory (inhibitory) populations located at the site r; g(x) is a response function usually assumed to be monotonously increasing; and J^ee_{rr′} denotes the absolute strength of the synaptic connection between excitatory neurons in the populations located at r and r′ times the average number of such connections per one postsynaptic neuron, correspondingly for other interactions. Finally, I^e_r (I^i_r) is the external input to the excitatory (inhibitory) population. y^ee_{rr′} (and corresponding values for all other synapses) has to be computed from equations 3.1 and 3.2 for each connection rr′ with the corresponding set of kinetic parameters. Refractoriness of the neurons was ignored for simplicity. These equations reduce to the ones of Wilson and Cowan (1972) if synaptic transmission is frequency independent, in which case xr ≡ 1 and hence yr ∼ Er. In the presence of frequency dependence, they include effects of ever-changing synaptic efficacy due to depression and facilitation. This formulation allows for an analysis of the behavior of the network with any pattern of connections and external inputs. Since the goal of this article is not to consider any particular computational model, we limit ourselves to two examples.

4.1 Network of One Population. As the simplest example, we consider a network that consists of only one population of excitatory neurons. Already in this case, synaptic depression makes the network dynamics nontrivial. Equations 4.1 reduce to,

τ dE/dt = −E(t) + g(J USE x(t) E(t))
dx/dt = −USE E(t) x(t) + (1 − x(t))/τrec.   (4.2)
For convenience, the factor of τin (see equation 3.2) was absorbed in the definition of J. We can solve these equations for the fixed point, where it simplifies to,

E = g( J USE E / (1 + E USE τrec) )   (4.3)
and can be illustrated using the graphical method (see Figure 3A). The right-hand side of equation 4.3 always saturates for arbitrary response functions due to synaptic depression. The system will therefore have a nontrivial fixed point with E > 0, even in cases where without depression there is no stable solution.
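The damped approach to the fixed point (see Figure 3C) can be reproduced by direct Euler integration of equations 4.2 with the linear-threshold gain of equation 4.4. The sketch below is an illustrative Python fragment of our own, using the parameter values listed in the Figure 3 caption:

    import numpy as np

    def one_population(T=2000, dt=1.0, theta=15.0, beta=0.5, J=60.0,
                       U_SE=0.5, tau_rec=800.0, tau=30.0):
        g = lambda h: beta * max(h - theta, 0.0)   # linear-threshold gain, equation 4.4
        E, x = 1.0, 1.0                            # rate (Hz) and available resources
        trace = np.empty(T)
        for k in range(T):
            E += dt * (-E + g(J * U_SE * x * E)) / tau
            x += dt * (-U_SE * E * x / 1000.0 + (1.0 - x) / tau_rec)  # E in Hz, t in ms
            trace[k] = E
        return trace   # damped oscillations toward the fixed point of equation 4.3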
Figure 3: Solution of equations 4.2 for the network of one excitatory population with homogeneous connections. (A) Graphical solution of the fixed-point equation, 4.3. The response function had a linear-threshold shape (dashed line),

g(x) = 0 if x < θ;   g(x) = β(x − θ) if x > θ.   (4.4)

The fixed-point solution is given by the intersection of the two solid lines. (B) Phase diagram of the system in the space of θ and J. (C) The solution of the dynamic equations, 4.2. Parameters in (A) and (C): θ = 15 mV; β = 0.5 mV−1 Hz; J = 60 mV∗Hz−1; USE = 0.5; τrec = 800 msec; τ = 30 msec.
The stability of the fixed-point solution can be analyzed from equations 4.2. The solution is stable if the following matrix has eigenvalues with negative real parts:

( (β J USE x∗ − 1)/τ    β J USE E∗/τ        )
( −USE x∗               −USE E∗ − 1/τrec ),   (4.5)
where E∗ and x∗ are the values of E and x at the fixed point, β = g′(J USE x∗ E∗). For a linear-threshold gain function (β = const) as in Figure 3A, the phase
diagram of the system is shown in Figure 3B. For a given threshold, a fixed-point solution appears when the synaptic strength exceeds the first critical value shown by the lower line on the diagram. Contrary to what one would expect from Figure 3A, this solution remains unstable until the synaptic strength grows above a second critical value (upper line). Even if the fixed-point solution is stable, the system exhibits dampened oscillations before reaching the steady state due to synaptic dynamics (see Figure 3C).

4.2 Network of Two Interconnected Populations. This system was analyzed in Wilson and Cowan (1972) for the case of linear synapses. They showed that if the external inputs are fixed, mean-field equations have two basic types of stable solutions: fixed points and limit cycles with the period on the order of τe, τi. In our case, the mean-field equations have a much richer set of solutions, because in addition to the pair of equations for E and I (see equation 4.1), they also include dynamic equations for synaptic efficacies (see equation 3.1). As a result, in addition to fixed points and simple limit cycles, the system exhibits a variety of rhythmic and irregular solutions that dominate the network behavior but are difficult to analyze in a completely general manner. Two particular novel solutions, one periodic and another irregular, are shown in Figures 4A and 4B.

5 Conclusion

In this study, we introduce a phenomenological model that allows computation of the postsynaptic responses generated by either facilitating or depressing synapses for an arbitrary train of presynaptic spikes. The model was used to define the signals that can be transmitted by these synapses, and we show that signaling through these two types of synapses is fundamentally different at low firing rates but becomes more similar as the firing rate grows.

The model was also used to test the validity of the derivation of self-consistent mean-field equations for the dynamic behavior of large neural networks with arbitrary architecture of external inputs and internal interactions. The formalism was illustrated by considering two simple examples of networks consisting of one and two uniform populations of neurons. The purely excitatory network was shown always to possess a fixed-point solution, which can have arbitrary small firing rates. Adding an inhibitory population greatly increases the repertoire of behaviors, including the irregular sequence of population bursts of various amplitudes. Synaptic dynamics could therefore be an important factor in generating different states of cortical activity as reflected by electroencephalogram recordings.

An important challenge for the proposed formulation remains in analyzing the influence of the synaptic dynamics on the performance of other,
Figure 4: Solutions of equation 4.1 for the network of two populations with homogeneous connections. (A) Population activity E(t) for the parameters Ie = 17 mV; Ii = 15 mV; Jee = 50 mV·Hz⁻¹; Jei = 40; Jie = 70; Jii = 19.5; USE = 0.5 (ee and ei), 0.05 (ie), 0.03 (ii); τrec = 800 msec (ee and ei), 600 msec (ie), 850 msec (ii); τfacil = 1000 msec (ie), 400 msec (ii); τe = 30 msec; τi = 40 msec. (B) The same as in (A) but with Jii = 0. The gain functions for both populations have the same form as in Figure 3.

Note added in proof: After the work was completed, we learned that a network consisting of an excitatory population can have oscillating solutions under some conditions (J. Rinzel, private communication).
Figure 5: Coefficients of variation of USE, x, as well as their product, as a function of presynaptic frequency. Parameters are the same as in Figures 1B–D.
Appendix

The mean-field description in section 3 was derived by adopting the approximation

$$\langle x\, U_{SE}^{1} \rangle = \langle x \rangle \langle U_{SE}^{1} \rangle. \tag{A.1}$$

The relative error of this approximation can be estimated using the Cauchy–Schwarz inequality of probability theory:

$$\frac{\left| \langle x U^{1} \rangle - \langle x \rangle \langle U^{1} \rangle \right|}{\langle x \rangle \langle U^{1} \rangle} \le CV_{x}\, CV_{U}, \tag{A.2}$$

where CVx (CVU) stands for the coefficient of variation of the random variable x (U¹); we use U¹ instead of U¹SE for brevity. Intuitively, this inequality states that if one of the random variables has a small CV, its correlations with the other variables can be neglected. We can now use equation 2.2 to compute the CV of U¹SE for any presynaptic rate r. In the steady state, the result of the
computation is

$$CV_{U}^{2} = \frac{r \tau_{facil} (1-U)^{2}}{2 (1 + r \tau_{facil})^{2} \left( 1 + U r \tau_{facil} (1 - U/2) \right)}. \tag{A.3}$$
CVx can be computed from equation 2.1, again assuming the condition in equation A.1:

$$CV_{x}^{2} = \frac{r \tau_{rec} \left\langle (U^{1})^{2}/2 \right\rangle}{1 + r \tau_{rec} \left\langle U^{1} (1 - U^{1}/2) \right\rangle}. \tag{A.4}$$
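The same question can also be probed by direct simulation. The sketch below is our construction: the per-spike recursions are the standard forms implied by equations 2.1 and 2.2, and the parameter values are merely illustrative of a facilitating synapse rather than the experimentally fitted ones. It drives the synapse with Poisson spike trains and compares the relative factorization error of equation A.1 with the bound CVx·CVU of equation A.2.

```python
# Sketch: Monte Carlo check of the factorization <x u> = <x><u> (eq. A.1)
# against the Cauchy-Schwarz bound CVx*CVu (eq. A.2). The recursions are
# the assumed per-spike forms from section 2; parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
U, tau_rec, tau_facil = 0.03, 0.130, 0.530   # seconds

def spike_time_samples(rate, n=50000):
    """Values of (x, u) at spike times under Poisson input of a given rate."""
    x, u = 1.0, U
    xs, us = [], []
    for k, dt in enumerate(rng.exponential(1.0 / rate, n)):
        # deplete by the current utilization, then recover over the interval
        x = x * (1.0 - u) * np.exp(-dt / tau_rec) + 1.0 - np.exp(-dt / tau_rec)
        u = u * np.exp(-dt / tau_facil)
        u = u + U * (1.0 - u)                # facilitation jump at the spike
        if k > n // 10:                      # discard the initial transient
            xs.append(x); us.append(u)
    return np.array(xs), np.array(us)

for rate in (5.0, 20.0, 50.0, 100.0):
    x, u = spike_time_samples(rate)
    cvx, cvu = x.std() / x.mean(), u.std() / u.mean()
    err = abs((x * u).mean() - x.mean() * u.mean()) / (x.mean() * u.mean())
    print(f"{rate:5.0f} Hz: error {err:.4f} <= bound CVx*CVu = {cvx * cvu:.4f}")
```

The inequality guarantees that the printed error never exceeds the CV product, so the bound of equation A.2 is what actually certifies the mean-field approximation.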
The self-consistency of the mean-field theory can now be checked by plotting the product of CVU and CVx as a function of frequency (see Figure 5). The graph shows that for the set of parameters used in modeling facilitating synapses, derived from experimental traces, the relative error of equation A.1 does not exceed 5 percent for any frequency. More detailed analysis of equations A.3 and A.4 indicates that this error can exceed a 10 percent level only at significantly shorter τfacil and higher values of U, at which the model no longer exhibits facilitating behavior.

Acknowledgments

This study was supported by grants from the Office of Naval Research and the Minerva Foundation. We thank Andreas Herz for discussions and two referees for useful comments on the article.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates I: Substrate—spikes, rates, and neuronal gain. Network, 2, 259–274.
Bertram, R., Sherman, A., & Stanley, E. F. (1996). Single-domain/bound calcium hypothesis of transmitter release and facilitation. J. Neurophysiol., 75, 1919–1931.
Ginsburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Grossberg, S. (1969). On the production and release of chemical transmitters and related topics in cellular control. J. Theor. Biol., 22, 325–364.
Mallart, A., & Martin, A. R. (1966). Two components of facilitation at the neuromuscular junction of the frog. J. Physiol., 193, 677–694.
Markram, H., & Tsodyks, M. V. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Tsodyks, M., & Wang, Y. (In press). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Nat'l Acad. Sci. USA.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends in Neurosci., 17, 119–126.
Tsodyks, M. V., & Markram, H. (1996). Plasticity of neocortical synapses enables transitions between rate and temporal coding. Lect. Notes Comput. Sci., 1112, 445–450.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Nat'l Acad. Sci. USA, 94, 719–723.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T., & McNaughton, B. L. (1997). Paradoxical effect of external modulation of inhibitory neurons. J. Neurosci., 17, 4382–4388.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Zengel, J. E., & Magleby, K. L. (1982). Augmentation and facilitation of transmitter release: A quantitative description at the frog neuromuscular junction. J. Gen. Physiol., 80, 583–611.

Received February 26, 1997; accepted October 30, 1997.
LETTER
Communicated by John Rinzel
Spike Frequency Adaptation Affects the Synchronization Properties of Networks of Cortical Oscillators Sharon M. Crook Center for Computational Biology, Montana State University, Bozeman, MT 59717, U.S.A.
G. Bard Ermentrout Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
James M. Bower Division of Biology, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Oscillations in many regions of the cortex have common temporal characteristics with dominant frequencies centered around the 40 Hz (gamma) frequency range and the 5–10 Hz (theta) frequency range. Experimental results also reveal spatially synchronous oscillations, which are stimulus dependent (Gray & Singer, 1987; Gray, König, Engel, & Singer, 1989; Engel, König, Kreiter, Schillen, & Singer, 1992). This rhythmic activity suggests that the coherence of neural populations is a crucial feature of cortical dynamics (Gray, 1994). Using both simulations and a theoretical coupled oscillator approach, we demonstrate that the spike frequency adaptation seen in many pyramidal cells plays a subtle but important role in the dynamics of cortical networks. Without adaptation, excitatory connections among model pyramidal cells are desynchronizing. However, the slow processes associated with adaptation encourage stable synchronous behavior.

1 Introduction

There is great interest in the mechanisms underlying the oscillatory properties of networks of cortical cells (Gray, 1994). In particular, there are many questions about which properties encourage synchrony, traveling waves of oscillations, or other phase shifts in phase-locked activity that may be computationally significant (Gray & Singer, 1989; Bressler, 1984; Freeman, 1978; Bower, 1995). In this article, we briefly describe a biophysically based compartmental model of an adapting cortical pyramidal cell. Then we use this model to derive a simpler coupled oscillator model that provides insight into the synchronizing properties of networks of cortical oscillators. We verify the predictions of the coupled oscillator model using simulations of networks of the biophysically based model.

Neural Computation 10, 837–854 (1998)
© 1998 Massachusetts Institute of Technology
Biophysical models of excitable membrane usually represent the dynamics of a cell in the general Hodgkin and Huxley (1952) current balance format,

CM dV(t)/dt = −IIon(V, w⃗) + IStim(t),  (1.1)

where V(t) denotes the deviation of the membrane potential from some reference potential at time t, IIon is the sum of voltage- and time-dependent currents through the various ionic channel types, and w⃗ is the vector of auxiliary membrane variables such as intracellular calcium and the gating variables. The stimulus IStim(t) represents the electrode current applied to the soma divided by the total cell membrane area. Membrane potential V(t) is in units of mV, membrane capacitance CM is in units of µF/cm², currents are densities with units of µA/cm², the time unit is ms, and the gating variables appearing in w⃗ describe the fraction of channels of a given type that are in various conducting states at time t. When a more complex spatially distributed model is necessary, a cell model is usually constructed of smaller compartments, which are assumed to be isopotential with uniform physical properties (Segev, Fleshman, & Burke, 1992). In this case, there is a system of current balance equations similar to equation 1.1 that reflect the membrane properties of each particular somatic or dendritic compartment, including any ionic currents, synaptic inputs, and applied currents. Additional terms are included to represent the longitudinal currents flowing between neighboring compartments. In contrast, coupled oscillator models provide a simplified approach that can be useful for representing networks of cells by reducing the number of required equations and providing a context for a more analytical approach. Coupled oscillator models use a single phase variable to approximate the voltage oscillation of each neuron or neural assembly during repetitive firing. The behavior of a pair of coupled oscillators depends critically on the single interaction function chosen to represent the coupling between them. The behavior of our biophysically based compartmental model of an adapting pyramidal cell very closely matches experimental current clamp data from a brain slice preparation of rat piriform cortex (Crook, Ermentrout, Vanier, & Bower, 1997). We use this compartmental model to derive interaction functions that approximate the coupling for pairs of pyramidal cells. Since the interaction functions are derived from the biophysically based model rather than being chosen arbitrarily, they provide an accurate approximation, provide insight into the behavior of networks of coupled pyramidal cells, and illuminate conditions that encourage synchronous oscillations. We are able to verify the behavioral predictions of the coupled oscillator approach with simulations performed for a network composed of the biophysically based compartmental pyramidal cell model.
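As a concrete illustration of the bookkeeping in equation 1.1 (a minimal sketch of ours, with IIon reduced to a single leak conductance and all values illustrative; a realistic model adds the voltage-gated terms listed in the appendix), forward-Euler integration of the current balance format looks as follows.

```python
# Sketch: forward-Euler integration of the current-balance format of
# equation 1.1, with I_Ion reduced to a leak current (illustrative values).
CM, gL, VL = 0.8, 0.1, 0.0        # uF/cm^2, mS/cm^2, mV
dt, T = 0.01, 100.0               # ms
V, trace = 0.0, []                # V is the deviation from the reference
for step in range(int(T / dt)):
    t = step * dt
    I_stim = 2.0 if 20.0 <= t < 80.0 else 0.0   # uA/cm^2 current step
    I_ion = gL * (V - VL)                       # the only ionic current here
    V += dt * (-I_ion + I_stim) / CM            # equation 1.1
    trace.append(V)
print(f"peak depolarization {max(trace):.1f} mV (expected I/gL = 20.0 mV)")
```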
2 Compartmental Model

Action potentials produced by pyramidal cells often occur at a higher frequency during the initial stages of the current injection, with a decreased firing rate at later stages of a sustained injection (Connors, Gutnick, & Prince, 1982; Madison & Nicoll, 1984). Experimental evidence shows that this spike frequency adaptation can be partially suppressed by application of acetylcholine or norepinephrine, which block various potassium currents (Sherman & Koch, 1986; Steriade & Llinas, 1988). Thus, the degree of adaptation is partially determined by the ionic conductance density of the currents responsible for adaptation. The hyperpolarization of the membrane potential due to potassium efflux regulates the firing rate by establishing a relative refractory period for the neuron. Thus, the degree of adaptation also depends on the relative timing of the kinetics of the adaptation currents and on the rate of decay of intracellular calcium, as described by Crook and Ermentrout (1997). We initially developed an adapting pyramidal cell model with one somatic compartment and four dendritic compartments using a current balance equation for each of the five compartments. Additional equations were used to represent the dynamics of the gating variables for the various ionic currents in the soma. The parameters representing the maximal conductances of the ion channels as well as the kinetic parameters were systematically adjusted using an automated parameter search method (Vanier & Bower, 1996). The resulting model accurately reproduces the experimental spiking behavior from a brain slice preparation of rat piriform cortex for a wide range of injected currents (Crook et al., 1997). Then we reduced the five-compartment model to a two-compartment model in the manner of Pinsky and Rinzel (1994). In the reduced model, a single compartment represents the entire dendritic structure; however, the model demonstrates the same qualitative behavior with the same level of accuracy as the full model (Crook, 1996). The currents in the model include a fast-activating voltage-dependent sodium current (INa) and a delayed rectifier potassium current (IK−DR) mediating the generation of simulated action potentials. The model also includes two different currents that contribute to the spike frequency adaptation. One is a noninactivating voltage-dependent potassium current (IK−M), and the other is a calcium-dependent potassium current (IK−AHP). There is also a high-threshold voltage-activated calcium current (ICa) similar to those in other pyramidal cell models (Barkai & Hasselmo, 1994; Traub, Wong, Miles, & Michelson, 1991; Pinsky & Rinzel, 1994). The standard voltage-independent leak currents (IL−S and IL−D) are included, where the current in the soma partially reflects impalement damage. Equations and parameters for the reduced model are provided in the appendix. Experimental evidence shows that after spike frequency adaptation has occurred, pyramidal cells can exhibit oscillations at very low frequencies near the critical applied current required for the onset of repetitive firing
Figure 1: Bifurcation diagram for the adapting pyramidal cell model showing current injection (IStim, µA/cm²) versus membrane voltage deviation from rest (mV). For low current injection values, the cell demonstrates steady-state behavior, as suggested by the stable portion of the steady-state current-voltage curve, shown solid. At IStim ≈ 3.28 µA/cm², a saddle node bifurcation occurs, so that at higher current injection values the cell demonstrates repetitive firing. The periodic branch shows the maximum and minimum voltage of the oscillation produced by a given value of IStim. The diagram depicts the behavior after adaptation has occurred.
(Lanthorn, Storm, & Andersen, 1984; McCormick, Connors, Lighthall, & Prince, 1985; Haberly, 1985; Barkai & Hasselmo, 1994). This behavior is characteristic of membrane models where the transition to repetitive firing occurs due to a saddle node bifurcation. Such models are known as type I membrane models (Rinzel & Ermentrout, 1992). Our biophysical pyramidal cell model demonstrates the characteristic low frequencies typical of type I membranes, and we verify that repetitive firing occurs due to the presence of a saddle node bifurcation, as shown in Figure 1.

3 Coupled Oscillator Model

Network simulations using coupled biophysically based cell models provide a valuable tool for exploring the effects of different parameters. However, the dynamics underlying the network behavior are often obscured by the complicated nature of the model cells. Consider any cortical oscillator
where the dynamics of the cell can be represented by equation 1.1, with an additional term representing a synaptic current. We can assume that the stimulus IStim is constant and spatially homogeneous so that the neuron is capable of spontaneously oscillating in the absence of synaptic current (Rinzel & Ermentrout, 1992). Alternatively, one can hypothesize that the cell acts as an oscillator due to the local network interactions with inhibitory neurons, as described in various models of cortical networks (Eeckman & Freeman, 1990; Whittington, Traub, & Jefferys, 1995; Wilson & Bower, 1991, 1992). In either case, V(t) denotes the oscillation of a single uncoupled oscillator, so that V(t) can be approximated by V(θ(t)), where θ(t) represents the periodic phase of the oscillator. The phase variable θ(t) lies in the interval [0, T], where T is the period of the oscillation. If the intrinsic frequency of the oscillator is ω, then the phase satisfies

dθ(t)/dt = ω.  (3.1)
This single-variable phase model approximates the repetitive behavior of the voltage oscillation, but no amplitude information is retained.

3.1 Interaction Function. A synaptic current with no delay or spatial dependence has the form

ISyn(t) = ḡSyn S[V̂(t)] (V(t) − VSyn),  (3.2)
where ḡSyn denotes the maximal conductance for the synapse, VSyn is the synaptic reversal potential, and S[·] is some functional of the presynaptic voltage V̂(t), which provides the synaptic time course and is equivalent to an alpha function or dual exponential. Now consider two oscillators identical to those of equation 1.1 that are coupled symmetrically with no delay or spatial dependence:

C dV1(t)/dt = −IIon(V1, w⃗1) + IStim − ḡSyn S[V2(t)] (V1(t) − VSyn)  (3.3)
C dV2(t)/dt = −IIon(V2, w⃗2) + IStim − ḡSyn S[V1(t)] (V2(t) − VSyn).  (3.4)
If ḡSyn is small, then it is possible to average the equations, leading to a phase model for the interactions between the neurons (Ermentrout & Kopell, 1984; Kuramoto, 1984). This phase reduction approach has been used by numerous authors in order to understand the dynamics of interacting neural oscillators where the coupling is weak (Ermentrout & Kopell, 1991; Cohen et al., 1992). The phases of the oscillators in this coupled system satisfy

dθ1(t)/dt = ω + ḡ H(θ2(t) − θ1(t))  (3.5)
dθ2(t)/dt = ω + ḡ H(θ1(t) − θ2(t)),  (3.6)
where ḡ denotes the coupling strength, and the interaction function H(·) is periodic and is determined by the form of the synaptic coupling and the nature of the uncoupled oscillation (Ermentrout & Kopell, 1990). Note that the interaction function depends only on the phase difference between the two oscillators, φ(t) = θ2(t) − θ1(t). The behavior of the pair of coupled oscillators depends critically and solely on the periodic interaction function chosen to represent the coupling. We can use our biophysical model to compute an interaction function H(φ) that is representative of a particular connection between two model cells. This is done by averaging the synaptic influence of the presynaptic cell over the cycle of the postsynaptic cell's oscillation. For a synaptic connection of the form provided in equation 3.2, the interaction function is

H(φ) = (1/T) ∫₀ᵀ Z(t) (−ḡSyn S[V(t + φ)] (V(t) − VSyn)) dt.  (3.7)
The function Z(t) is called the infinitesimal phase response curve (PRC). Thus, the net effect of this calculation is a convolution of the PRC with the function that describes the form of the synaptic coupling (Ermentrout, 1996). The PRC is determined by the phase shifts that result from infinitesimally small perturbations during repetitive firing (Kuramoto, 1984; Hansel, Mato, & Meunier, 1995). It is possible to obtain a numerical computation that approximates this function (Ermentrout, 1996). A positive PRC indicates that a depolarizing perturbation at that time in the cycle will advance the phase of the oscillator, causing it to fire earlier. In contrast, a negative PRC indicates that a depolarizing perturbation will delay the phase, so the cell fires later. The function that describes the form of the synaptic coupling is chosen so that the resulting model synaptic current matches the experimental excitatory postsynaptic potentials recorded from a pyramidal cell in a slice preparation from layer Ib of rat olfactory cortex (Haberly & Bower, 1984). In this case, ḡSyn = 1 mS/cm², VSyn = 30 mV, and the synaptic time course is equivalent to the dual exponential,

α(t) = 2.75 (exp(−t/τ1) − exp(−t/τ2)) / (τ1 − τ2),  (3.8)

where τ1 = 2.8 and τ2 = 0.65. The resulting interaction function is insensitive to small changes in the form of the synaptic coupling; however, large changes in the synaptic time course can lead to a qualitative change in the dynamics of the coupled system (Crook, Ermentrout, & Bower, in press).
where τ1 = 2.8 and τ2 = .65. The resulting interaction function is insensitive to small changes in the form of the synaptic coupling; however, large changes in the synaptic time course can lead to a qualitative change in the dynamics of the coupled system (Crook, Ermentrout, & Bower, in press). 3.2 Phase-Locked Solutions. Once an interaction function has been computed, we use it to determine the phase-locked solutions to the simpler
Spike Frequency Adaptation
843
3.2 Phase-Locked Solutions. Once an interaction function has been computed, we use it to determine the phase-locked solutions to the simpler coupled oscillator system. These are the solutions for which the phase difference φ(t) = θ2(t) − θ1(t) is constant. For example, φ(t) ≡ 0 corresponds to the synchronous phase-locked solution. Determining the phase shift and stability of these solutions provides insight into the behavior of the more complicated biophysical system. From equations 3.5 and 3.6, we have

dφ(t)/dt = ḡ (H(−φ) − H(φ))  (3.9)
         = −2ḡ Hodd(φ),  (3.10)

since the even components cancel in the case of symmetric coupling. Any solution to the equation dφ(t)/dt = 0 is a phase-locked solution to the system, so the phase-locked solutions correspond to the zeros of the odd component of the interaction function. Linearizing near a fixed solution φ̄, we obtain

dφ(t)/dt ≈ [−2ḡ H′odd(φ̄)] φ(t).  (3.11)

When [−2ḡ H′odd(φ̄)] < 0, the solution is stable, so any particular phase-locked solution φ̄ is stable when H′odd(φ̄) > 0. Thus, we need only look at the form of the odd component of the interaction function near the zeros to predict the behavior of the system of two coupled cells (Ermentrout, 1996; Hansel, Mato, & Meunier, 1993).
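The recipe of equations 3.7 through 3.11 condenses into a few lines of numerics. In the sketch below, the PRC, the voltage cycle, and the coupling strength are toy choices of ours rather than the functions computed from the compartmental model; only the dual-exponential time course of equation 3.8 is taken from the text. H(φ) is assembled on a grid, its odd part extracted, and the zeros classified by the sign of H′odd.

```python
# Sketch: phase-locked solutions from an interaction function (eqs. 3.7-3.11).
# Z, V, and the coupling are toy choices, not the model-derived functions.
import numpy as np

T, n = 100.0, 2000
t = np.linspace(0.0, T, n, endpoint=False)
Z = 1.0 - np.cos(2 * np.pi * t / T - 1.0)        # toy nonnegative PRC
V = -60.0 + 40.0 * np.cos(2 * np.pi * t / T)     # toy voltage cycle (mV)
tau1, tau2, g, Vsyn = 2.8, 0.65, 1.0, 30.0       # dual exponential, eq. 3.8
S = 2.75 * (np.exp(-t / tau1) - np.exp(-t / tau2)) / (tau1 - tau2)

# H(phi) on the grid: equation 3.7 with S evaluated at t + phi
H = np.array([np.mean(Z * (-g * np.roll(S, -k) * (V - Vsyn)))
              for k in range(n)])
Hodd = 0.5 * (H - np.roll(H[::-1], 1))           # odd part of H
dHodd = np.gradient(Hodd, t)

# zeros of Hodd are phase-locked states; stable where H'_odd > 0 (eq. 3.11)
for k in range(n):
    a, b = Hodd[k - 1], Hodd[k]
    if (a < 0 <= b) or (a > 0 >= b):
        kind = "stable" if dHodd[k] > 0 else "unstable"
        print(f"phase-locked solution near phi = {t[k]:6.2f} ms  ({kind})")
```

With these toy functions the sketch reports a stable solution at φ = 0 and an unstable one at φ = T/2; substituting the model-derived Z and synaptic waveform is what produces the transitions shown in Figure 2.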
4 Results

We compute the interaction functions that are representative of the behavior of two coupled model pyramidal cells for different levels of adaptation. The strength of the adaptation is varied by changing the maximal conductances of the currents responsible for adaptation in the biophysical model. First we eliminate the adaptation currents completely, compute the interaction function, and determine the phase-locked solutions. In this case, we find that the synchronous phase-locked solution is unstable. When we gradually strengthen the level of adaptation, we see a transition to stable synchrony. Figure 2 demonstrates this transition, where the panels depict Hodd(φ) for two pyramidal cells coupled with excitatory synapses as the level of adaptation grows. Simulations of two synaptically coupled biophysical cell models verify that the behavior predicted by the coupled oscillator model holds for the biophysical model as well. Although the synchronous solution is unstable in the spiking model with excitatory coupling and no spike frequency adaptation, adding spike frequency adaptation to the cell model leads to stable synchrony. The simulation results are summarized in the schematic in Figure 3. Coupled oscillator models are valid when the coupling between oscillators is weak. The coupling parameter must also be smaller than all other parameters in the model. We use these biophysical simulations to verify that the same qualitative results hold for strong coupling as well and that the results are not affected by the size of the other parameters in the model.
Figure 2: Odd part of the interaction function for two pyramidal cells coupled with excitatory synapses as the level of slow adaptation grows. The voltage traces on the right demonstrate the corresponding level of adaptation. Here we have set ḡK−M = 0, fixed ḡK−AHP = 7 mS/cm², and gradually increased the adaptive influence of IK−AHP by increasing ḡCa. We obtain the same transition if we eliminate IK−AHP and vary ḡK−M. The phase difference φ lies between 0 and T, where T ms is the period of the oscillation. Filled circles indicate stable phase-locked solutions. (A, B) In these panels, corresponding to lower levels of adaptation, the synchronous solution is unstable, and the antiphase solution is stable. (C) When the level of adaptation is large enough, there is a bifurcation so that two stable solutions appear near the antiphase solution. (D) High levels of adaptation lead to stable synchrony.
Network simulations demonstrate that if the synchronous state of a pair of neurons is unstable, then a globally coupled network of such neurons cannot synchronize fully (Hansel et al., 1995). This is true for our pyramidal cell model when we perform simulations with a small network of model cells with symmetric all-to-all coupling. Figure 4 depicts the results for simulations with and without spike frequency adaptation. Panel A shows the lack of synchrony inherent to networks of cells with no spike frequency adaptation, and panel B demonstrates the synchronizing properties of cells that include the slow processes for adaptation.

Figure 3: Schematic depicting the phase-locked solutions φ(t) for varying levels of slow adaptation. Solid curves and lines correspond to the stable phase-locked solutions, and dashed lines correspond to the unstable phase-locked solutions.

Ermentrout (1996) examines the solutions and the PRCs produced by depolarizing perturbations to type I spiking membrane models. He finds that these oscillators have nonnegative PRCs due to the fact that the minimum of the curve occurs at the spike. In this case, a depolarizing perturbation will always advance the phase of the oscillator, causing it to fire earlier. The analysis is valid whenever all dynamic processes are faster than the timescale of the period of the oscillation. This is the case in our model with no adaptation, as shown in Figure 5E. Hansel et al. (1995) show that unless excitatory synapses are very fast, synchrony is not possible for excitatory coupling when the PRC is nonnegative. This is consistent with the lack of synchrony observed in our simulations of two model cells with no adaptation and only fast processes. However, we find that in the presence of adaptation, the slow processes associated with the adaptation currents alter the dynamics so that the slope of the PRC is initially negative for our type I model. This leads to negative values on the portion of the domain immediately following the action potential, as shown in Figure 5A. On this negative portion of the domain, a depolarizing perturbation will delay the phase of the oscillator, causing it to fire later. The delay occurs due to the high level of intracellular calcium following each action potential, which allows a depolarizing perturbation to activate the hyperpolarizing adaptation currents.
Figure 4: Voltage traces depicting the behavior of a small group of model pyramidal cells with symmetric all-to-all coupling. (A) The results for cells without spike frequency adaptation, where synchronous initial conditions lead to out-of-phase behavior. (B) The results for model cells with weak spike frequency adaptation. The applied currents are begun at different times so that the cells are out of phase following adaptation; over time, the cells synchronize.
This difference in the PRC accounts for the change in behavior observed as we increase the level of adaptation. In the simulations of two coupled model cells with spike frequency adaptation, the phase of one model cell is advanced and the other is delayed until they are firing synchronously. As expected, when we increase the speed of the processes that are responsible for adaptation, we find that the PRC becomes more similar to the one computed in the case of no adaptation. This is demonstrated in Figures 5B–D, where we scale the speed of the processes responsible for adaptation by 15, 30, and 45, respectively. The speed of the change in the voltage-gated IK−M current is increased by scaling the equation for the change in the gating variable. In contrast, the intrinsic gating of the IK−AHP current is rapid.
Figure 5: Membrane potentials, levels of intracellular calcium, and IK−AHP currents in the model under varying circumstances, with corresponding phase response functions and interaction functions shown on the right. Phase differences have been normalized to the period of oscillation for easy comparison, where the peak of the voltage oscillation occurs at time zero. Once again, filled circles indicate stable phase-locked solutions. (A) Results for the adapting model developed to match experimental data. (B–D) The equation for the gating variable for IK−M and the term controlling the calcium depletion are scaled by 15, 30, and 45, respectively. (E) Results for the model with no adaptation. In the case of fast processes, the PRC is very similar to the one computed for the model with no adaptation. This demonstrates that the slow timescale of the adaptation is required for the change in behavior.
It is the kinetics of the intracellular calcium that determine the degree of adaptation (Lancaster & Zucker, 1994). Thus, this process is altered by increasing the speed of the depletion of intracellular calcium. The resulting changes in the interaction functions show that it is the slower processes that allow spike frequency adaptation to encourage the synchronization of cortical networks.

5 Discussion

Experimental evidence suggests that cholinergic drugs induce synchronous theta rhythm and gamma rhythm oscillations in electroencephalogram (EEG) recordings from hippocampus (Konopacki, MacIver, Bland, & Roth, 1987; Bland, Colom, Konopacki, & Roth, 1988) and olfactory cortex (Biedenbach, 1966). In these experiments, the typical cholinergic effect is an increase in the number of fast gamma oscillations seen in spontaneous EEG activity. In addition, the frequency of the slow theta rhythm decreases as the number of gamma oscillations increases. Traub, Miles, and Buzsáki (1992) replicate these effects in network biophysical simulations of hippocampus, and Barkai, Bergman, Horwitz, and Hasselmo (1994) do the same with simulations of olfactory cortex. In these models, cholinergic modulation is simulated with a reduction in the maximal conductances for the potassium currents responsible for adaptation. Decreasing the level of adaptation leads to an increase in the firing frequency of pyramidal cells. These slower outward adaptation currents are also responsible for setting the frequency of the slow rhythm, so cholinergic modulation, which reduces these currents, also causes a decrease in the frequency of the theta rhythm. The partial (not total) reduction in the level of adaptation is consistent with results that show partial recovery of this conductance during sustained application of acetylcholine (Benardo & Prince, 1982). Our result suggests that the remaining spike frequency adaptation is not only crucial for maintaining the slow rhythm in these simulations but is also necessary for the synchronization of the oscillations during periods of repetitive firing. In the hippocampal model of Traub et al. (1992), the fast and slow inhibitory postsynaptic potentials are blocked. Thus, the network synchronization must be a product of the cellular properties and the excitatory coupling among pyramidal cells. Other models with intact spike frequency adaptation suggest that in some situations, synchronization may occur due to the influence of inhibitory interneurons (Traub, Miles, & Wong, 1987a; Traub, Miles, Wong, Schulman, & Schneiderman, 1987b; Traub et al., 1996). However, even in these models, which include inhibition, the network behavior is modified by the participation of pyramidal cells, and the spike frequency adaptation could contribute to the stability of the synchronous gamma oscillations. More recent experimental results demonstrate a different mechanism in visual cortex and other neocortical areas (Munk, Roelfsema, König, Engel,
& Singer, 1996; Steriade, Amzica, & Contreras, 1996). In these areas, robust slow theta rhythms are present during drowsiness, deep sleep, and anesthesia; however, not much gamma activity is seen. With arousal, there is an increase in the release of acetylcholine, low-frequency theta oscillations diminish, and synchronous fast gamma-range oscillations are enhanced. Note that spike frequency adaptation can encourage synchrony even where it is too weak to cause the silent periods characteristic of the slower theta rhythm. This is evident in the simulation results shown in Figure 4B. Thus it is possible that spike frequency adaptation contributes to the stable synchronous oscillatory behavior in these areas as well. It is worth noting that even in the presence of spike frequency adaptation, one should not assume that all excitatory connections among pyramidal cells are synchronizing. Delays such as the conduction delays introduced by lengthy axons (Crook et al., 1997) or even the delays introduced by distal excitatory synapses (Crook et al., in press) can introduce phase lags that prevent synchrony.

Appendix

Current Balance Equations

CM dVS/dt = −INa(VS, m, h) − IK−DR(VS, n) − ICa(VS, s, r) − IK−AHP(VS, q) − IK−M(VS, w) − IL−S(VS) − gc(VS − VD)/P + IStim/P
CM dVD/dt = −IL−D(VD) − gc(VD − VS)/(1 − P),
where VS and VD are the deviations of the somatic and dendritic membrane potentials from the reference potential of −77 mV, gc is the coupling conductance parameter, and the current scaling parameter P is the proportion of the cell area taken up by the soma.

Ionic Currents

INa(VS, m, h) = ḡNa m²h (VS − VNa)
IK−DR(VS, n) = ḡK−DR n (VS − VK)
ICa(VS, s, r) = ḡCa s²r (VS − VCa)
IK−AHP(VS, q) = ḡK−AHP q (VS − VK)
IK−M(VS, w) = ḡK−M w (VS − VK)
IL−S(VS) = ḡL−S (VS − VL)
IL−D(VD) = ḡL−D (VD − VL).

Kinetic Equations

The kinetic equations for the gating variables have the form ẏ(u) = (y∞(u) − y(u))/τy(u). The functions that determine the kinetic equations are listed below. In some cases we give the functions in
the form αy(u) and βy(u), where y∞(u) = αy(u)/(αy(u) + βy(u)) and τy(u) = 1/(αy(u) + βy(u)).

αm(V) = 0.32(30.1 − V)/(exp(0.25(30.1 − V)) − 1)
βm(V) = 0.28(V − 57.1)/(exp((V − 57.1)/5.0) − 1)
αh(V) = 0.128 exp((34 − V)/18)
βh(V) = 4/(exp((57 − V)/5) + 1)
αn(V) = 0.059(52.1 − V)/(exp((52.1 − V)/5) − 1)
βn(V) = 0.925 exp(0.925 − 0.025V)
αs(V) = 0.912/(exp(−0.072(V − 82)) + 1)
βs(V) = 0.0114(V − 68.1)/(exp((V − 68.1)/5) − 1)
αr(V) = min(0.005, 0.005 exp(−(V − 17)/20))
βr(V) = 0.005 − αr(V)
q∞(Ca) = (0.0005 Ca)²
τq(Ca) = 0.0338/(min(0.00001 Ca, 0.01) + 0.001)
w∞(V) = 1/(exp(−(V − 42)/10) + 1)
τw(V) = 92 exp(−(V − 42)/20)/(1 + 0.3 exp(−(V − 42)/10))
Calcium Handling

dCa/dt = −B ICa − Ca/τCa,

where the variable Ca represents the intracellular free calcium level, B = 3, and τCa = 60 ms.

Model Parameters

The maximal conductances in units of mS/cm² are ḡNa = 221, ḡK−DR = 47, ḡCa = 8.5, ḡK−AHP = 7, and ḡK−M = 6.5. The maximal conductance of the leak current is ḡL−S = 2 in the soma compartment and ḡL−D = 0.05 in the dendrite compartment. The reversal potentials in units of mV are VNa = 132, VK = −13, VL = 0, and VCa = 197. The capacitance is CM = 0.8 µF/cm². The coupling parameter is gc = 1.1 mS/cm², and the current scaling parameter is P = 0.05.
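Because the appendix specifies the reduced model completely, it can be integrated directly. The following forward-Euler sketch is our transcription of the equations above; the step size, stimulus amplitude, spike-detection threshold, and initial conditions are illustrative choices of ours, and no attempt is made to reproduce the published traces. With a sustained somatic current above the critical value, the interspike intervals should lengthen as the adaptation currents build up.

```python
# Sketch: forward-Euler integration of the reduced two-compartment model
# from the appendix. dt, I_stim, initial state, and the spike threshold
# are our choices; voltages are deviations from -77 mV.
import math

CM, gc, P = 0.8, 1.1, 0.05
gNa, gKDR, gCa, gAHP, gKM = 221.0, 47.0, 8.5, 7.0, 6.5
gLS, gLD = 2.0, 0.05
VNa, VK, VL, VCa = 132.0, -13.0, 0.0, 197.0
B, tauCa = 3.0, 60.0

def ratio(x):                      # x / (exp(x) - 1), safe at x = 0
    return x / math.expm1(x) if abs(x) > 1e-9 else 1.0

def rates(V):                      # alpha/beta functions from the appendix
    am = 1.28 * ratio(0.25 * (30.1 - V))
    bm = 1.40 * ratio((V - 57.1) / 5.0)
    ah = 0.128 * math.exp((34.0 - V) / 18.0)
    bh = 4.0 / (math.exp((57.0 - V) / 5.0) + 1.0)
    an = 0.295 * ratio((52.1 - V) / 5.0)
    bn = 0.925 * math.exp(0.925 - 0.025 * V)
    aS = 0.912 / (math.exp(-0.072 * (V - 82.0)) + 1.0)
    bS = 0.057 * ratio((V - 68.1) / 5.0)
    ar = min(0.005, 0.005 * math.exp(-(V - 17.0) / 20.0))
    return am, bm, ah, bh, an, bn, aS, bS, ar, 0.005 - ar

VS = VD = 0.0
Ca, q = 0.0, 0.0
am, bm, ah, bh, an, bn, aS, bS, ar, br = rates(VS)
m, h, n = am / (am + bm), ah / (ah + bh), an / (an + bn)
s, r = aS / (aS + bS), ar / (ar + br)
w = 1.0 / (math.exp(4.2) + 1.0)    # w_inf at VS = 0
dt, T, Istim = 0.002, 500.0, 4.0   # ms, ms, uA/cm^2
spikes, prev = [], VS
for k in range(int(T / dt)):
    am, bm, ah, bh, an, bn, aS, bS, ar, br = rates(VS)
    ICa = gCa * s * s * r * (VS - VCa)
    Iion = (gNa * m * m * h * (VS - VNa) + gKDR * n * (VS - VK) + ICa
            + gAHP * q * (VS - VK) + gKM * w * (VS - VK) + gLS * (VS - VL))
    dVS = (-Iion - gc * (VS - VD) / P + Istim / P) / CM
    dVD = (-gLD * (VD - VL) - gc * (VD - VS) / (1.0 - P)) / CM
    m += dt * (am * (1 - m) - bm * m)
    h += dt * (ah * (1 - h) - bh * h)
    n += dt * (an * (1 - n) - bn * n)
    s += dt * (aS * (1 - s) - bS * s)
    r += dt * (ar * (1 - r) - br * r)
    winf = 1.0 / (math.exp(-(VS - 42.0) / 10.0) + 1.0)
    tauw = 92.0 * math.exp(-(VS - 42.0) / 20.0) / (1.0 + 0.3 * math.exp(-(VS - 42.0) / 10.0))
    w += dt * (winf - w) / tauw
    qinf = (0.0005 * Ca) ** 2
    tauq = 0.0338 / (min(0.00001 * Ca, 0.01) + 0.001)
    q += dt * (qinf - q) / tauq
    Ca += dt * (-B * ICa - Ca / tauCa)
    VS, VD = VS + dt * dVS, VD + dt * dVD
    if prev < 60.0 <= VS:          # crude upward-crossing spike detector
        spikes.append(k * dt)
    prev = VS
isi = [b - a for a, b in zip(spikes, spikes[1:])]
if isi:
    print(f"{len(spikes)} spikes; ISI from {isi[0]:.1f} to {isi[-1]:.1f} ms")
else:
    print("no repetitive firing at this stimulus")
```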
Acknowledgments

This work was supported by the NIH and NIMH. We thank J. Rinzel, A. Sherman, and P. Latham for helpful discussion and comments. The reduced pyramidal cell model is available via ftp://www.nervana.montana.edu/pub/users/crook/pyr.ode.
References

Barkai, E., Bergman, R. E., Horwitz, G., & Hasselmo, M. E. (1994). Modulation of associative memory function in a biophysical simulation of rat piriform cortex. Journal of Neurophysiology, 72, 659–677.
Barkai, E., & Hasselmo, M. E. (1994). Modulation of the input/output function of rat piriform cortex pyramidal cells. Journal of Neurophysiology, 72, 644–658.
Benardo, L. S., & Prince, D. A. (1982). Ionic mechanisms of cholinergic excitation in mammalian hippocampal pyramidal cells. Brain Research, 249, 333–344.
Biedenbach, M. A. (1966). Effects of anesthetics and cholinergic drugs on prepyriform electrical activity in cats. Experimental Neurology, 16, 464–479.
Bland, B. H., Colom, L. V., Konopacki, J., & Roth, S. H. (1988). Intracellular records of carbachol-induced theta rhythm in hippocampal slices. Brain Research, 447, 364–368.
Bower, J. M. (1995). Reverse engineering the nervous system: An in vivo, in vitro, and in computo approach to understanding the mammalian olfactory system. In S. Zornetzer, J. Davis, & C. Lau (Eds.), An introduction to neural and electronic networks (pp. 3–28). San Diego: Academic Press.
Bressler, S. L. (1984). Spatial organization of EEGs from olfactory bulb and cortex. Electroencephalography and Clinical Neurophysiology, 57, 270–276.
Cohen, A., Ermentrout, G. B., Kiemel, T., Kopell, N., Sigvardt, K. A., & Williams, T. L. (1992). Modeling of intersegmental coordination in the lamprey central pattern generator for locomotion. Trends in Neuroscience, 15, 434–438.
Connors, B. W., Gutnick, M. J., & Prince, D. A. (1982). Electrophysiological properties of neocortical neurons in vitro. Journal of Neurophysiology, 48, 1302–1320.
Crook, S. M. (1996). The role of delay in oscillatory models of olfactory cortex. Unpublished doctoral dissertation, University of Maryland.
Crook, S. M., & Ermentrout, G. B. (1997). An analysis of the adaptive behavior of piriform cortex pyramidal cells. In Computational neuroscience: Trends in research 1996. New York: Plenum.
Crook, S. M., Ermentrout, G. B., & Bower, J. M. (In press). Dendritic and synaptic effects in systems of coupled cortical oscillators. Journal of Computational Neuroscience.
Crook, S. M., Ermentrout, G. B., Vanier, M. C., & Bower, J. M. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. Journal of Computational Neuroscience, 4, 161–172.
Eeckman, F. H., & Freeman, W. J. (1990). Correlations between unit firing and EEG in the rat olfactory system. Brain Research, 528, 238–244.
Engel, A. K., König, P., Kreiter, A. D., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15, 218–226.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.
Ermentrout, G. B., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators. SIAM Journal on Mathematical Analysis, 15, 215–237.
Ermentrout, G. B., & Kopell, N. (1990). Oscillator death in systems of coupled neural oscillators. SIAM Journal of Applied Mathematics, 50, 125–146.
Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. Journal of Mathematical Biology, 29, 195–217.
Freeman, W. J. (1978). Spatial properties of an EEG event in the olfactory bulb and cortex. Electroencephalography and Clinical Neurophysiology, 44, 586–605.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. Journal of Computational Neuroscience, 1, 11–38.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1987). Stimulus-specific neuronal oscillations in the cat visual cortex: A cortical functional unit. Society of Neuroscience Abstracts, 13, 404.3.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proceedings of the National Academy of Science, 86, 1698–1702.
Haberly, L. B. (1985). Neuronal circuitry in olfactory cortex: Anatomy and functional implications. Chemical Senses, 10, 219–238.
Haberly, L. B., & Bower, J. M. (1984). Analysis of association fiber system in piriform cortex with intracellular recording and staining techniques. Journal of Neurophysiology, 51, 90–112.
Hansel, D., Mato, G., & Meunier, C. (1993). Clustering and slow switching in globally coupled phase oscillators. Physical Review E, 48, 3470–3477.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 192–210.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. Journal of Physiology (London), 117, 500–544.
Konopacki, J., MacIver, M. B., Bland, B. H., & Roth, S. H. (1987). Carbachol-induced EEG "theta" activity in hippocampal brain slices. Brain Research, 405, 196–199.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. New York: Springer-Verlag.
Lancaster, B., & Zucker, R. S. (1994). Photolytic manipulation of Ca2+ and the time course of slow, Ca2+-activated K+ current in rat hippocampal neurones. Journal of Physiology (London), 475, 229–239.
Lanthorn, T., Storm, J., & Andersen, P. (1984). Current-to-frequency transduction in CA1 hippocampal pyramidal cells: Slow prepotentials dominate the primary range firing. Experimental Brain Research, 53, 431–443.
Madison, D. V., & Nicoll, R. A. (1984). Control of the repetitive discharge of rat CA1 pyramidal neurones in vitro. Journal of Physiology (London), 354, 319–331.
McCormick, D. A., Connors, B., Lighthall, J., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. Journal of Neurophysiology, 54, 782–806.
Munk, M. H. J., Roelfsema, P. R., König, P., Engel, A. K., & Singer, W. (1996). Role of reticular activation in the modulation of intracortical synchronization. Science, 272, 271–274.
Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. Journal of Computational Neuroscience, 1, 39–60.
Rinzel, J., & Ermentrout, G. B. (1992). Analysis of neural excitability. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 135–169). Cambridge, MA: MIT Press.
Segev, I., Fleshman, J. W., & Burke, R. E. (1992). Compartmental models of complex neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 63–96). Cambridge, MA: MIT Press.
Sherman, S. M., & Koch, C. (1986). The control of retinogeniculate transmission in the mammalian lateral geniculate nucleus. Experimental Brain Research, 63, 1–20.
Steriade, M., Amzica, F., & Contreras, D. (1996). Synchronization of fast (30–40 Hz) spontaneous cortical rhythms during brain activation. Journal of Neuroscience, 16, 392–417.
Steriade, M., & Llinas, R. R. (1988). The functional states of the thalamus and the associated neuronal interplay. Physiology Review, 68, 649–742.
Traub, R., Miles, R., & Buzsáki, G. (1992). Computer simulation of carbachol-driven rhythmic population oscillations in the CA3 region of the in vitro rat hippocampus. Journal of Physiology, 451, 653–672.
Traub, R., Miles, R., & Wong, R. (1987a). Models of synchronized hippocampal bursts in the presence of inhibition. I. Single population events. Journal of Neurophysiology, 58, 739–751.
Traub, R., Miles, R., Wong, R., Schulman, L. S., & Schneiderman, J. H. (1987b). Models of synchronized hippocampal bursts in the presence of inhibition. II. Ongoing spontaneous population events. Journal of Neurophysiology, 58, 752–764.
Traub, R. D., Whittington, M. A., Stanford, I. M., & Jefferys, J. G. R. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
Traub, R., Wong, R., Miles, R., & Michelson, H. (1991). A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. Journal of Neurophysiology, 66, 635–649.
Vanier, M. C., & Bower, J. M. (1996). A comparison of automated parameter-searching methods for neural models. In J. M. Bower (Ed.), Computational neuroscience (pp. 477–482). San Diego: Academic Press.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wilson, M., & Bower, J. M. (1991). A computer simulation of oscillatory behavior in primary visual cortex. Neural Computation, 3, 498–509.
Wilson, M., & Bower, J. M. (1992). Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. Journal of Neurophysiology, 67, 981–995.

Received June 9, 1997; accepted October 2, 1997.
LETTER
Communicated by David Ferster
GABAergic Inhibitory Control of the Transient and Sustained Components of Orientation Selectivity in a Model Microcolumn in Layer 4 of Cat Visual Cortex Paul Bush Nicholas Priebe Department of Physiology, University of California, San Francisco, San Francisco, CA 94143-0444, U.S.A.
Recently proposed models of orientation tuning in layer 4 of cat primary visual cortex (Somers, Nelson, & Sur, 1995; Douglas, Koch, Mahowald, Martin, & Suarez, 1995) rely on widespread inhibitory intracortical connections to suppress the nonoptimal component of a broadly tuned thalamic input, while local excitatory intracortical connections amplify the optimal component. However, new experimental data (Ferster, Chung, & Wheat, 1996) and theoretical analyses (Ferster, 1987; Krukowski, Priebe, & Miller, 1996) show that the temporally modulated component of thalamic input is well tuned and that the cortical circuitry must simply subtract an unmodulated DC component at nonoptimal orientations to obtain sharp tuning. In addition, within a single hypercolumn in layer 4, inhibitory and excitatory layer 4 neurons have approximately equal-sized axonal fields, making most of their synapses within their own dendritic field (Kisvarday, Martin, Whitteridge, & Somogyi, 1985; Martin & Whitteridge, 1984). We have constructed a model of a single microcolumn in which GABAB inhibition subtracts the DC and controls the sustained response, while GABAA inhibition controls the response to transient and suprathreshold inputs. The model fits experimental data based on stimulation with drifting sine-wave gratings as well as flashed bars, explains a counterintuitive property of the GABAB K+ conductance, and at suboptimal orientations and submaximal contrasts produces an exponential distribution of firing frequencies.

Neural Computation 10, 855–867 (1998)
© 1998 Massachusetts Institute of Technology

1 Introduction

A strong body of experimental evidence now exists (Chapman, Zahs, & Stryker, 1991; Reid & Alonso, 1995; Ferster et al., 1996) showing that the sharp orientation tuning of simple cells is produced by a linear arrangement of lateral geniculate nucleus (LGN) inputs (but see Pei, Vidyasagar, Volgushev, & Creutzfeldt, 1994). Optimal stimuli activate lines of LGN inputs simultaneously to produce large depolarizations in the target simple cell. However, theoretical studies (Ferster, 1987; Krukowski et al., 1996; Troyer,
Krukowski, Priebe, & Miller, 1997) show that nonoptimal stimuli result in a smaller, sustained input as the LGN afferents are activated sequentially. Since fully nonoptimal (cross-orientation) stimuli produce no change or even a slight hyperpolarization in simple cell membrane potentials (Berman, Douglas, Martin, & Whitteridge, 1991), this nonoptimal input is suppressed by some form of inhibition. Contrast-dependent inhibition for nonoptimal stimuli has been demonstrated in complex cells (Sclar & Freeman, 1982). This inhibition cannot be of the shunting type, since cellular input conductance changes by less than 15% during nonoptimal stimulation (Berman et al., 1991). In our model, we have used GABAB inhibition selectively to inhibit sustained nonoptimal input, as recently demonstrated experimentally (Allison, Kabara, Snider, Casagrande, & Bonds, 1996). Postsynaptic GABAB inhibition works through a slow, relatively small amplitude hyperpolarizing K+ conductance, which is believed to be primarily located on dendrites, in contrast to the proximal-somatic location of the fast, large GABAA Cl− conductance (Janigro & Schwartzkroin, 1988). Thus, the properties of GABAB inhibition are well suited to suppressing sustained, nonoptimal LGN input without large changes in input conductance.

2 Methods

Experimental evidence suggests that GABAA and GABAB inhibition in cortex may be provided by different classes of interneurons and that GABAB interneurons (Ib cells) are preferentially activated by feedforward inputs and GABAA (Ia) cells by feedback connections (Benardo, 1994; Solis, Isaacson, & Nicoll, 1992; Wilson & Blower, 1992). While this segregation may not be total, we have adopted this circuitry in our simplified model (see Figure 1). In layer 4 of cat V1, GABAA inhibition is likely provided by clutch cells, small basket cells that target the soma and proximal dendrites of spiny stellates (E cells) (Kisvarday, Martin, Whitteridge, & Somogyi, 1985). Two other types of interneuron form a minority in layer 4: GABAB inhibition might be provided by neurogliaform cells, which synapse on the distal dendrites of their target cells (Kisvarday et al., 1990). The other type, bitufted cells, have been shown to produce GABAA inhibitory postsynaptic potentials (IPSPs) in their targets (Deuchars & Thomson, 1995a, 1995b). Connections between cells in the model depend on the spatial correlation between their LGN inputs (Krukowski et al., 1996). The amplitude of the connection between two cells is proportional to the area of overlap of their respective receptive fields (RFs). Cells with nonoverlapping RFs are not connected. Overlapping RFs are either net same phase (ON/ON or OFF/OFF) or opposite phase (ON/OFF). Excitatory spiny stellates contact cells of the same spatial phase; clutch cells target spiny stellates of the opposite phase (Ferster, 1988) and also inhibit other inhibitory interneurons of the same phase. Ib cells target spiny stellates of the same phase. For a single spiny stellate, each LGN input is on average larger than an input from another spiny stellate, but the
Figure 1: Model circuit diagram. Excitatory synapses are represented by arrows, GABAA synapses by open circles, and GABAB synapses by filled circles. Relatively weak feedforward input to clutch cells is indicated by a dashed line.
total intracortical excitatory input is greater than the total LGN input due to the greater number of intracortical synapses (Stratford, Tarczy-Hornoch, Martin, Bannister, & Jack, 1996). Five hundred cells were simulated, which is the number in a column of layer 4 of diameter 160 µm (Peters & Payne, 1993). Each cell was a reduced model consisting of 7 to 9 compartments (Bush & Sejnowski, 1993); there were 400 regular-spiking spiny stellates, 84 fast-spiking clutch cells, and 16 adapting putative neurogliaform cells. Intrinsic Hodgkin-Huxley-like conductances producing these firing patterns were simulated as described previously (Bush & Sejnowski, 1994, 1996) (Ib cells: Ca2+ decay constant 100 ms, gKCa 2.5 mS/cm²). Each cell received 14 ON and 14 OFF inputs from an LGN spatiotemporal filter. These were arranged into an ON subfield of 2 rows of
7 inputs adjacent (left or right randomly chosen) to an OFF subfield, also containing 2 rows of 7 inputs, forming 2 rectangular subfields, each 90′ by 27.5′ (aspect ratio of 3.27; Jones & Palmer, 1987). The RF structure of all cells was identical. The rectangular grid arrangement of the LGN filter enforced a regularity in the LGN input pattern, but using less regular input patterns (connections deleted from the subfield, slight distortion of the rectangle) did not change the tuning of the model. The crucial parameter was the aspect ratio of the subfields, the tuning width being inversely proportional to this value. The model is dependent on this anisotropy in the inputs to provide orientation tuning. The orientations of the RFs had a mean of 0 degrees, SD 10 degrees, cutoff at 30 degrees (gaussian distribution). RFs were randomly positioned in an area 127.5′ by 127.5′. The LGN filter was the same as in Krukowski et al. (1996) and Troyer et al. (1998): briefly, an input pattern, a sine-wave grating or flashed bar, was convolved with a spatial filter based on a standard difference of gaussians and a temporal filter extracted from reverse correlation analysis of the center response, resulting in a rate code modified by the contrast sensitivity of LGN cells. Spikes for the LGN cells were generated from Poisson statistics of the rate code. LGN inputs to inhibitory cells were somatic (Kisvarday et al., 1990), each of amplitude 0.5 nS for Ia cells and 0.9 nS for Ib cells. Inputs to E cells were dendritic and of amplitude 1.5 nS. Intracortical connections were proportional to the strength of the spatial correlation of RFs; total connection density was 10%. There were six types of intracortical synaptic connection, with synaptic delays of 1.5 ± 0.6 ms (gaussian distribution). Amplitudes in nS, mean (range): E–E 1.24 (0.7–2.8), E–Ia 6.37 (3.5–14), Ia–E 4.86 (2.8–11.2), Ib–E (only weakly dependent on spatial correlation) 0.36 (0.22–0.76), Ia–Ib 0.79 (0.1–2.8), Ia–Ia 0.27 (0.14–0.56). Synaptic conductances were implemented using the SNS software (Lytton, 1996). Time to peak and reversal potentials were: glutamate, 1 ms, 0 mV; GABAA, 1.1 ms, −70 mV; GABAB, 85 ms, −90 mV. All simulations were run using NEURON (Hines, 1984) on a DEC ALPHA 250; a 1-second simulation took 20 minutes of CPU time.
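The LGN front end described above can be sketched compactly. The construction below follows that description but is ours: the temporal filter is omitted, and the grid size, difference-of-gaussians widths, and rate gain are arbitrary. A stimulus is convolved with the spatial filter, half-wave rectified into a rate map, and spikes are drawn from Poisson statistics of that rate.

```python
# Sketch: stimulus -> DoG spatial filter -> rectified rate -> Poisson spikes.
# The temporal filter is omitted; all sizes and gains are illustrative.
import numpy as np

rng = np.random.default_rng(1)
size, sig_c, sig_s = 64, 1.0, 3.0                # grid and DoG widths (pixels)
yy, xx = np.mgrid[-size // 2:size // 2, -size // 2:size // 2]
r2 = xx ** 2 + yy ** 2
dog = (np.exp(-r2 / (2 * sig_c ** 2)) / (2 * np.pi * sig_c ** 2)
       - np.exp(-r2 / (2 * sig_s ** 2)) / (2 * np.pi * sig_s ** 2))

sf = 4.0 / size                                  # cycles/pixel (4 cycles)
grating = np.cos(2 * np.pi * sf * xx)            # snapshot of a 0-deg grating

# circular convolution via FFT, then half-wave rectification into a rate map
resp = np.real(np.fft.ifft2(np.fft.fft2(grating)
                            * np.fft.fft2(np.fft.ifftshift(dog))))
rate = 40.0 * np.clip(resp, 0.0, None)           # Hz; ON-center responses

dt, steps = 0.001, 500                           # 1-ms bins, 500-ms stimulus
spikes = rng.random((steps,) + rate.shape) < rate * dt
print(f"mean rate across the sheet: {spikes.mean() / dt:.2f} Hz")
```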
3 Results and Discussion

In response to nonoptimally oriented drifting sine-wave gratings, Ib cells are activated by strong LGN inputs. The resulting slow hyperpolarization in the dendrites of E cells cancels the relatively small but sustained LGN input, preventing firing. Many E cells showed no change in membrane potential (Vm) to cross-orientation stimuli; some were hyperpolarized by a few millivolts (Berman et al., 1991). Mean control Vm was −56.7 mV; mean Vm for cross-orientation stimuli was −58.0 mV. Input conductance increased during nonoptimal stimulation by an average of 7%. The study of Berman et al. (1991) found increases of 15% or less in this situation. Weaker LGN inputs to Ia cells caused little firing to nonoptimal stimuli; instead, firing
of Ia cells was associated with the firing of E cells due to strong feedback connections. In response to optimal stimulation, strongly modulated LGN input causes large depolarizations in E cell dendrites that transiently break through the GABAB inhibition. Firing of E cells is then enhanced in three ways: (1) excitatory feedback from other E cells at the same phase (Douglas et al., 1995); (2) E cells activate Ia cells, which in turn inhibit Ib cells, disinhibiting the E cells; and (3) the GABAB K+ conductance is voltage dependent, turning off with depolarization (Satou, Mori, Tazawa, & Takagi, 1982; Newberry & Nicoll, 1985; Gage, 1992). The form of the voltage dependence was fitted to this current clamp data. The exact form of the dependence was not crucial to the performance of the model but had to be superlinear. Weaker voltage dependency, besides not matching the current clamp studies, required stronger Ia–Ib inhibition. With no voltage dependence, the GABAB conductance reduced E cell firing rates at optimal orientations even with strong Ia–Ib inhibition. It may seem that a GABAB conductance that inactivates due to depolarization turns off when it is most needed, but Figure 2 shows that GABAB IPSPs are maximal around resting potential, when they are needed to inhibit nonoptimal input, and significantly reduced at the depolarized potentials attained during optimal stimulation. The orientation tuning of the E cells in response to sustained stimulation, drifting sine-wave grating input, is shown in Figure 3. Half-width at half-height for 5%, 10%, and 50% contrast was 23.1, 19.2, and 20.3 degrees, respectively, showing no dependency on contrast (Sclar & Freeman, 1982) and close to the experimental mean of 19.5 degrees (Heggelund & Albus, 1978). Clutch cells had similar tuning with a mean half-width of 25.4 degrees. The Ib cells, despite being located in the same orientation column as the other cells, were activated at the cross-orientation due to strong feedforward inputs. At the optimal orientation, inhibition from Ia cells reduced the firing rates of the Ib cells, giving them approximately flat tuning curves; that is, they fired the same number of spikes at each orientation. With weaker Ia–Ib connections, the Ib cells displayed more conventional tuning curves with the same optimal orientation as the other cells. The mean optimal firing rate of the clutch cells was 120 Hz; thus, the mean optimal rate for all simple cells (excitatory plus inhibitory) was 28.0 Hz, in agreement with Albrecht (1995). The spontaneous firing rate was 0.1 Hz for E cells and 1 Hz for Ia cells. Figure 3 also shows tuning curves for 50% contrast stimuli with GABAA or GABAB conductances blocked. GABAA block increases the suprathreshold response to optimal and suboptimal stimuli, increasing the tuning half-width by 27%, close to the experimental increase of 33% resulting from GABAA block of complex cells in response to sustained sine-wave stimulation (Pfleger & Bonds, 1995). Blocking GABAB in the model produced a half-width increase of 32%, close to the experimental increase of 29% seen in simple cells after GABAB blockade (Allison et al., 1996).
Figure 2: GABAB K+ conductance (gK) was reduced by depolarization above a membrane potential (Vm) of −60 mV, according to gK = gK · (Vm/−60)³. (Left) Amplitude of 2 nS GABAB IPSP recorded in dendrite of E cell at various holding potentials (produced by current injection). (Right) Three traces from E cell dendrite during stimulation at 0 degree, no stimulation, and 90 degrees. GABAB inhibition is maximal when the membrane potential is around rest and significantly reduced for optimal stimulation.
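Read literally, the caption's voltage dependence can be written as a clipped cubic scale factor on gK. The sketch below is our own illustration (the clamp to 1 below −60 mV is our reading of the caption, and all names are hypothetical), not the authors' simulation code:

```python
def gabab_gk_scale(vm_mv, v_half=-60.0, power=3):
    """Scale factor on the GABAB K+ conductance at membrane potential vm_mv.

    Implements gK -> gK * (Vm / -60)^3 above -60 mV (Figure 2); below
    -60 mV the conductance is assumed to be unreduced (factor 1).
    """
    if vm_mv <= v_half:
        return 1.0
    return (vm_mv / v_half) ** power

# The superlinear falloff: near rest the IPSP keeps its full size,
# while at depolarized potentials it shrinks sharply.
for vm in (-70.0, -60.0, -50.0, -40.0):
    print(vm, round(gabab_gk_scale(vm), 3))
```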
injection of Cs+ ions, which are known to block the GABAB K+ conductance (Jarolimek, Bijak, & Misgeld, 1994), produces an increased response to cross-orientation stimulation (Ferster, 1987). The increase in response to cross-orientation stimulation shown by the model after GABAB block was greater than that seen in the experimental data (Allison et al., 1996). This may be due to a characteristic of the intrinsic firing properties of neurogliaform cells that was not included in the model: In response to relatively large inputs, neurogliaform cells fire continuously, but in response to small or threshold inputs, a single spike occurs followed by a quiescent period (Hestrin & Armstrong, 1996). This property strongly reduces spiking responses to small inputs, which in our model would reduce the effect of GABAB at cross-orientations. Another possible reason for the small effect of GABAB block on cross-orientation responses seen experimentally is the choice of blocker used. Allison et al. used 2-hydroxy-saclofen, whereas the phosphinic analogues of GABA now available are thousands of times more
GABAergic Inhibitory Control
861
15.0
50% contrast 10% contrast 5% contrast GABAa block GABAb block
Firing rate (Hz)
10.0
5.0
0.0 0.0
20.0
40.0 60.0 Orientation (degrees)
80.0
100.0
Figure 3: Orientation tuning of E cell responses to sine-wave grating of 1 cycle/degree drifting at 2.5 deg s−1 . The lower three traces show responses to 5%, 10%, and 50% contrast gratings. The upper two traces show responses to 50% contrast gratings with all GABAA or GABAB conductances reduced to 10% of control values. All data points are the mean of nine trials. Standard error bars were the same size as the graphing symbols and have been omitted for clarity.
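The half-widths quoted above can be extracted from sampled tuning curves like this one. The utility below is our own sketch, not code from the study; placing the peak at the first sample and interpolating linearly between sampled orientations are our simplifying assumptions:

```python
import numpy as np

def half_width_at_half_height(orientations_deg, rates_hz):
    """Half-width at half-height of an orientation tuning curve.

    Assumes a positive peak rate at the first sample; returns the
    orientation offset at which the rate first falls to half the peak,
    using linear interpolation between samples.
    """
    thetas = np.asarray(orientations_deg, dtype=float)
    rates = np.asarray(rates_hz, dtype=float)
    half = rates[0] / 2.0
    below = np.nonzero(rates <= half)[0]
    if below.size == 0:
        return float("nan")  # curve never falls to half height
    i = below[0]
    t0, t1, r0, r1 = thetas[i - 1], thetas[i], rates[i - 1], rates[i]
    return t0 + (half - r0) * (t1 - t0) / (r1 - r0)

# Hypothetical tuning curve sampled every 10 degrees from the optimum:
print(half_width_at_half_height([0, 10, 20, 30, 40],
                                [12.0, 9.0, 5.5, 2.0, 0.5]))
```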
potent (Mott & Lewis, 1994) and might be expected to produce a much larger effect. For transient stimuli such as the onset of a flashed bar, the slow time course of GABAB inhibition rules out a role in the control of orientation selectivity. Instead, fast GABAA inhibition must be utilized. Figure 4 shows E cell responses to flashed bars of varying orientation. Responses consist of an initial excitatory postsynaptic potential (EPSP) followed, at optimal orientations, by an oscillatory series of EPSPs, as in the data of Volgushev, Vidyasagar, and Xing (1995), though at a slightly lower frequency. The initial burst is due to the large transient response of the LGN cells, which also activates inhibitory cells. The resulting (primarily GABAB) inhibition truncates the burst and causes a hyperpolarization. This behavior is similar to that seen in response to an electrical shock stimulus. However, the continued presence of the bar slightly but significantly depolarizes the cells, and as the hyperpolarization decays, further firing occurs. Recurrent excitatory connections synchronize the firing of the E cells, which at this point are all approximately back at threshold. Inhibitory cells are once again recruited, and the oscillatory cycle begins again. As in the experimental data, the initial response is less orientation tuned, but this is not due to weak tuning of the LGN inputs; rather, it is due to the large transient response of the LGN cells to stimulus onset. This transient causes an EPSP in the E cells at all orientations but is also large enough to activate Ia cells directly. Their GABAA inhibition truncates the EPSP at nonoptimal orientations (see Figures 4D,E), according to the mechanism demonstrated theoretically as cross-phase (Krukowski et al., 1996; Troyer et al., 1998) and experimentally as spatially opponent inhibition (Ferster, 1988). In this scheme optimally oriented bars produce little GABAA inhibition because only clutch cells targeting spiny stellates of the opposite phase are activated. Nonoptimally oriented bars activate inhibitory cells over all phases (reaching a maximum at the cross-orientation), giving inhibition to all spiny stellates in the corresponding region. This can be shown by reducing the GABAA conductance (see Figures 4F–J). Initial EPSPs now become broader (mean width increases 65%) and cause more spiking, while succeeding EPSPs are not strongly affected. GABAA block does not cause sustained firing at nonoptimal orientations. A large increase in the initial transient (but not the sustained) response at all orientations after GABAA block has been seen experimentally in complex cells (Pfleger & Bonds, 1995). In this model, sharp orientation tuning is provided by the LGN inputs (Chapman et al., 1991; Reid & Alonso, 1995; Ferster et al., 1996) rather than cortically produced by a combination of lateral inhibition and amplification (Somers et al., 1995; Douglas et al., 1995). However, the combination of recurrent inhibition and excitation in the model does have a strong effect on the output distribution. Figure 5 shows that the strongly peaked (low variance) distribution obtained from pure feedforward circuitry is broadened by the addition of intracortical synapses. The action of the cortical circuitry is decorrelative, reducing the firing rate of some cells and increasing the firing rate of others to the same (optimal) stimulus. The distributions obtained for suboptimal stimuli were approximately exponential, most cells not firing at all or at low rates, with a small population firing at high rates. In general, the majority of stimuli activating a cell will be suboptimal, away from the peak of the tuning curve. It is these parts of the tuning curve that determine the threshold of discrimination, where the slope is steepest (Lehky & Sejnowski, 1990). The exponential firing rate distribution has been noted as one that minimizes energy usage by the brain (Baddeley, 1996). More generally, a set of high kurtosis (e.g., exponential) distributions, each composed of different cells from a single population, in response to a diverse set of input features, would constitute a sparse representation. It has been shown previously that coding strategies that maximize sparseness produce local,
Figure 4: Response of model spiny stellates to long bar of width 0.5 degree flashed on and left on for 500 ms. Response for each orientation (value on left) shows a single trace (spikes truncated) above and an averaged trace (30 cells) below. A–E: control responses. F–J: responses with all GABAA conductances reduced to 10% of control values.
oriented, bandpass RFs of the type used in our simulations (Olshausen & Field, 1996). Sparse representations are more efficient because they possess a higher degree of statistical independence among their outputs (Olshausen & Field, 1996). To show this in our model would require increasing the diversity of RFs to include different spatial frequencies, direction selectivities, and so forth and is outside the scope of this study, but our results are at least suggestive of sparse coding. If these results are generalizable to other cortical layers and areas, it may be that stimulus selectivity is given by the spatiotemporal structure in the inputs, with GABAB inhibition used to cancel nonoptimal patterns of input. The cortical amplifier, in the form of recurrent excitation, in combination with intracortical inhibition, is not acting simply to amplify optimal and suppress nonoptimal LGN input. Instead, the excitatory and inhibitory circuits work together to enforce competition, selectively amplifying the output of some cells while suppressing others with approximately the same tuning to produce an efficient output representation.
Figure 5: Output distributions of model in response to sine-wave grating input (as Figure 3). Distribution obtained in response to optimal input (50% contrast, 0 degree) with no cortical circuitry (feedforward only) is highly peaked with low variance. Addition of cortical circuitry causes some decorrelation of firing rates, while suboptimal input (30 degrees oriented grating or 5% contrast) produces exponential output distributions.
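To make the link between exponential output distributions and sparseness concrete, the following check (our own illustration; sample sizes and rate scales are arbitrary) compares excess kurtosis, which is near 0 for gaussian samples and near 6 for exponential ones, the "high kurtosis" signature discussed above:

```python
import numpy as np

def excess_kurtosis(x):
    """Excess kurtosis: 0 for a gaussian, 6 for an exponential (in theory)."""
    z = np.asarray(x, dtype=float)
    z = z - z.mean()
    return (z**4).mean() / (z**2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
gaussian_rates = rng.normal(10.0, 2.0, size=100_000)    # peaked, low variance
exponential_rates = rng.exponential(3.0, size=100_000)  # most cells near zero

print("gaussian   :", round(excess_kurtosis(gaussian_rates), 2))
print("exponential:", round(excess_kurtosis(exponential_rates), 2))
```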
Acknowledgments We thank R. Douglas and K. Martin for the use of their digitized spiny stellate data and R. C. Reid for LGN reverse correlation data. Thanks to A. Krukowski, T. Troyer, and K. Miller for useful discussions. This work was done in the lab of Kenneth D. Miller at UCSF, supported by a Whitaker Foundation Biomedical Engineering Research Grant to K. D. M. References Albrecht, D. G. (1995). Visual cortex neurons in monkey and cat: Effect of contrast on the spatial temporal phase transfer functions. Vis. Neurosci., 12, 1191–1210. Allison, J. D., Kabara, J. F., Snider, R. K., Casagrande, V. A., & Bonds, A. B. (1996). GABAB -receptor-mediated inhibition reduces the orientation selectivity of
the sustained response of striate cortical neurons in cats. Vis. Neurosci., 13, 559–566. Baddeley, R. (1996). An efficient code in V1? Nature, 381, 560–561. Benardo, L. S. (1994). Separate activation of fast and slow inhibitory postsynaptic potentials in rat neocortex in vitro. J. Physiol., 476(2), 203–215. Berman, N. J., Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). Mechanisms of inhibition in cat visual cortex. J. Physiol., 440, 697–722. Bush, P. C., & Sejnowski, T. J. (1993). Reduced compartmental models of neocortical pyramidal cells. J. Neurosci. Meth., 46, 159–166. Bush, P. C., & Sejnowski, T. J. (1994). Effects of inhibition and dendritic saturation in simulated neocortical pyramidal cells. J. Neurophysiol., 71, 2183–2193. Bush, P. C., & Sejnowski, T. J. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comp. Neurosci., 3, 91–110. Chapman, B., Zahs, K. R., & Stryker, M. P. (1991). Relation of cortical cell orientation selectivity to alignment of receptive fields of the geniculocortical afferents that arborize within a single orientation column in ferret visual cortex. J. Neurosci., 11, 1347–1358. Deuchars, J., & Thomson, A. M. (1995a). Innervation of burst firing spiny interneurons by pyramidal cells in deep layers of rat somatosensory cortex: Paired intracellular recordings with biocytin filling. Neurosci., 69(3), 739–755. Deuchars, J., & Thomson, A. M. (1995b). Single axon fast inhibitory postsynaptic potentials elicited by a sparsely spiny interneuron in rat neocortex. Neurosci., 65(4), 935–942. Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269(5226), 981–985. Ferster, D. (1987). Origin of orientation-selective EPSPs in simple cells of cat visual cortex. J. Neurosci., 7, 1780–1791. Ferster, D. (1988). Spatially opponent excitation and inhibition in simple cells of the cat visual cortex. J. Neurosci., 8, 1172–1180. Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380(6571), 249–252. Gage, P. W. (1992). Activation and modulation of neuronal K+ channels by GABA. TINS, 15(2), 46–51. Heggelund, P., & Albus, K. (1978). Orientation selectivity of single cells in striate cortex of cat: The shape of orientation tuning curves. Vision Res., 18, 1067–1071. Hestrin, S., & Armstrong, W. E. (1996). Morphology and physiology of cortical neurons in layer 1. J. Neurosci., 16, 5290–5300. Hines, M. (1984). Efficient computation of branched nerve equations. Int. J. Bio-Medical Computing, 15, 69–76. Janigro, D., & Schwartzkroin, P. A. (1988). Effects of GABA on CA3 pyramidal cell dendrites in rabbit hippocampal slices. Brain Res., 453, 265–274. Jarolimek, W., Bijak, M., & Misgeld, U. (1994). Differences in the Cs+ block of baclofen and 4-aminopyridine induced potassium currents of guinea pig CA3 neurons in vitro. Synapse, 18(3), 169–177.
Jones, J. P., & Palmer, L. A. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58, 1233–1258. Kisvarday, Z. F., Gulyas, A., Beroukas, D., North, J. B., Chubb, I. W., & Somogyi, P. (1990). Synapses, axonal and dendritic patterns of GABA-immunoreactive neurons in human cerebral cortex. Brain, 113(3), 793–812. Kisvarday, Z. F., Martin, K. A. C., Whitteridge, D., & Somogyi, P. (1985). Synaptic connections of intracellularly filled clutch cells: A type of small basket cell in the visual cortex of the cat. J. Comp. Neurol., 241, 111–137. Krukowski, A. E., Priebe, N. J., & Miller, K. D. (1996). A model of simple-cell orientation tuning: Feedforward tuning and correlation-based intracortical connectivity. Soc. Neurosci. Abstr., 22(1). Lehky, S. R., & Sejnowski, T. J. (1990). Neural model of stereoacuity and depth interpolation based on a distributed representation of stereo disparity. J. Neurosci., 10(7), 2281–2299. Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neur. Comp., 8, 501–510. Martin, K. A., & Whitteridge, D. (1984). Form, function and intracortical projections of spiny neurones in the striate visual cortex of the cat. J. Physiol., 353, 463–504. Mott, D. D., & Lewis, D. V. (1994). The pharmacology and function of central GABAB receptors. Int. Rev. Neurobiol., 36, 97–223. Newberry, N. R., & Nicoll, R. A. (1985). Comparison of the action of baclofen with GABA on rat hippocampal pyramidal cells in vitro. J. Physiol., 360, 161–185. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609. Pei, X., Vidyasagar, T. R., Volgushev, M., & Creutzfeldt, O. D. (1994). Receptive field analysis and orientation selectivity of postsynaptic potentials of simple cells in cat visual cortex. Journal of Neuroscience, 14, 7130–7140. Peters, A., & Payne, B. R. (1993). Numerical relationships between geniculocortical afferents and pyramidal cell modules in cat primary visual cortex. Cereb. Cortex, 3, 69–78. Pfleger, B., & Bonds, A. B. (1995). Dynamic differentiation of GABAA -sensitive influences on orientation selectivity of complex cells in the cat striate cortex. Exp. Brain Res., 104(1), 81–88. Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378(6554), 281–284. Satou, M., Mori, K., Tazawa, Y., & Takagi, S. F. (1982). Two types of postsynaptic inhibition in pyriform cortex of the rabbit: Fast and slow inhibitory postsynaptic potentials. J. Neurophysiol., 48(5), 1142–1156. Sclar, G., & Freeman, R. D. (1982). Orientation selectivity in the cat’s striate cortex is invariant with stimulus contrast. Exp. Brain Res., 46, 457–461. Solis, J., Isaacson, J., & Nicoll, R. A. (1992). Functional role for postsynaptic GABAB receptors in the hippocampus: A reevaluation. Pharm. Comm., 2(1– 2), 32–37.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15(8), 5448–5465. Stratford, K. J., Tarczy-Hornoch, K., Martin, K. A. C., Bannister, N. J., & Jack, J. J. B. (1996). Excitatory synaptic inputs to spiny stellate cells in cat visual cortex. Nature, 382, 258–261. Troyer, T. W., Krukowski, A. E., Priebe, N. J., & Miller, K. D. (1998). Contrast-invariant orientation tuning in visual cortex: Feedforward tuning and correlation-based intracortical connectivity. Unpublished manuscript. Volgushev, M., Vidyasagar, T. R., & Xing, P. (1995). Dynamics of the orientation tuning of postsynaptic potentials in the cat visual cortex. Vis. Neurosci., 12, 621–628. Wilson, M., & Blower, J. M. (1992). Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. J. Neurophysiol., 67(4), 981–995. Received March 3, 1997; accepted September 25, 1997.
LETTER
Communicated by William Skaggs
Changes in GABAB Modulation During a Theta Cycle May Be Analogous to the Fall of Temperature During Annealing Vikaas S. Sohal Michael E. Hasselmo Department of Psychology and Program in Neuroscience, Harvard University, Cambridge, MA 02138, U.S.A.
Changes in GABAB modulation may underlie experimentally observed changes in the strength of synaptic transmission at different phases of the theta rhythm (Wyble, Linster, & Hasselmo, 1997). Analysis demonstrates that these changes improve sequence disambiguation by a neural network model of CA3. We show that in the framework of Hopfield and Tank (1985), changes in GABAB suppression correspond to changes in the effective temperature and the relative energy of data terms and constraints of an analog network. These results suggest that phasic changes in the activity of inhibitory interneurons during a theta cycle may produce dynamics that resemble annealing. These dynamics may underlie a role for the theta cycle in improving sequence retrieval for spatial navigation. 1 Introduction Manipulations that abolish the hippocampal theta rhythm impair spatial navigation (Winson, 1978), but the computational function of theta is not known. Previous experiments have found that the amplitude of evoked field potentials due to neuronal spiking activity changes during different phases of the theta rhythm (Rudell, Fox, & Rank, 1980; Buszaki, Grastyan, Czopf, Kellenyi, & Prohaska, 1981; Rudell & Fox, 1984). This could result from changes in presynaptic and/or postsynaptic inhibition during a theta cycle. More recent evidence suggests that the strength of synaptic transmission also depends on the phase of theta rhythm (Wyble et al., 1997), another possible indication of phasic changes in presynaptic inhibition. Phasic changes in presynaptic inhibition could result from the rhythmic activation of presynaptic GABAB receptors, which selectively suppress recurrent synaptic transmission in piriform cortex (Tang & Hasselmo, 1994) and hippocampus (Ault & Nadler, 1982; Colbert & Levy, 1992). Stewart and Fox (1990) have hypothesized, and in vivo recordings have shown (Buszaki & Eidelberg, 1983; Fox et al., 1986; Skaggs, McNaughton, Wilson, & Barnes, 1996) that some interneurons tend to fire near a preferred phase of theta. Phasic firing by even a subset of interneurons may rhythmically activate presynaptic GABAB receptors, producing the changes in synaptic strength Neural Computation 10, 869–882 (1998)
© 1998 Massachusetts Institute of Technology
observed in vivo (Wyble et al., 1997). Paired pulse depression, thought to result from activation of presynaptic GABAB receptors on interneurons, has a rise time and fast component of decay (Otis, Dekoninck, & Mody, 1993) that are compatible with theta frequency oscillations in GABAB -mediated synaptic suppression. These rhythmic changes could be important for retrieval of sequences of activity in the hippocampus. Evidence suggests that the hippocampus may represent a series of adjacent locations by a sequence of successively firing place cells (Tsodyks, Skaggs, Sejnowski, & McNaughton, 1996; Jensen & Lisman, 1996; Wallenstein & Hasselmo, 1997). This evidence includes the phenomenon of theta phase precession: during a theta cycle, place cells fire in the order in which their place fields are encountered along a path (O’Keefe & Recce, 1993; Skaggs et al., 1996). In a neural network model of hippocampal region CA3, we find that changes in GABAB suppression during a theta cycle improve sequence disambiguation: the retrieval of sequences with identical starting points. We show that these changes in the level of GABAB modulation (within a theta cycle) correspond to changes in (1) the effective temperature and (2) the relative energy of data terms and constraints of an analog network (using the terminology of Hopfield & Tank, 1985). Data terms and constraints correspond to afferent input and recurrent connections, respectively, in our model. 2 Analysis 2.1 The Problem of Sequence Disambiguation. When two sequences with the same starting components, {X1 , . . . , Xm } and {X1 , . . . , Xi , Yi+1 , . . . , Ym }, are stored in a network, it is ambiguous whether Xi should be followed by Xi+1 or Yi+1 . The only information that can resolve this ambiguity is knowledge of the desired end point, Xm or Ym . If the patterns of activity are neural representations of locations, then disambiguating forked sequences using knowledge of the desired end point corresponds to the everyday problem of deciding which way to turn at an intersection based on one’s destination. 2.2 A Model of Sequence Disambiguation. The simplest neural realization of this problem, shown in Figure 1, contains two sequences: {a1 , a2 } and {a1 , a3 }. When a1 becomes active, the pattern of activity that should follow a1 is ambiguous. We assume that during disambiguation of the sequences {a1 , a2 } and {a1 , a3 }, a2 receives a small amount of afferent input. This input could arrive in CA3 from the dentate gyrus via the mossy fibers or from the entorhinal cortex via the perforant path. It represents the knowledge that a2 is the desired goal and should bias the network so that it completes the sequence {a1 , a2 } rather than {a1 , a3 }. This assumption predicts that place cells representing the location of a goal should receive some biasing input
Figure 1: (A) The simplest neural representation of a forked sequence, composed of the sequences {a1, a2} and {a1, a3}. a1 represents the point of the “fork,” and a2 and a3 represent the “branches” of the forked sequence. (B) The connectivity in a simple network that has stored this forked sequence.
while a rat searches for that goal. With strong enough biasing input to a2, the network can easily disambiguate {a1, a2} from {a1, a3}. Here we focus on the optimal characteristics of a network that can perform this disambiguation with weak bias. We study such a simplified network because the decision to activate a2 selectively but not a3 corresponds to the decision to activate neurons representing Xi+1 but not those representing Yi+1 during the previously described sequence disambiguation problem. Elsewhere (Sohal & Hasselmo, 1997) we show that the conditions that optimize the decision to activate a2 selectively in this simplified network can both be produced by GABAB receptors with
biologically plausible dynamics and significantly improve disambiguation of multiple sequences, composed of many items each, in more complex networks. The network parameters, shown in Figure 1B, are: A, the strength of excitation from a1 to a2 and a3; Abias, afferent input to a2; A′, afferent input to the inhibitory interneuron; W, the strength of recurrent excitatory connections from a2 to a2 and a3 to a3; W′, the strength of excitatory connections from a2 and a3 to the interneuron; −H, the strength of the inhibitory connections from this interneuron to a2 and a3; h, activation of the model interneuron; η, the rate of passive decay for the activities, a2, a3, and h; and θ, the threshold of the neurons represented by a2, a3, and h. The parameters k(t) and k′(t) represent the GABAB suppression of recurrent synaptic transmission. All recurrent excitatory connections in the network (A, W, and W′) are multiplied by k(t), so [1 − k(t)] represents the suppression of recurrent excitation at time t. Similarly, k′ multiplies recurrent inhibition in the network (H), so [1 − k′(t)] represents the suppression of recurrent inhibition. The activation of each model neuron is a continuous variable representing the mean membrane potential of a pool of real neurons. The equations for the evolution of a2, a3, and h have been derived by averaging over short timescales and many neurons, and are similar to those used elsewhere (Wilson & Cowan, 1972):

ȧ2 = −ηa2 + kA + Abias + kW(a2 − θ)+ − k′H(h − θ)+
ȧ3 = −ηa3 + kA + kW(a3 − θ)+ − k′H(h − θ)+
ḣ = −ηh + A′ + kW′(a2 − θ)+ + kW′(a3 − θ)+,   (2.1)

where (x)+ = x if x > 0 and 0 otherwise. As long as Abias > 0, a2 will cross threshold before a3, giving a2 > θ and a3 < θ. Assume the interneuron will be activated before a3 > θ (we have found that this is the case for networks that perform sequence disambiguation). Then the activities are in the regime a2 > θ, a3 < θ, and h > θ, in which the network dynamics reduce to the linear system:

ȧ2 = (kW − η)a2 − k′Hh + kA + Abias + k′Hθ
ȧ3 = −ηa3 − k′Hh + kA + k′Hθ
ḣ = −ηh + kW′a2 + A′ − kW′θ.

We measured the performance of sequence disambiguation by this system as a2(t_final) − a3(t_final), where t_final is the time at which sequence disambiguation ends. Thus, a2(t_final) − a3(t_final) measures the differential completion of the two sequences, {a1, a2} and {a1, a3}, at the end of sequence disambiguation.
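As a worked illustration of equation 2.1, the sketch below integrates the threshold-linear system with forward Euler for constant k and k′ and reports a2(t_final) − a3(t_final). It is our own code; every parameter value is a placeholder chosen for illustration, not a value from the paper:

```python
def relu(x):
    """(x)+ = x if x > 0 and 0 otherwise."""
    return x if x > 0.0 else 0.0

def disambiguation_score(k, kp, A=1.0, Abias=0.3, Ap=0.2, W=1.5, Wp=1.0,
                         H=2.0, eta=0.3, theta=1.0, dt=0.1, t_final=50.0):
    """Euler-integrate equation 2.1; return a2(t_final) - a3(t_final)."""
    a2 = a3 = h = 0.0
    for _ in range(int(t_final / dt)):
        da2 = -eta*a2 + k*A + Abias + k*W*relu(a2 - theta) - kp*H*relu(h - theta)
        da3 = -eta*a3 + k*A         + k*W*relu(a3 - theta) - kp*H*relu(h - theta)
        dh  = -eta*h + Ap + k*Wp*relu(a2 - theta) + k*Wp*relu(a3 - theta)
        a2, a3, h = a2 + dt*da2, a3 + dt*da3, h + dt*dh
    return a2 - a3

# Differential completion of the two sequences at the end of the task:
print(disambiguation_score(k=0.5, kp=0.5))
```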
2.3 Optimal Network Parameters During Sequence Disambiguation. We found the k(t) and k′(t) that maximized the quantity a2(t_final) − a3(t_final) using the maximum principle (Pontryagin, Boltyanskii, Gamkrelidze, & Mishchenko, 1962). The maximum principle determines necessary conditions on the functions k(t) and k′(t) to maximize the difference a2(t_final) − a3(t_final). These conditions uniquely specify k(t) and k′(t):

k(t) = kmin if t < t_final − t1, and kmax if t > t_final − t1,
where t1 = π / √(4HW′k′max kmax − W²k²max);

k′(t) = k′min if t < t_final − t1 − t2, and k′max if t > t_final − t1 − t2,
where t2 = π / √(4HW′k′max kmin − W²k²min).

This result is true as long as:

t_final > 0
0 ≤ kmin ≤ k(t) ≤ kmax ≤ 1
0 ≤ k′min ≤ k′(t) ≤ k′max ≤ 1.

(The quantities inside the square roots are negative if and only if all eigenvalues of the linearized system are real. We assume that damped oscillations, produced by complex eigenvalues, represent the typical behavior of hippocampal neurons and therefore confine our analysis to this regime.) Thus, optimal sequence disambiguation always occurs when the suppression of recurrent excitation and inhibition are step functions that decrease from their maximum to their minimum values during the sequence disambiguation task. The maximum principle also shows that Abias should be fixed at its maximum value for optimal sequence disambiguation. Note that this optimality does not depend on the specific values of kmax, kmin, or t_final.
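Under our reconstruction of these expressions, the switch offsets and the resulting step schedules can be computed directly. The values below are placeholders chosen only to keep the square-root arguments positive (the damped-oscillation regime the text assumes):

```python
import math

def switch_offsets(H, W, Wp, k_min, k_max, kp_max):
    """t1 and t2 from the reconstructed formulas above."""
    t1 = math.pi / math.sqrt(4*H*Wp*kp_max*k_max - (W*k_max)**2)
    t2 = math.pi / math.sqrt(4*H*Wp*kp_max*k_min - (W*k_min)**2)
    return t1, t2

def k_schedule(t, t_final, t1, k_min, k_max):
    """Optimal step: k sits at k_min, then jumps to k_max at t_final - t1."""
    return k_min if t < t_final - t1 else k_max

t1, t2 = switch_offsets(H=2.0, W=1.5, Wp=1.0, k_min=0.2, k_max=0.9, kp_max=0.9)
print("t1 =", round(t1, 2), " t2 =", round(t2, 2))
```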
of the system described above and studied how that final state changed if the suppression of recurrent synaptic transmission decreased linearly while that system disambiguated two sequences. k(t) and k′(t) took the form:

k(t) = k′(t) = αt + β, where α · (100 msec) + β = 1

(100 msec represents about one-half of a theta cycle, and α measures the amount by which the suppression of recurrent synaptic transmission decreases during the sequence disambiguation task). If a1 is active and a2 is receiving biasing input, then the system has only two possible final states: for low Abias, both a2 and a3 are active, whereas for Abias sufficiently large, only a2 is active in the final state, so sequence disambiguation is successful. We found the minimum Abias necessary for successful sequence disambiguation as a function of α. The results, shown in Figure 2, demonstrate that increasing α, corresponding to a greater change in the suppression of recurrent synaptic transmission, allows successful sequence disambiguation using weaker biasing inputs, that is, using less information about the desired goal. This confirms that biologically plausible changes, such as linear decreases, in the suppression of synaptic transmission can also significantly improve sequence disambiguation. In simulations (Sohal & Hasselmo, 1997), theta frequency septal input to interneurons entrains oscillations in GABAB receptor activation such that GABAB suppression falls during the phase of pyramidal cell activity. Consistent with the preceding analysis, this decrease improves sequence disambiguation. Furthermore, the tendency toward preferred phases of interneuron firing in these simulations is similar to that observed in vivo (Skaggs et al., 1996). Thus, oscillations in GABAB suppression, which may underlie experimentally observed changes in the strength of synaptic transmission (Wyble et al., 1997), may also optimize sequence disambiguation. Alternatively, synaptic depression due to previous transmission (Markram & Tsodyks, 1996) may underlie the observed changes in synaptic strength, although this form of depression has been found only in neocortex, and hippocampal pyramidal cells are more likely to show facilitation than depression. 3 Relation to Annealing In our model, GABAB suppression is initially high during sequence disambiguation. As a result, recurrent excitation and inhibition are weak, so neurons representing multiple sequences are active. However, as GABAB suppression falls, recurrent excitation and inhibition become stronger, so that fewer neurons, representing only one sequence, are active while the remaining neurons are inhibited. Thus, as the level of GABAB modulation falls, the network shifts from sampling many possible states to selecting the
Figure 2: The minimum biasing input, Abias , required for successful sequence disambiguation as a function of α. In every case, the level of GABAergic modulation reaches the same value at the end of the sequence disambiguation task, but during the task, the level of GABAergic modulation decreases with slope α. Thus, α measures the rate at which the level of GABAergic modulation, and hence the suppression of recurrent synaptic transmission, declines during the sequence disambiguation task. This figure shows that greater rates of decrease in GABAergic modulation permit successful sequence disambiguation with weaker biasing inputs.
optimal state. A similar shift also occurs during annealing as the temperature of a system decreases. The system samples states of varying energy at high temperatures, but selects the minimum energy state at low temperatures. This suggests that a falling level of GABAB modulation may improve
sequence disambiguation in the same way that a decreasing temperature drives annealing. Hopfield and Tank (1985) showed that the equilibrium solution in an analog network equals the effective field solution for a Boltzmann machine (Hinton & Sejnowski, 1986) at a temperature determined by the gain width. Similarly, the equilibrium solution to our network is the same as the effective field solution in a Boltzmann machine, and the fall in GABAB modulation that occurs in our network is equivalent to decreasing both the effective temperature and the energy of data terms relative to constraints. Consider the energy function:

E = ρexc[−(W/2)((a2 − θ)²+ + (a3 − θ)²+) − A(a2 + a3)] + ρinh W′H(a2a3) − ρaff Abias a2,   (3.1)
where ρexc, ρinh, and ρaff measure the relative contributions of recurrent excitation, recurrent inhibition, and afferent input, respectively, to the energy. Following Hopfield and Tank (1985), we find the effective field solution for a Boltzmann machine in which the ai are Ising spins restricted to the values ai = amax or amin, the temperature is τ, and the energy, E, is given by equation 3.1. Rescale variables so that amin = 0 < θ < amax = 1. Then the expected value for ai is:

⟨ai⟩ = e^(Hi/τ) / (1 + e^(Hi/τ)),   i = 2 or 3,   (3.2)
where:

H2 = ρexc[−A − W(a2 − θ)+] + ρinh HW′a3 − ρaff Abias
H3 = ρexc[−A − W(a3 − θ)+] + ρinh HW′a2.

Equation 3.2 reduces to

⟨ai⟩ ≅ Hi/τ   (3.3)

if

Hi ≪ τ.   (3.4)
The condition in equation 3.4 is true when the amax state is far from the saturating regime of the equation, that is, when the mean activity hai i still increases linearly with synaptic input. This is true (Barkai & Hasselmo, 1994) over the range of firing frequencies observed in vivo (O’Keefe & Recce, 1993; Skaggs et al., 1996).
Now consider a simplification of the neural network shown in Figure 1 in which the interneuron is replaced by linear inhibition:

ȧ2 = −ηa2 + kA + Abias + kW(a2 − θ)+ − kk′HW′a3
ȧ3 = −ηa3 + kA + kW(a3 − θ)+ − kk′HW′a2.   (3.5)
The steady-state solution to equation 3.5 is:

a2 = (k/η)[(1/k)Abias + A + W(a2 − θ)+ − k′HW′a3]
a3 = (k/η)[A + W(a3 − θ)+ − k′HW′a2].   (3.6)
The steady-state solution for this network (see equation 3.6) equals the effective field solution (see equation 3.3) for which the temperature, τeff, equals η/k and the relative energies of afferent input, recurrent excitation, and recurrent inhibition are given by:

ρaff = τeff/η = 1/k,   ρexc = 1,   ρinh = k′   (assuming k′ ≈ k).
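Taking this reconstructed mapping at face value, a rise in k within a theta cycle lowers the effective temperature and the relative energy of the data terms together; a brief numerical reading (η is an arbitrary placeholder):

```python
eta = 0.1                         # passive decay rate (placeholder)
for k in (0.2, 0.5, 0.9):         # GABA-B suppression easing within a cycle
    tau_eff = eta / k             # effective temperature: falls as k rises
    rho_aff = 1.0 / k             # relative energy of data terms: falls too
    print(f"k={k:.1f}  tau_eff={tau_eff:.3f}  rho_aff={rho_aff:.2f}")
```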
Falling GABAB modulation produces a rise in k and is therefore equivalent to annealing, in which there is a simultaneous fall in both the temperature and the energy of afferent input relative to recurrent excitation and inhibition. 3.1 Why Do the Relative Energies of Afferent and Recurrent Inputs Change? Because activation of GABAB receptors selectively suppresses recurrent but not afferent connections (Ault & Nadler, 1982; Colbert & Levy, 1992), the energy of afferent input relative to recurrent excitation and inhibition decreases as the level of GABAB modulation falls. Two analyses showed that the selectivity of GABAB suppression optimizes network performance. First, by applying the maximum principle, we found that the strength of recurrent synapses should increase during the sequence disambiguation task, but afferent input should remain maximal throughout. Second, suppose that the relative energies of afferent input and recurrent excitation and inhibition were fixed. To model this, we multiplied afferent input by the same factor, αt+β, as recurrent connections. Then the necessary biasing input, shown in Figure 2, decreases less with increasing α; that is, a fall in the level of GABAB modulation produces less of an improvement in sequence disambiguation. (However, note that sequence disambiguation still improves when the strengths of both afferent input and recurrent connections change. Thus, phasic changes in postsynaptic excitability could
improve sequence disambiguation, but this improvement would be smaller than the improvement caused by changes in GABAB suppression.) The decrease in the energy of afferent input relative to recurrent inputs can be understood intuitively. Afferent input contributes what Hopfield and Tank (1985) call “data terms” to the energy function. These represent the locations of the starting point and goal. The energy contributions of recurrent excitation and inhibition represent the constraints that one continuous sequence should be active and only one sequence should be active, respectively. Data terms determine the general location of the global minimum of the energy function, whereas constraints refine the global minimum but also introduce local minima (Lin & Lee, 1995). The level of GABAB modulation is initially high, so that afferent input and the corresponding data terms dominate the energy function, causing the network to converge to the neighborhood of the global energy minimum. As the level of GABAB modulation falls, recurrent connections become stronger, enforcing constraints, so that the network zeros in on the global energy minimum in which exactly one continuous sequence is active. Because constraints are enforced only after the network has converged to the neighborhood of the global energy minimum, the network is less prone to become trapped in local minima. Thus, the correspondence between annealing and a fall in GABAB suppression elucidates specific mechanisms by which GABAB modulation improves sequence disambiguation: falling GABAB suppression both shifts the network from sampling multiple states to selecting the best one, and increases the strength of constraints relative to data terms. Figure 3 shows how these mechanisms improve the performance of the network described by equation 3.5. When the network includes linearly decreasing GABAB suppression, the initially strong GABAB suppression slows the subthreshold rise in activity. As a result, a2 rises well above threshold, but recurrent inhibition prevents a3 from crossing threshold before the end of the task (t_final = 50 msec). In contrast, in the absence of GABAB suppression, both a2 and a3 rise above threshold. When GABAB suppression is present but does not decrease, a3 does remain subthreshold, but a2 barely crosses threshold. Figure 3 does not indicate whether there exists a constant, intermediate level of GABAB suppression that produces the same final state, (a2(t), a3(t)), as the linearly decreasing GABAB suppression, shown in Figure 3A. In fact, one such constant level of GABAB suppression does exist. However, because this constant, intermediate level of GABAB suppression does not produce the slow initial activity rise that results from falling GABAB suppression, it leads to successful sequence disambiguation over a smaller range of network parameters. For example, after strengthening recurrent excitation, W, while holding all other parameters fixed, only a2 crosses threshold in the network with the linearly decreasing GABAB suppression, whereas both a2 and a3 cross threshold in the network with the constant, intermediate level of GABAB suppression. Thus, consistent with the results of optimization and
Figure 3: Activity in the network described by equation 3.5 when a2 is receiving biasing input. (A) The network includes linearly decreasing GABAB suppression. Under these conditions, a2 rises well above threshold. However, the initially strong GABAB suppression slows the subthreshold rise in activity, enabling recurrent inhibition to prevent a3 from crossing the threshold before the end of the task (t_final = 50 msec). (B) GABAB suppression is absent, so synapses remain at their maximum strengths. As a result, both a2 and a3 rise above threshold. (C) GABAB suppression is present but does not decrease, so that synapses remain at their minimum strengths. In this case, although a3 remains below threshold, synapses are so weak that a2 barely rises above threshold.
the analogy with annealing, falling GABAB suppression leads to successful performance over a broader range of network parameters than do constant levels of GABAB suppression. We have found analogous results in simulations of more complex networks (Sohal & Hasselmo, 1997). While GABAB suppression falls, neurons that belong to a sequence whose end point receives biasing input (and which therefore correspond to a2 ) become active, while neurons belonging to competing sequences (and corresponding to a3 ) do not. However, when GABAB suppression is fixed at an intermediate value, both sets of neurons remain active.
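The comparison behind Figure 3 can be rerun on the simplified network of equation 3.5. The sketch below is our own, with placeholder parameters; it integrates the system once with a linearly rising k(t) (falling suppression) and once with a constant intermediate k, so the two schedules' final states can be compared directly:

```python
def relu(x):
    return x if x > 0.0 else 0.0

def run(k_of_t, A=0.15, Abias=0.04, W=0.08, Wp=0.5, H=0.15, eta=0.05,
        theta=1.0, dt=0.1, t_final=50.0):
    """Euler-integrate equation 3.5 (interneuron replaced by linear inhibition).

    k_of_t maps time (msec) to the common factor k(t) = k'(t).
    Returns the final (a2, a3).
    """
    a2 = a3 = 0.0
    for step in range(int(t_final / dt)):
        k = k_of_t(step * dt)
        da2 = -eta*a2 + k*A + Abias + k*W*relu(a2 - theta) - k*k*H*Wp*a3
        da3 = -eta*a3 + k*A         + k*W*relu(a3 - theta) - k*k*H*Wp*a2
        a2, a3 = a2 + dt*da2, a3 + dt*da3
    return a2, a3

ramp = lambda t: 0.2 + (0.4 / 50.0) * t   # suppression falls: k rises 0.2 -> 0.6
flat = lambda t: 0.4                      # constant intermediate level
print("falling suppression :", run(ramp))
print("constant suppression:", run(flat))
```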
4 Relation to Behavior Spatial navigation may utilize hippocampal representations of location (Touretzky & Redish, 1996; Tsodyks et al., 1996). We have shown that the dynamics of sequence disambiguation may resemble those of annealing, and other studies have used simulated annealing by a Boltzmann machine for robot navigation (Lin & Lee, 1995). Thus, cycles of GABAergic modulation may contribute to making choices between multiple competing paths through the environment. Retrieval of specific temporal sequences among competing alternatives in the hippocampus may also be important for nonspatial tasks requiring relational processing. For example, Bunsey and Eichenbaum (1996) found that both control rats and rats with hippocampal lesions learn sample-associate pairs equally well, but that only control rats display transitivity. A network that represents sample-associate pairs as sequences can produce transitivity. In such a network, changes in GABAB would have a similar role for enhancing retrieval of the optimal relational representation from multiple possible representations. These results suggest that one function of the theta rhythm may be to cause rhythmic changes in GABAB suppression of synaptic transmission. Our analysis demonstrates that such changes in GABAB suppression may produce dynamics that resemble annealing and improve sequence disambiguation in the hippocampus. References Ault, B., & Nadler, J. V. (1982). Baclofen selectively inhibits transmission at synapses made by axons of CA3 pyramidal cells in the hippocampal slice. J Pharmacol Exp Ther, 223, 291–297. Barkai, E., & Hasselmo, M. E. (1994). Modulation of the input/output function of rat piriform cortex pyramidal cells. J Neurophys, 72, 644–658. Bunsey, M., & Eichenbaum, H. (1996). Conservation of hippocampal memory function in rats and humans. Nature, 379, 255–257. Buszaki, G., & Eidelberg, E. (1983). Phase relations of hippocampal projection cells and interneurons to theta activity in the anesthetized rat. Brain Res, 266, 334–339. Buszaki, G., Grastyan, E., Czopf, J., Kellenyi, L., & Prohaska, O. (1981). Changes in neuronal transmission in the rat hippocampus during behavior. Brain Res, 225, 235–247. Colbert, C. M., & Levy, W. B. (1992). Electrophysiological and pharmacological characterization of perforant path synapses on CA1: Mediation by glutamate receptors. J Neurophys, 68, 1–8. Fox, S. E., Wolfson, S., & Ranck, J. B. (1986). Hippocampal theta rhythm and the firing of neurons in walking and urethane anesthetized rats. Exp. Brain Res., 62, 495–508. Hinton, G. E., & Sejnowski, T. J. (1986). Learning in Boltzmann machines. In
D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press. Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biol Cybern, 52, 141–152. Jensen, O., & Lisman, J. E. (1996). Theta/gamma networks with slow NMDA channels learn sequences and encode episodic memory: Role of NMDA channels in recall. Learning and Memory, 3, 264–278. Lin, C. T., & Lee, C. G. S. (1995). A multi-valued Boltzmann machine. IEEE Trans Syst Man Cyber, 25, 660–669. Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810. O’Keefe, J., & Recce, M. L. (1993). Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus, 3, 317–330. Otis, T. S., Dekoninck, Y., & Mody, I. (1993). Characterization of synaptically elicited GABAB responses using patch-clamp recordings in rat hippocampal slices. J Physiol, 463, 391–407. Pontryagin, L. S., Boltyanskii, V. G., Gamkrelidze, R. V., & Mishchenko, E. F. (1962). The mathematical theory of optimal processes. New York: Interscience Publishers. Rudell, A. P., & Fox, S. E. (1984). Hippocampal excitability related to the phase of the theta rhythm in urethanized rats. Brain Res, 294, 350–353. Rudell, A. P., Fox, S. E., & Rank, J. B. (1980). Hippocampal excitability phaselocked to the theta rhythm in walking rats. Exp Neurology, 68, 87–96. Skaggs, W. E., McNaughton, B. L., Wilson, M. A., & Barnes, C. A. (1996). Theta phase precession in neuronal populations and the compression of temporal sequences. Hippocampus, 6, 149–172. Sohal, V. S., & Hasselmo, M. E. (1997). GABAB modulation improves sequence disambiguation in computational models of hippocampal region CA3. Hippocampus, in press. Stewart, M., & Fox, S. E. (1990). Do septal neurons pace the hippocampal theta rhythm? Trends Neurosci, 13, 163–168. Tang, A. C., & Hasselmo, M. E. (1994). Selective suppression of intrinsic but not afferent fiber synaptic transmission by baclofen in the piriform (olfactory) cortex. Brain Res, 659, 75–81. Touretzky, D. S., & Redish, A. D. (1996). Theory of rodent navigation based on interacting representations of space. Hippocampus, 6, 247–270. Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1996). Population dynamics and theta rhythm phase precession of hippocampal place cell firing: A spiking neuron model. Hippocampus, 6, 271–280. Wallenstein, G. W., & Hasselmo, M. E. (1997). GABAergic modulation of hippocampal population activity: Sequence learning, place field development, and the phase precession effect. J Neurophysiol, 78, 393–408. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J, 12, 1–24. Winson, J. (1978). Loss of hippocampal theta rhythm results in spatial memory deficit in the rat. Science, 201, 160–163.
Wyble, B. P., Linster, C., & Hasselmo, M. E. (1997). Evoked synaptic potential size depends on the phase of theta rhythm in rat hippocampus. Soc Neurosci Abstr, 23, 508. Received June 2, 1997; accepted October 2, 1997.
LETTER
Communicated by Laurence Abbott
RST: A Connectionist Architecture to Deal with Spatiotemporal Relationships J.-C. Chappelier EPFL—DI-LIA, Ecublens, CH-1015 Lausanne, Switzerland
A. Grumbach ENST, Dépt INF, F-75634 Paris Cedex 13, France
In the past decade, connectionism has proved its efficiency in the field of static pattern recognition. The next challenge is to deal with spatiotemporal problems. This article presents a new connectionist architecture, RST (réseau spatio-temporel [spatio-temporal network]), with such spatiotemporal capacities. It aims at taking into account at the architecture level both spatial relationships (e.g., as between neighboring pixels in an image) and temporal relationships (e.g., as between consecutive images in a video sequence). Concerning the spatial aspect, the network is embedded in actual space (two- or three-dimensional), the metrics of which directly influence its structure through a connection distribution function. For the temporal aspect, we looked toward biology and used a leaky-integrator neuron model with a refractory period and postsynaptic potentials. The propagation of activity by spatiotemporal synchronized waves enables RST to perform motion detection and localization in sequences of video images. 1 Introduction In natural video sequences, two major implicit relations hold: 1. Within a single image, the probability of having the same value is greater for two neighboring pixels than for two distant ones, since natural images are usually made of homogeneous connected regions. 2. Between images, the probability for corresponding pixels to have the same value is greater for consecutive images than for distant images. Although some consider one of these aspects or the other (LeCun et al., 1989; Béroule, 1987), most connectionist architectures do not explicitly take these two relations into account at the architecture level. If at all, these relations are usually considered only implicitly, at the training level. The main goal of our work is to design an architecture, called RST (réseau spatio-temporel
Neural Computation 10, 883–902 (1998)
© 1998 Massachusetts Institute of Technology
Figure 1: An example of RST architecture. Links starting from one neuron are plotted in gray, for one generation on the left-hand side, and for all generations on the right-hand side.
[spatio-temporal network]), firmly based on both spatial and temporal relations, and to integrate them at the architecture level.1 This article first presents the theoretical framework for RST and then focuses on experiments related to the application of RST to motion detection and localization. 2 RST Architecture The first part of this section is dedicated to the network model, where spatial relationships are handled, and the second part deals with the neuron model, where the time dimension is taken into account. Concerning the spatial aspect, the network is embedded in actual space (two- or three-dimensional), the metrics of which directly influence its structure through a connection distribution function. For the temporal aspect, we choose a leaky-integrator neuron model with a refractory period and postsynaptic potentials. 2.1 The Network. Following the arguments of Thorpe and Imbert (1989; Thorpe, 1990) claiming that subcortical information processing in the visual system could mainly be feedforward, and also to start with a simple case, a feedforward network was chosen for RST. A given number of neurons is randomly distributed (with a uniform distribution) in a portion of the two- or three-dimensional space, delimited by two segments (two planes in the three-dimensional case), respectively called input and output layers (see Figure 1). The characteristics of the problem treated by RST determine the sizes of both input and output layers, leading to a trapezoidal area for the network. Links are then created between the neurons according to their neighborhood

1 Neighborhood relations involved in feature maps are not related to the discussed point for two reasons: there is no time (Chappelier & Grumbach, 1996), and it needs training to make the architecture map the spatial input neighborhoods.
in the embedding space. A neuron has a probability Pc(r, γ) of being connected to a previous neuron. This probability is expressed in terms of the polar coordinates (r, γ) of the considered neuron with respect to the previous one.2 We choose to implement a generalization of multilayered architectures, which consists of keeping neighborhoods through local connectivity (Linsker, 1986; LeCun et al., 1989). The layered aspect is generalized by a probability distribution that has a peak at the average length between layers (for simplicity, we choose a gaussian distribution). The local aspect is parameterized by an angle α outside which no connection is possible. Figure 2 illustrates this probability of connection, mathematically given by:

Pc(r, γ) = Pmax · Hα(γ) · exp(−(r − m)² / (2σ²)),

where Pmax is a constant between 0 and 1, which stands for the maximum probability for a given neuron to have a connection; Hα(γ) is the angular gate function, zero everywhere except for |γ| < α where its value is 1; and m and σ are, respectively, the average and the standard deviation of the radial part. Within this framework, the architecture of the network can be varied smoothly from a quasi-continuous flow to a discrete multilayered network (obtained with Pmax = 1, α ≈ π/2, and σ ≪ m). The connections fixed, each link between any pair of neurons is randomly weighted using a uniform distribution on a given interval [wmin, wmax]. The parameters entering the construction of the RST network are as follows:

Ne: number of input neurons
N: total number of neurons
m: average length of connections
ρ: ratio between the length and height of the network
wmin, wmax: limits of the interval within which weights are chosen
Ns: number of output neurons
α: maximum angle for connection
σ: standard deviation of connection length
Pmax: greatest probability of connection for a given neuron

2 Notice that the probabilistic space associated to Pc is not the space (r, γ) but the more abstract set of all possible connection configurations for a given (r, γ). There is, therefore, no reason for ∫(r,γ) Pc(r, γ) dr dγ to be 1; never-connected neurons (i.e., for which ∫(r,γ) Pc(r, γ) dr dγ = 0) actually exist.

Figure 2: An illustration of the probability of a neuron's being connected with a neuron centered in (0,0), expressed in terms of the coordinates (X,Y) in RST network. On the left-hand side, the probability is presented in shades of gray, with black standing for 1 (areas where neurons are almost sure to be connected with the considered neuron). The right-hand side shows the same probability displayed in three dimensions. The parameters used for this illustration are m = 100, σ = 30, α = 0.3, and Pmax = 1.0.
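As an illustration of this construction, the sketch below (our own; the rectangular embedding, neuron count, and seed are arbitrary, while m, σ, α, and Pmax are the Figure 2 values) scatters neurons in two dimensions and wires them feedforward with probability Pc(r, γ):

```python
import math
import random

def pc(r, gamma, m=100.0, sigma=30.0, alpha=0.3, p_max=1.0):
    """Pc(r, gamma) = Pmax * H_alpha(gamma) * exp(-(r - m)^2 / (2 sigma^2))."""
    if abs(gamma) >= alpha:
        return 0.0  # angular gate H_alpha
    return p_max * math.exp(-(r - m) ** 2 / (2.0 * sigma ** 2))

def build_network(n=200, length=400.0, height=100.0, seed=0):
    """Scatter n neurons uniformly in a rectangle; wire them feedforward."""
    rng = random.Random(seed)
    pos = [(rng.uniform(0, length), rng.uniform(0, height)) for _ in range(n)]
    links = []
    for i, (xi, yi) in enumerate(pos):
        for j, (xj, yj) in enumerate(pos):
            if xj >= xi:
                continue  # only "previous" neurons (closer to input) feed i
            r = math.hypot(xi - xj, yi - yj)
            gamma = math.atan2(yi - yj, xi - xj)  # angle to feedforward axis
            if rng.random() < pc(r, gamma):
                links.append((j, i))
    return pos, links

pos, links = build_network()
print(len(links), "links among 200 neurons")
```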
2.2 The Neuron Model. What kind of neuron model can be chosen in order to take the time dimension into account at the neuron level (Hérault & Jutten, 1994)? We focused on the simplest yet biologically inspired temporal neuron model: a leaky integrator with external refractory period and postsynaptic potentials (PSPs), as done, for instance, in NeuroBioCluster by Azmy and Vibert (Azmy, 1992; Vibert, Pakdaman, & Azmy, 1994).3 The implemented model is mainly described by two variables: a membrane potential V and a threshold θ. V is the sum of a specific potential U and an external potential I, which stands for the input of the neuron. Most of the time V is less than θ. Whenever V reaches θ, the neuron fires. It sends a spike to the downstream network and changes its state as follows: θ is increased by an amount Δθ (called adaptation or fatigue), and the specific potential U is lowered down to a postspike value U0. When the neuron does not fire, the variables U and θ decay exponentially to their resting values U∞ and θ∞ (with time constants τU and τθ). In addition to U and θ, another variable s, called shunt, is introduced in order to reproduce qualitatively the refractory period of the neuron. This shunt function modulates the input I of the neuron, so that the potential V actually is V(t) = U(t) + s(t) · I(t). Whenever the neuron fires, s drops to 0. After the firing of the neuron, s grows exponentially up toward its resting value 1 with the same time constant as U.

3 For more details on integrate-and-fire neurons and neuron modeling, refer to MacGregor and Lewis (1977).
Figure 3: All the variables involved in RST neuron model. Top to bottom: Some input I of a given neuron, the shunt function s, and the product s · I, which is added to the specific potential U (dotted line) so as to obtain the membrane potential V, which is compared with the threshold θ. When V reaches θ , a spike is sent. This spike is transformed by synapses into a PSP (represented here as out), which the downstream neurons actually get as input.
The input I of the neuron is the sum over space (all the input neurons of the considered neuron) and time (all the firing instants of its input neurons). In order to mimic biological synaptic transmission and provide temporal robustness, the spikes sent by inputs are received as PSPs described by a function ψ.4 The input I becomes:

I(t) = Σ_{j=1}^{NI} wj ( Σ_{k=1}^{nj} ψ(t − Tk^j) ),
where NI is the number of inputs of the considered neuron, wj the weight of the jth input link, nj the number of spikes previously sent by the jth input, and Tk^j the date of the kth spike of the jth input. Figure 3 illustrates the behavior of the neuron. Its parameters and their values in RST are given in Table 1.
4 Spikes are then represented only by their firing time.
Table 1: Parameters for RST Neuron Model.

Parameter   Description                          Value
dt          Simulation time step                 0.5 ms
τU          Decay time constant for potential    6.0 ms
τθ          Decay time constant for threshold    6.0 ms
τψ          Time constant for PSPs               2.5 ms
U0          Postspike value of potential         −77.0 mV
U∞          Resting value for potential          −65.0 mV
θ∞          Resting value for threshold          −62.0 mV
Δθ          Postspike increase of threshold      40.0 mV
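A discrete-time sketch of the state update described above, using the Table 1 values, may help fix ideas. The exact update order and the discretization of the exponential decays are our assumptions, not the authors' implementation.

```python
import math

# Parameters from Table 1 (ms and mV).
DT, TAU_U, TAU_THETA = 0.5, 6.0, 6.0
U0, U_INF, THETA_INF, DTHETA = -77.0, -65.0, -62.0, 40.0

def step(U, theta, s, I):
    """One simulation step of the RST neuron: returns (U, theta, s, fired).
    V = U + s*I is compared with the threshold theta; on a spike, theta is
    raised by DTHETA (fatigue), U drops to U0, and the shunt s drops to 0."""
    V = U + s * I
    fired = V >= theta
    if fired:
        theta += DTHETA
        U = U0
        s = 0.0
    else:
        # Exponential relaxation toward the resting values (discretized).
        U += (U_INF - U) * (1.0 - math.exp(-DT / TAU_U))
        theta += (THETA_INF - theta) * (1.0 - math.exp(-DT / TAU_THETA))
        s += (1.0 - s) * (1.0 - math.exp(-DT / TAU_U))  # same time constant as U
    return U, theta, s, fired

# A constant suprathreshold input produces periodic firing, as in section 3.1.
U, theta, s = U_INF, THETA_INF, 1.0
for _ in range(100):
    U, theta, s, fired = step(U, theta, s, I=10.0)
```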
2.3 Spatial Activity Propagation. Under which conditions will some input pattern generate activity through the network up to the output layer? Let us first consider the spatial aspect of this question (i.e., summing over time); temporal considerations will be made in the next section. In order to describe spatial propagation easily, we make the binary neuron approximation, since spikes are produced with respect to a formula similar to the usual binary one,

$$A_i = H\Bigl(\sum_{j=1}^{N_i} w_{ij}\,A_j - \theta_i\Bigr),$$
where H is the Heaviside step function and N_i the number of inputs of neuron i, which is a random variable between 0 and the number N of neurons in the network. w_ij stands for the weight of the link between neuron i and its jth input. A_j is the activity of the jth input of neuron i, and θ_i the threshold of neuron i. It should be regarded here as the average difference between the actual integrate-and-fire threshold θ and the membrane potential U of the considered neuron i. Because the actual values of θ − U for each neuron in a running network are not known, let us consider θ_i as a uniform random variable between 0 and θ_max. Since inactive neurons do not contribute to the sum, the previous expression can also be written as

$$A_i = H\Bigl(\sum_{j:\,A_j=1} w_{ij} - \theta_i\Bigr),$$
that is, summing only on active inputs. For enough active inputs per neuron—that is, for the main core of spatial activity propagation—this sum of random variables can be approximated using the central limit theorem,

$$\sum_{A_j=1} w_{ij} \;\approx\; \mathcal{N}\Bigl(N_A(i)\,\bar{w},\; \sqrt{N_A(i)}\,\sigma_w\Bigr),$$
where $\mathcal{N}(\bar{x}, \sigma_x)$ stands for a gaussian law with average x̄ and standard deviation σ_x, N_A(i) the number of active input neurons of neuron i, and w̄ and σ_w the average and standard deviation of the weights. If the deviation of this approximation is neglected and the local density of activity d_A(X, Y) is introduced (i.e., the probability for a neuron at coordinates (X, Y) in the network to be active), it can be shown that

$$d_A(X, Y) = \Bigl(N_A(X, Y)\,\frac{\bar{w}}{\theta_{max}}\Bigr)_{1+},$$

with (x)_{1+} being the function that is 0 for x < 0, the identity for x ∈ [0, 1], and 1 for x > 1. From the definition of both N_A and d_A comes

$$d_A(X, Y) = \left(\frac{d\,\bar{w}}{\theta_{max}} \int_0^X \int_{Y - \tan\alpha\,(X-u)}^{Y + \tan\alpha\,(X-u)} d_A(u, v)\,P_c(X-u,\,Y-v)\;du\,dv\right)_{1+},$$
where d is the density of the neuron point process. This provides a functional equation in d_A that can be computed. The result of such a computation is compared with the actual activity distribution in Figure 4. If the standard deviation σ of the connection probability is low enough (with respect to the average m) so as to obtain a layered network, the density of activity can be computed further. In this case, the density of activity for a layer n + 1 with respect to the density of activity in layer n becomes

$$d_A(n+1) = \Bigl(N(n)\,\frac{\bar{w}}{\theta_{max}}\,d_A(n)\Bigr)_{1+}.$$

If the number of inputs per neuron N(n) does not depend on the layer n, we have

$$d_A(n) = \Bigl(d_A(0) \cdot \exp\Bigl(\ln\Bigl(\frac{N\,\bar{w}}{\theta_{max}}\Bigr)\cdot n\Bigr)\Bigr)_{1+},$$
whose behavior is illustrated in Figure 5. It can also be shown that in this case, the variation of the density of activity with respect to Y (perpendicular to the network propagation) is in O(Y²) because it is proportional to the ratio of input areas. To summarize the spatial aspect of activity propagation, the key parameter that allows activity to reach the output layer is K = N w̄/θ_max. If K < 1, the activity does not propagate up to the output layer; it dies out before, and no response is observed. If K > 1, there is a value X_c of X after which all the neurons are active within the connection cone: X_c = −ln d_A(0)/ln K. A minimal numerical sketch of these two regimes follows.
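The layered recursion and the critical abscissa are easy to check numerically. This sketch is ours, not the article's; the values of d_A(0) and K are arbitrary.

```python
import math

def clip01(x):
    """The (x)_{1+} function: 0 below 0, identity on [0, 1], 1 above 1."""
    return min(max(x, 0.0), 1.0)

def density_per_layer(d0, K, layers):
    """Iterate d_A(n+1) = (K * d_A(n))_{1+}, with K = N * w_mean / theta_max."""
    history = [d0]
    for _ in range(layers):
        history.append(clip01(K * history[-1]))
    return history

d0, K = 0.05, 1.8
print(density_per_layer(d0, K, 8))    # K > 1: activity grows and saturates
print(density_per_layer(d0, 0.7, 8))  # K < 1: activity dies out
print(-math.log(d0) / math.log(K))    # X_c = -ln d_A(0) / ln K (about 5.1 here)
```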
Figure 4: Activity distribution. Comparison between the result of an actual simulation (with the leaky integrator neuron model) in RST (top) and the corresponding computed analytical (i.e., binary neuron approximation) result (bottom). In the simulation, only active neurons over accumulated time are represented (black squares). The analytical result shows the density of active neurons d_A(X, Y) in shades of gray, with black standing for 1 (i.e., all neurons in the corresponding area are active). Parameters used for the illustration are θ_max = 0.03, w̄ = 0.005, d = 0.0276, α = 0.2, m = 100.0, and σ = 20.0.
2.4 Temporal Propagation. We focus here on the temporal aspect of the propagation and show how it allows RST to filter static input patterns from moving ones.

2.4.1 Local Point of View. As illustrated by Figure 6, a given neuron acts as a synchrony detector (at the firing rates we are working with in RST). The more synchronized (i.e., close in time) its inputs are, the higher the sum the neuron receives, and therefore the higher its probability to fire. Having PSPs instead of spikes is crucial for this aspect; it provides the system with some robustness to temporal noise (i.e., fluctuations in spike dates), since PSPs spread each spike out in time.

2.4.2 Global Point of View. For the temporal propagation through the network, waves of spikes going from input to output layer can be observed, as shown in Figure 7. Waves come from the fact that it takes precisely one
Figure 5: The two kinds of behavior for the density of activity with respect to the value of the constant K = N w̄/θ_max. In the case where K < 1 (left), the activity does not propagate to the output layer; it dies out before, and no response is observed in the output layer. In the case where K > 1 (right), the value of X after which all the neurons are active is X_c = −ln d_A(0)/ln K. The case K = 1 is not considered, as it has zero probability of occurrence.
Figure 6: Different cases of temporal summation at the neuron level. The smaller the delay between two inputs, the higher the resulting potential, and therefore the higher the probability of firing.
Figure 7: Temporal propagation in RST: waves of synchronous spikes (from left to right and top to bottom).
time step to go from one neuron to the next and that neurons have a higher probability of firing when their inputs are synchronized, as previously explained. These two aspects lead to amplification of synchronized wave propagation (van Hemmen, Domany, & Schulten, 1995), as appears, for instance, in the synfire chains of Abeles (Abeles, 1982; Abeles, Prut, Bergman, & Vaadia, 1994). It has been claimed several times that synchronous firing may be crucial for perception (Milner, 1974; von der Malsburg, 1981; Crick & Koch, 1990), and it has been experimentally traced in biological systems for more than 15 years (Toyama, Kimura, & Tanaka, 1981a, b; Abeles, 1982; Aertsen, Gerstein, Habib, & Palm, 1989; Gray, König, Engel, & Singer, 1989).

2.4.3 Motion Detection Thanks to Synchronization. Spike synchronization plays the main role in RST for filtering static input patterns from moving ones. Two phases occur in motion detection by RST: the creation of activity a few neurons ahead of the input layer and the propagation of this activity. For static input objects (i.e., objects having been input for several time steps), the activity a few layers after the input is unsynchronized, composed of neurons firing at different frequencies. This is due to the fact that neurons are connected with different weights (for σ_w ≠ 0). Having different weights, they fire at different frequencies and therefore have a small probability of being synchronized after a few spikes (this probability is related to the probability that these downstream neurons have the same sum of input weights). Concerning the propagation of this unsynchronized initial activity, in the case where K (as defined in section 2.3) is less than 1, the total activity decays as it goes away from the input layer. For moving objects, the triggered activity is composed of two transitory waves of spikes (for the current position and the previous one). These transitory
waves are almost synchronized because they are fired at almost the same time. Being synchronized, these waves are amplified by the network and have a better chance of reaching the output layer. The more synchronized the waves are, the more neurons are activated, increasing the local density d_A of active neurons. These two aspects—synchronized transitory waves for moving objects versus a conic area of unsynchronized activity for static objects on the one hand, and more efficient propagation in synchronized cases than in unsynchronized ones on the other hand—properly combined and adjusted by means of RST parameters, allow motion detection and localization. This phenomenon can be illustrated by considering the following input patterns: a small object of size s moving in an area of size S versus a large static object of size S (the same size as the wandered area). Since more input neurons are activated at a given time by the large-size pattern than by the small, moving one, the activity created a few neurons ahead of the input layer is greater in the second (static) case than in the first one. On the other hand, the activity triggered by the moving object, being more synchronized than in the static case, is more amplified by the network. Moving away from the input layer, an area is reached where the two activities are comparable, as illustrated in Figure 8. Before this area, the activity corresponding to the static object is greater than the activity corresponding to the moving object. After this area, it is the opposite. In order to filter moving objects from static ones, the RST output layer has to lie beyond this critical area for every possible s and S (for a given task). This point is investigated by the experiments presented in the following section.
3 Experiments

Experiments presented here focus on motion detection and localization. The aim is that output units should fire only where and when there is a moving object in the input sequence. We first present experiments related to static input objects, then to moving ones. The effect of RST parameters on the results is discussed last. Except when explicitly mentioned, experiments were carried out with the following RST configuration: a three-dimensional network with Ne = 144 input neurons (12 × 12 input images, one pixel of the image corresponding to one neuron in the input layer) and Ns = 144 output neurons, a total number of N = 5000 neurons (the density was then d = 0.18), α = 0.3, P_max = 1.0, a network length, measured in an arbitrary unit (u), of 400 u (so ρ was 0.755), m set to 80 u so as to get six layers (including the input and output ones), σ = 22 u, which is small enough with respect to m to actually obtain layers, w_min = 0.0, and w_max = 0.9.
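For quick reference, this baseline configuration can be written down in one place. The following Python dictionary simply restates the values above; the key names are our own, hypothetical ones.

```python
# Baseline RST configuration used in section 3 (key names are illustrative).
RST_CONFIG = {
    "n_input": 144,     # 12 x 12 input layer
    "n_output": 144,
    "n_total": 5000,    # gives density d = 0.18
    "alpha": 0.3,       # connection cone half-angle
    "p_max": 1.0,
    "length_u": 400,    # network length, arbitrary unit u
    "rho": 0.755,
    "m_u": 80,          # mean connection length, giving six layers
    "sigma_u": 22,      # std of connection length (small vs. m: layered net)
    "w_min": 0.0,
    "w_max": 0.9,
}
```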
Figure 8: An illustration of the parameter area where ambiguity between large, static objects and smaller, moving ones is maximum. (Top) A static input object and the corresponding response of RST. (Bottom) A smaller moving object and the corresponding response of RST. The two responses are similar.
For the network response, we focused on the global activity of the output layer (the average interspike interval), the size of the response measured by the total number of spiking neurons in the output layer, and, for static inputs, the period of the spike bursts generated in the network.

3.1 Static Input Patterns. Experiments consist of presenting a persistent static object (a fully active disc) and measuring the response of RST for different object sizes. The size of an object in the input image is measured by the number of activated neurons in the image (i.e., the area of the object). Static input patterns should present a qualitatively different behavior above some critical size: small objects do not provide enough activity to reach the output (the die-out case mentioned in Figure 5); for larger objects, the activity cone reaches the output, and the size of the response increases linearly with the size of the input pattern. This behavior is confirmed by experiments, as shown in Figure 9. Because of the refractory period of the neurons, the response of the network is not constant even for a constant input: it is periodically active and silent. This period increases with input size, as presented in part c of Figure 9.
Figure 9: Three different features of the response of RST to a static input object, plotted with respect to the size of this object expressed in pixels: the duration of interspike intervals (ISI) (a), the size of the response (b), and its period (the average time between two successive waves of spikes) (c). A threshold-type activity can be seen. In (a), it is obtained for input sizes above 10 pixels. In (b), no response at all is obtained for sizes below 5 pixels, and a roughly linear dependence for input sizes above 6 pixels. The middle curve is the average over several experiments, and the two others are plus and minus one standard deviation. The difference between the thresholds in case (a) and case (b) shows that the few firing neurons for input sizes less than 10 (b) are firing very sparsely (ISI > 35 ms in (a)). In (c), the threshold behavior is observed for sizes greater than 20 pixels. For small sizes, the period of spike waves is so large that it can be considered that no response comes out of the network. On the other hand, RST triggers periodic responses for sizes greater than 20 pixels with almost the same period (around 20 dt in the present experiments).
Figure 10: Size of the response versus inverse speed (in dt) of objects in input images plotted for several sizes of a moving input object. For small speeds, the response is almost constant (as if the presented object were static). For speeds above 1/10 = 0.1 dt−1 , the response is more salient and increases with speed. The threshold speed of 0.1 dt−1 is independent of the size of the moving object.
3.2 Moving Objects. The same kinds of experiments were made with moving objects. The speed of objects (constant for a given experiment) is measured by the number of time steps dt between two moves. For instance, an object that is still for 5 time steps, moves one pixel left, stays again for 5 time steps, moves one pixel left, and so forth, has a speed of 1/5 = 0.2 dt⁻¹. No motion was faster than 1 dt⁻¹. Figure 10 shows that above a critical speed (1/10 = 0.1 dt⁻¹ in the presented experiments), the response of the network increases with the speed of the presented object. Below this critical value, the response is almost independent of the speed. Furthermore, the critical speed is itself independent of the size of the moving object.
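The speed convention (one pixel move every n time steps, i.e., 1/n dt⁻¹) can be made concrete with a toy stimulus generator. This is our illustrative sketch, not the authors' stimulus code; sizes and geometry are arbitrary.

```python
def moving_disc_frames(size=12, radius=2, steps_per_move=5, n_frames=30):
    """Binary input frames (lists of lists) with a disc moving one pixel left
    every `steps_per_move` time steps, i.e., a speed of 1/steps_per_move dt^-1."""
    frames = []
    cx, cy = size - 1 - radius, size // 2
    for t in range(n_frames):
        if t > 0 and t % steps_per_move == 0:
            cx -= 1  # one-pixel move to the left
        frame = [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
                  for x in range(size)] for y in range(size)]
        frames.append(frame)
    return frames

frames = moving_disc_frames()  # speed 1/5 = 0.2 dt^-1, as in the example above
```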
Figure 11: Application of RST network to motion detection. The response of the network (spiking neurons) is superimposed as white squares on the original images.
3.3 Processing Real-World Images. In order to illustrate the performance of RST beyond the preceding simple examples—although our purpose is not to build a spatiotemporal model of the retina as done, for instance, by Hérault and Beaudot (1993; Beaudot, 1994)—we have processed real-world video sequences showing walking humans in front of some static objects (a window). For this application, illustrated in Figure 11, RST parameters were set as follows: a three-dimensional network with 36 × 24 input neurons, 36 × 24 output neurons, and a total of 25,000 neurons, a length of 400 u, P_max = 1.0, m = 80 u (leading to 6 layers), σ = 22 u, α = 0.3, w_min = 0.1, and w_max = 0.65. Consecutive images of the video sequence were presented one after another to the RST input layer. These images were binarized so as to fit RST inputs. As shown in Figure 11, the firing neurons in the output layer were (in a reproducible manner) those corresponding to the moving objects in the input layer, although static objects sometimes gave a response (in a nonreproducible manner), as illustrated by the window response in the sixth image.

4 Discussion

4.1 Effect of Architecture Parameters. How are the preceding results influenced by the RST construction parameters? The only parameters that influence the activity propagation are the average weight (w̄), the maximum difference between neuron threshold and potential (θ_max), and the average
Figure 12: Evolution of the size of the response to an input object of size 12 for both static and dynamic cases, with respect to (a) the weight average, (b) the connection angle, and (c) the standard deviation of the connection length.
number of inputs per neuron (N). Among the construction parameters, only α and the standard deviation of the radial connection part (σ) influence this number N: the larger they are, the more inputs a given neuron may have. Figure 12 shows the effect of the three influential parameters: w̄, α, and σ. In order not to bias the experiments, the neuron density was kept the same for all experiments. Since only the weight average appears in the formulas, the standard deviation of the weights (σ_w) is not a significant parameter for the global behavior of the network (as long as it is not zero). However, the larger σ_w is, the more variability is found in the behavior (as expected). Parameters m and ρ change only the number of layers before the output is reached. If their variations are not too big, the same kinds of results are obtained, as confirmed by experiments.

4.2 Representativity of the Output. How can an output response of RST be representative of some input sequence? Concerning the influence of
Table 2: Influence of the Characteristics of the Input Image on the Characteristics of the Output Image.

                          Output Image
Input          Size   Activity   Period   Shape   Orientation
Size            X        X         X        X
Speed           X        X         X
Density         X        X         X        X
Gap (a)         X                           X
Direction (b)                                          X

Note: (a) Distance between two input objects. (b) Direction of the movement. It influences the orientation of the response, which is perpendicular to the movement direction.
characteristics of an input image on the observed response (see Table 2), we can notice the following, beyond the influences of size and speed already discussed:

1. When several objects are present in the input image, the density of input objects plays a role similar to object size, since the number of active neurons is directly related to the density of objects. Moreover, density also affects the shape of the response, because of the aggregation of several single-object responses that occurs when the density is too high. The distance between input objects (called the gap) also presents this aggregating effect when it is too small. However, the gap does not affect the total activity, since changing it does not change the total number of active input neurons.

2. The direction of the movement in the input plane influences the orientation of the output response. This response has its longer axis perpendicular to the direction of the movement. This is due to the fact that the activity triggered by a moving object results from the intersection of the two cones of activity of the current position and the previous one.

In the most general case, however, output responses are not fully representative of the input sequences. For instance, very large static objects can also trigger activity in the output layer. Such situations should be avoided for normal use of RST in the context of motion detection. Another ambiguous situation may occur when overlapping objects, involving a large one and a smaller, moving one, are present. Even if the static object is not so large as to produce activity in the output layer, it still creates activity within the network. This activity may be amplified by the nearby activity of the overlapping moving object. This is precisely what happens in the sixth image (first column, second row) of
Figure 11, where the activity of the window is amplified by the activity of the pedestrian. However, this phenomenon, due to a particular configuration of spatial activity that happens to be synchronized (and therefore amplified) with the moving one, is not systematically reproducible, even when an identical input sequence is presented again.

5 Conclusion and Perspectives

Neural networks have so far been successful mainly in the area of static information processing, such as pattern recognition, classification, and vector quantization. Their application to problems where time plays an important role is still under development. In this context, the aim of our research was to design and test a new connectionist model suitable for spatiotemporal problems, by keeping spatial neighborhoods within the network and using a temporal neuron model. From an experimental point of view, RST has shown a straightforward (without learning) ability for spatiotemporal pattern processing. As a confirmation of the approach, an application to motion detection and localization in real-world video sequences was presented. The most important parameters of the architecture for this kind of application are the connection angle, the standard deviation of the connection length, and the weight average. This work should be regarded as a starting point of a wide-open research framework. Among the possibilities for future work, we would like to emphasize two. All the results presented so far were obtained without learning. So, first, it would be worth studying learning in RST so as to obtain more complex functionalities than motion detection (e.g., motion recognition). For such upgraded capacities, the principle of preserving spatiotemporal neighborhoods does not suffice, since this principle leads only to low-pass spatiotemporal filtering. Learning that both preserves the previous capabilities and improves the structure of the network could provide higher-level functionalities. In order to be efficient, such learning should be embedded into some actual perception-action loop. Application to problems such as visual focusing on moving objects (tracking) could be considered a relevant goal. Another important research issue concerns the dynamical construction of the RST network, referred to as network growth. Inspired by living cells (Changeux, 1983), RST growth could rely on the two following principles (sketched in code below):

• Selective stabilization: Among a large set of connections, only those that are used often must be kept; the other connections are removed.

• Neural growth: When two neurons are close enough and have correlated activities, a connection is created between them.

RST architecture, taking the geographical position of neurons into account, can handle these principles.
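To make the two growth principles concrete, here is a small illustrative sketch; everything in it (data structures, thresholds, function name) is hypothetical, since the article proposes the principles but does not specify an implementation.

```python
def grow_and_prune(connections, usage, activity_corr, positions,
                   use_min=0.05, corr_min=0.8, dist_max=1.0):
    """Sketch of the two proposed growth principles (all thresholds are
    hypothetical). `connections` is a set of (i, j) pairs, `usage[(i, j)]`
    a normalized usage frequency, `activity_corr[(i, j)]` a correlation of
    recent activities, and `positions[i]` a 3D coordinate of neuron i."""
    # Selective stabilization: drop rarely used connections.
    connections = {c for c in connections if usage.get(c, 0.0) >= use_min}
    # Neural growth: connect close neuron pairs with correlated activities.
    for (i, j), corr in activity_corr.items():
        dist = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j])) ** 0.5
        if corr >= corr_min and dist <= dist_max:
            connections.add((i, j))
    return connections
```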
Acknowledgments

This research was funded by Direction des Enseignements Supérieurs des Télécommunications, France Télécom, Paris. We are grateful to the two anonymous reviewers; to D. Bonnet, D. Verna, and S. Midenet for their useful comments; to P. Weyerbrown for his careful reading of the manuscript for this article; and to D. Mary for his help in the experiments.
References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Prut, Y., Bergman, H., & Vaadia, E. (1994). Synchronization in neuronal transmission and its importance for information processing. In G. Buzsáki, R. Llinás, W. Singer, A. Berthoz, & Y. Christen (Eds.), Temporal coding in the brain (pp. 39–50). Berlin: Springer-Verlag.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61(5), 900–917.
Azmy, N. (1992). Un modèle connexionniste de la genèse du rythme respiratoire: Épistémologie, simulations et implications. Doctoral dissertation, Paris VI University.
Beaudot, W. H. A. (December, 1994). Le traitement neuronal de l'information dans la rétine de vertébrés: Un creuset d'idées pour la vision artificielle. Doctoral dissertation, Institut National Polytechnique de Grenoble.
Béroule, D. (1987). Guided propagation inside a topographic memory. In First Int. Conf. on Neural Networks (pp. 469–476). San Diego: IEEE.
Changeux, J.-P. (1983). L'homme neuronal. Paris: Fayard.
Chappelier, J.-C., & Grumbach, A. (March, 1996). A Kohonen map for temporal sequences. In Proceedings of NEURAP'95 (pp. 104–110). Marseille, France.
Crick, F., & Koch, C. (1990). Towards a neurobiological theory of consciousness. Seminars in Neurosciences, 2, 263–275.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory response in cat visual cortex. Nature, 338, 334–337.
Hérault, J., & Beaudot, W. (April, 1993). Motion processing in the retina: About a velocity matched filter. In M. Verleysen (Ed.), Proceedings of the European Symposium on Artificial Neural Networks (pp. 129–136). Brussels, Belgium.
Hérault, J., & Jutten, C. (1994). Réseaux neuronaux et traitement de signal. Paris: Hermès.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Linsker, R. (October, 1986). From basic network principles to neural architecture: Emergence of spatial opponent cells (1). Proceedings of the National Academy of Sciences USA, 83, 7508–7512.
MacGregor, R. J., & Lewis, E. R. (1977). Neural modeling: Electrical signal processing in the nervous system. New York: Plenum Press.
Milner, P. M. (1974). A model for visual shape recognition. Psychological Review, 81, 521–535.
Thorpe, S. J. (1990). Spike arrival time: A highly efficient coding scheme for neural networks. In R. Eckmiller, G. Hartmann, & G. Hauske (Eds.), Parallel processing in neural systems and computers (pp. 91–94). Amsterdam: Elsevier.
Thorpe, S. J., & Imbert, M. (1989). Biological constraints on connectionist models. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, & L. Steels (Eds.), Connectionism in perspective. Amsterdam: Elsevier.
Toyama, K., Kimura, M., & Tanaka, K. (1981a). Cross-correlation analysis of interneuronal connectivity in cat visual cortex. Journal of Neurophysiology, 46(2), 191–201.
Toyama, K., Kimura, M., & Tanaka, K. (1981b). Organization of cat visual cortex as investigated by cross-correlation analysis. Journal of Neurophysiology, 46(2), 202–214.
van Hemmen, J. L., Domany, E., & Schulten, K. (Eds.). (1995). Models of neural networks II: Temporal aspects of coding and information processing in biological systems. New York: Springer-Verlag.
Vibert, J.-F., Pakdaman, K., & Azmy, N. (1994). Interneural delay modification synchronizes biologically plausible neural networks. Neural Networks, 7(4), 589–607.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen, Germany: Department of Neurobiology, Max Planck Institute for Biophysical Chemistry.

Received July 17, 1996; accepted August 19, 1997.
LETTER
Communicated by Shimon Ullman
A Neural Model of Contour Integration in the Primary Visual Cortex Zhaoping Li Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
Experimental observations suggest that contour integration may take place in V1. However, there has yet to be a model of contour integration that uses only known V1 elements, operations, and connection patterns. This article introduces such a model, using orientation selective cells, local cortical circuits, and horizontal intracortical connections. The model is composed of recurrently connected excitatory neurons and inhibitory interneurons, receiving visual input via oriented receptive fields resembling those found in primary visual cortex. Intracortical interactions modify initial activity patterns from input, selectively amplifying the activities of edges that form smooth contours in the image. The neural activities produced by such interactions are oscillatory, and edge segments within a contour oscillate in synchrony. It is shown analytically and empirically that the extent of contour enhancement and neural synchrony increases with the smoothness, length, and closure of contours, as observed in experiments on some of these phenomena. In addition, the model incorporates a feedback mechanism that allows higher visual centers selectively to enhance or suppress sensitivities to given contours, effectively segmenting one from another. The model makes the testable prediction that the horizontal cortical connections are more likely to target excitatory (or inhibitory) cells when the two linked cells have their preferred orientation aligned with (or orthogonal to) their relative receptive field center displacements.

1 Introduction

In early stages of the visual system, individual neurons are responsive only to stimuli in their classical receptive fields (RFs), which are only large enough to signal a small edge or contrast segment in the input (Hubel & Wiesel, 1962). The visual system must group separate local input elements into meaningful global features to infer the visual objects in the scene. Sometimes local features group together into regions, as in texture segmentation; at other times, they group into contours that may represent boundaries of underlying objects. Although much is known about the early visual processing steps that extract local features such as oriented edges, it is still unclear how the brain groups local features into global and more meaningful features. In this study, we model the neural mechanisms underlying the
Neural Computation 10, 903–940 (1998) © 1998 Massachusetts Institute of Technology
first stages of grouping of edge elements into contours—namely, contour integration. One of the first problems for contour grouping is the abundance of candidate “edges” produced by the simple edge detection mechanism that is believed to operate in V1 (Marr, 1982). Many of these edges are simply image contrast noise and are unlikely to belong to any significant or relevant object contour. It is desirable to influence the response of edge detectors by contextual information from the surround to enhance the sensitivity to more relevant edges. This could be the first step toward perceptual contour grouping. Indeed, V1 cells are observed to change their responses or sensitivities depending on the surround stimulus (Knierim & van Essen, 1992; Kapadia, Ito, Gilbert, & Westheimer, 1995); cells are more responsive if they are stimulated by edges that are aligned with other edge elements outside their RFs (Kapadia et al., 1995). These observations correspond well with psychophysical observations that human sensitivity to edge segments is also higher when they are aligned with other edges (Polat & Sagi, 1994; Kapadia et al., 1995). Horizontal cortical connections linking cells of nonoverlapping RFs have been observed and hypothesized as the underlying neural substrate (Rockland & Lund, 1983; Gilbert, 1992). These findings suggest that simple and local neural interactions even in V1 could contribute to primitive visual perceptual grouping as in contour integration, although V1 cells have been observed to change their sensitivities by visual attention (Motter, 1993). It is computationally desirable to understand how V1, with its small-size RFs and finite-range neural interactions, can nevertheless enhance sensitivities to globally defined contours. Models of visual algorithms have been proposed for similar purposes. For example, Zucker, Dobbins, and Iverson (1989) used relaxation labeling techniques (Hummel & Zucker, 1983) to perform global curve detection or inference using only local compatibility interactions between curve segment labels. However, modeling the contour enhancement using known V1 neural RFs and interactions is difficult and has yet to be accomplished. It is thus not clear if all the experimentally observed contour enhancement (Kapadia et al., 1995) has to be attributed to feedback from higher visual areas. The difficulty in the modeling largely stems from the dynamic stability problems associated with recurrently connected neural networks; the system has to be sensitive enough to enhance any possible smooth contours the input may provide, but selective and stable enough so that noise is not significantly enhanced. To overcome this difficulty, previous neural models of contour enhancement have assumed some nonneural or biologically questionable operations such as dipole fields and the requirement of feedback from higher centers (Grossberg & Mingolla, 1985), nonlinear rule-based interactions (Zucker et al., 1989; Heitger & von der Heydt, 1993), input gating (Braun, Niebur, Schuster, & Koch, 1994), and global activity normalizations (Yen & Finkel, 1997). In fact, modeling contour enhancement is difficult even without the constraints of the neural
hardware. Many computer vision models of edge linking (e.g., Kass, Witkin, & Terzopoulos, 1988) need user intervention, and many more autonomous models (e.g., Shashua & Ullman, 1988; Guy & Medioni, 1993; Williams & Jacobs, 1996) suffer from one or another problem. It is thus desirable to find out whether contour enhancement can actually be modeled using just V1 neural elements and operations, or whether contour enhancement in V1 has to be totally attributed to top-down feedback. This article introduces a model of contour enhancement using only V1 elements, based on experimental findings, such as orientation selective cells, local recurrent neural circuits, and finite-range horizontal connections. The model is studied analytically and empirically to understand how sensitivity enhancement in long-range contours is successfully carried out in a network of neurons with finite-range interactions. The network dynamics are analyzed to reveal the temporal synchrony between cells within a contour, as observed in experiments (Gray & Singer, 1989; Eckhorn et al., 1988). Our analysis relates the extent of the contour enhancement and neural synchrony with contour characteristics such as length, curvature, and closure. The model makes a testable prediction about the horizontal connection structure: the postsynaptic cell type is more likely to be excitatory (or inhibitory) if two cells linked by the horizontal connection prefer orientations that are aligned (or orthogonal) to the relative displacement between their RF centers. In addition, this model introduces a mechanism that allows higher visual centers selectively to enhance or suppress contour sensitivities, in addition to the contour enhancement performed within V1. Our work is mainly aimed at modeling the aspects of contour enhancement that are observed in V1. Contour integration is most likely completed by higher visual centers, which are absent in our model. This is in contrast to many other models that aim to build a model with the best possible performance at contour integration rather than to understand how and where it is done in the brain. For instance, this model does not address or define illusory contours, since V1 cells are not as evidently responsive to illusory contours as are V2 cells (von der Heydt, Peterhans, & Baumgartner, 1984; Grosof, Shapley, & Hawken, 1993), and T, L, X junction units, which are not known to exist in V1, are required to detect many types of illusory contours. (However, our model does help to fill in the gaps in incomplete contours; see section 3.) Also, assuming that V1 does not address contours as global objects, the model merely enhances individual contour segments, without defining the saliency of a whole contour. Additionally, a mechanism of feedback control is provided by modeling the feedback signals and specifying their V1 target neurons, but we do not actually model how higher visual centers might respond to V1 outputs in order to construct the desired feedback. The article first presents the relevant experimental background. Then we describe the model and analyze it to show how contours are enhanced, how synchronization happens between contour segments, and how the contour
enhancement and synchronization depend on contour characteristics. The performance of the model is demonstrated by examples. Then we model the top-down feedback and demonstrate selective enhancement, suppression, and the effective segmentation by top-down control. Finally, we place the model in the context of experimental findings and other models, and discuss its limitations and possible extensions. 2 Experimental Background Primary cortical neurons respond to input edges only within their classical receptive fields, which are local regions in the visual field mostly too small to contain any visual object (Hubel & Wiesel, 1962). RF centers are distributed visuotopically on the cortical surface and cells with overlapping RFs, but different preferred edge orientations are grouped together into hypercolumns (Hubel & Wiesel, 1962). Visual stimuli outside a cell’s classical RF, in a region whose size is larger than the RF, can influence the responses of the cell (see Allman, Miezin, & McGuinness, 1985; Knierim & van Essen, 1992). Generally, antagonistic suppression is observed when gratings or textures are presented in the surround (Allman et al., 1985), although surround facilitation (Maffei & Fiorentini, 1976) and orientation contrast facilitation have also been observed (Sillito, Grieve, Jones, Cudeiro, & Davis, 1995). By placing bars in the surround of the RF of a cell and roughly aligning them with a bar presented in the center in its preferred orientation, Kapadia et al. (1995) demonstrated a significant increase in response to the central bar, even when there are additional random stimuli in the background. This enhancement of response decreases with increasing separation or misalignment between the central and surround bars, and is stronger when multiple bars in the surround are aligned with the central bar to generate a smooth contour (Kapadia et al., 1995). Such contextual influences will be the mechanism used for the contour enhancement in our model. Qualitatively similar findings are observed psychophysically in humans under similar stimulus settings (Polat & Sagi, 1994; Kapadia et al., 1995). Human observers can also easily identify a smooth contour composed of individual or even disconnected edge segments among other random edge segments scattered in the background (Field, Hayes, & Hess, 1993). The sensitivity to such contours is enhanced when the contour closes on itself; this is called the closure effect (Kovacs & Julesz, 1993). Also, responses of V1 cells are modulated by visual attention (Motter, 1993), although earlier studies found these effects only in higher visual areas (Moran & Desimone, 1985). Primary visual cortex is composed of many excitatory and inhibitory neurons. Visual input is received mainly onto the dendrites of the excitatory cells, although inhibitory neurons also receive a very small amount of direct visual input (White, 1989; Douglas & Martin, 1990). The excitatory cells send outputs to higher visual areas (Douglas & Martin, 1990), which in turn feed back to V1 (Valverde, 1985; van Essen, 1985). There are several
layers and different groups of excitatory cells (and inhibitory cells), and they are likely to serve different functions; some cells are more concerned with receiving visual inputs, whereas other cells process the signals and send outputs to higher visual centers (Salin & Bullier, 1995). It is not yet clear what the target cell types are for the higher center feedback (Salin & Bullier, 1995). Cortical neurons interact with each other locally and often reciprocally; the excitatory connections extend somewhat longer distances than the inhibitory ones (Douglas & Martin, 1990; White, 1989). These neural interactions typically link neurons with similar RF properties (White, 1989). The anatomical basis for the surround effect has been postulated to be the long-range horizontal connections linking cells up to 4 mm or more apart in the primary visual cortex (e.g., Kapadia et al., 1995; Gilbert, 1992; Allman et al., 1985). These connections emanate from the excitatory pyramidal cells in upper layers and contact both the excitatory and inhibitory postsynaptic cells, enabling monosynaptic excitation and disynaptic inhibition from one cortical site to another (Gilbert, 1992; McGuire, Gilbert, Rivlin, & Wiesel, 1991; Hirsch & Gilbert, 1991; Weliky, Kandler, Fitzpatrick, & Katz, 1995). The axonal fields of these connections are asymmetrical, extending for greater distances along one cortical axis than another (Rockland & Lund, 1983; Gilbert & Wiesel, 1983; Fitzpatrick, 1996). Cells preferring similar orientations tend to be linked (Ts’o, Gilbert, & Wiesel, 1986; Gilbert & Wiesel, 1989; Malach, Amir, Harel, & Grinvald, 1993) whether or not the relative displacements of their receptive field centers are aligned with or orthogonal to their preferred orientations (Gilbert & Wiesel, 1983). The horizontal cortical connections are also implicated in the temporal synchrony of the 40–60 Hz oscillations of neural responses (Gray & Singer, 1989; Eckhorn et al., 1988). Take two neurons with nonoverlapping RFs and aligned optimal orientations. The synchrony of their firing is negligible if two bars sweep over the two RFs independently, is significant when the bars sweep together, and is the strongest when a long, single, sweeping bar extends over both RFs (Singer & Gray, 1995). Usually the degree of neural synchrony decreases with increasing separation between neurons (Singer & Gray, 1995; Eckhorn, 1994). The extent of the oscillatory neural activities is not completely certain (Singer & Gray, 1995). It will be shown in our model that for inputs that contain contours, the strength of the neural oscillation depends on contour characteristics such as length and smoothness. Both the synchrony and the enhancement of responses for aligned edges have been postulated as mechanisms underlying feature linking (Gilbert, 1992; Singer & Gray, 1995; Eckhorn, 1994). 3 The Contour Integration Model This section starts by outlining the model and its neural elements. Then the response of a single edge element is analyzed, and the enhancement of responses that results from interactions between the elements is ana-
Figure 1: Model visual space, hypercolumns, and edge segments. The input space is a discrete hexagonal or Manhattan grid.
lyzed and demonstrated. Next, the dynamics of the model are studied to reveal a tendency toward oscillations and the emergence of temporal coherence between contour elements. Finally, we introduce and demonstrate the mechanism that allows top-down feedback control. 3.1 Model Outline. Visual inputs are modeled as arriving at discrete spatial locations (see Figure 1). At each location i there is a model V1 hypercolumn composed of K neuron pairs. Each pair (i, θ) has RF center i and preferred orientation θ = kπ/K for k = 1, 2, . . . , K, and is called (a neural representation of) an edge segment. Each edge segment consists of an excitatory and an inhibitory neuron that are connected with each other. The excitatory cell receives the visual input; its output quantifies the response or salience of the edge segment and projects to higher visual areas. The inhibitory cells are treated as interneurons. When an input image contains an edge at i with orientation β and input strength Iˆiβ , edge segment iθ receives input Iiθ = Iˆiβ φ(θ − β), where φ(θ − β) = e−|θ −β|/(π/8) is the orientation tuning curve for the cell centered at θ . Segments outside the hypercolumn i receive no input contribution from Iˆiβ . The excitatory and inhibitory cells have membrane potentials xiθ and yiθ , respectively, and their outputs are gx (xiθ ) ≥ 0 and gy (yiθ ) ≥ 0, analog numbers modeling firing rates. Both gx (.) and gy (.) are sigmoid-like nonlinear and nondecreasing functions (see Figure 2). The cells for an edge segment send their outputs gx (xiθ ) and gy (yiθ ) to each other. The excitatory cell excites itself with synaptic strength Jo . Its output gx (xiθ ) is the edge’s response
Its output g_x(x_iθ) is the edge's response to visual inputs. Edge segment jθ′ at another location can excite edge iθ monosynaptically by sending the excitatory signal J_{iθ,jθ′} g_x(x_jθ′) to the excitatory cell in edge iθ, and/or inhibit the edge disynaptically by directing an excitatory signal W_{iθ,jθ′} g_x(x_jθ′) to the inhibitory cell. Here J_{iθ,jθ′} and W_{iθ,jθ′} model the synaptic strengths of horizontal cortical connections. For any visual input pattern I_iθ for all i, θ, the neural dynamics evolve according to:

$$\dot{x}_{i\theta} = -\alpha_x x_{i\theta} - \sum_{\Delta\theta} \psi(\Delta\theta)\, g_y(y_{i,\theta+\Delta\theta}) + J_o\, g_x(x_{i\theta}) + \sum_{j \neq i,\,\theta'} J_{i\theta,j\theta'}\, g_x(x_{j\theta'}) + I_{i\theta} + I_o \qquad (3.1)$$

$$\dot{y}_{i\theta} = -\alpha_y y_{i\theta} + g_x(x_{i\theta}) + \sum_{j \neq i,\,\theta'} W_{i\theta,j\theta'}\, g_x(x_{j\theta'}) + I_c, \qquad (3.2)$$
where 1/α_x and 1/α_y are the membrane time constants and I_c is the background input to the inhibitory cells, which will later be used to model the top-down control signal. I_o is the background input to the excitatory cells and includes a term that normalizes the activity—an inhibition that increases with the total activities in the local edge segments. Finally, ψ(Δθ) is an even function of Δθ modeling inhibition within a hypercolumn and decreases with |Δθ|. When ψ(Δθ) = 0 for Δθ ≠ 0, the inhibitory cells couple only to the excitatory cell in the same edge segment; otherwise, an activated edge exerts some inhibition ∝ ψ(Δθ) on other edges in the same hypercolumn. (Note that this interaction within a hypercolumn does not model the emergence of the cells' orientation selectivity; Somers, Nelson, & Sur, 1995.) Each neuron additionally receives some random noise input. The appendix lists the parameters used in this model. For ease of analysis, and without loss of generality, we use α_x = α_y = 1, and make g_x(·) and g_y(·) piecewise linear functions with threshold and saturation. Also, the excitatory cells have a unit gain g′_x(x) in the operating range. Given an input pattern I_iθ, the network approaches a dynamic state after several membrane time constants, and the response g_x(x_iθ) gives a saliency map. When g_x(x_iθ) at location i is a unimodal function of θ (identifying θ with θ + π), the orientation θ̄_i that would be perceived in higher centers can be modeled by $e^{i2\bar{\theta}_i} \propto \sum_\theta g_x(x_{i\theta})\, e^{i2\theta} / \sum_\theta g_x(x_{i\theta})$. Two edges of different orientations will be perceived to cross each other at location i if g_x(x_iθ) is a bimodal function of θ.

3.2 A Single Edge Element. Before studying the contextual interactions between edges, we first analyze the input response properties of a single edge segment iθ, ignoring the other edges. For simplicity, we omit the subscripts i and θ, and denote I = I_iθ + I_o:

$$\dot{x} = -x - g_y(y) + J_o\, g_x(x) + I, \qquad (3.3)$$
$$\dot{y} = -y + g_x(x) + I_c. \qquad (3.4)$$
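To make equations 3.3 and 3.4 concrete, here is a minimal Euler-integration sketch; the piecewise linear activation breakpoints and all numeric parameters (J_o, gains, thresholds) are placeholders of ours, since the model's actual values live in the article's appendix, which is not reproduced in this excerpt.

```python
def g_x(x, threshold=1.0, saturation=2.0):
    """Piecewise linear excitatory activation with threshold and saturation;
    unit gain in the operating range, as stated in the text (breakpoints assumed)."""
    return min(max(x - threshold, 0.0), saturation - threshold)

def g_y(y, gain=2.0):
    """Piecewise linear inhibitory activation; the gain value is an assumption."""
    return gain * max(y, 0.0)

def simulate_edge(I, Ic=0.0, Jo=0.8, T=20.0, dt=0.01):
    """Euler integration of equations 3.3-3.4 for one isolated edge segment
    (alpha_x = alpha_y = 1, no noise)."""
    x = y = 0.0
    for _ in range(int(T / dt)):
        dx = -x - g_y(y) + Jo * g_x(x) + I
        dy = -y + g_x(x) + Ic
        x, y = x + dt * dx, y + dt * dy
    return g_x(x)

print(simulate_edge(I=1.5))  # steady response of an isolated edge to input I
```

With these example gains the stability condition derived below, g′_y g′_x > J_o g′_x − 1, holds, so the trajectory spirals into the equilibrium point rather than diverging.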
Figure 2: (A) Model neural elements, edge elements, visual inputs, and neural connections. Jo : self-excitatory connection; J: lateral excitatory connection between edge elements; W: lateral disynaptic inhibitory connection between edge elements, implemented as excitatory connections from the excitatory neuron of one edge element to the inhibitory neurons of the others. (B) Activation function gx (x) for the excitatory cells. (C) gy (y) for the inhibitory cells.
The average neural activity is determined by the equilibrium point E = (x̄, ȳ), which is the intersection of the two curves on which ẋ = 0 and ẏ = 0, respectively (see Figure 3B):

$$\dot{\bar{x}} = 0 = -\bar{x} - g_y(\bar{y}) + J_o\, g_x(\bar{x}) + I, \qquad (3.5)$$
$$\dot{\bar{y}} = 0 = -\bar{y} + g_x(\bar{x}) + I_c. \qquad (3.6)$$
Increasing I or I_c raises or lowers the output g_x(x̄). The input sensitivities, determined by solving the linearized version of the above equations, are

$$\frac{\delta g_x(\bar{x})}{\delta I} = \frac{g'_x(\bar{x})}{1 + g'_y(\bar{y})\, g'_x(\bar{x}) - J_o\, g'_x(\bar{x})}. \qquad (3.7)$$
Figure 3: (A) Edge response g_x(x̄) as a function of visual input. Three response curves—one solid and two dashed—are plotted for three different cortical inputs I_c to the inhibitory cell. (B) Equilibrium point and curves for an edge element. The solid curve is the ẋ = 0 curve, and the dashed curve is the ẏ = 0 curve. Increasing I raises the ẋ = 0 curve, and increasing I_c raises, and somewhat deforms, the (monotonically increasing) ẏ = 0 curve, thus changing the equilibrium point (x̄, ȳ). The neurons may approach the equilibrium point after some transient, or oscillate around it, as discussed in section 3.6.
$$\frac{\delta g_x(\bar{x})}{\delta I_c} = \frac{-g'_y(\bar{y})\, g'_x(\bar{x})}{1 + g'_y(\bar{y})\, g'_x(\bar{x}) - J_o\, g'_x(\bar{x})} = -g'_y(\bar{y}) \cdot \left[\frac{\delta g_x(\bar{x})}{\delta I}\right]. \qquad (3.8)$$
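Equations 3.7 and 3.8 are easy to evaluate numerically; a small sketch (the gain values plugged in are arbitrary examples, not the model's):

```python
def sensitivities(gx_prime, gy_prime, Jo):
    """Input gains from equations 3.7-3.8 at an equilibrium point, given the
    local slopes g'_x(x_bar), g'_y(y_bar) and the self-excitation Jo."""
    denom = 1.0 + gy_prime * gx_prime - Jo * gx_prime
    dI = gx_prime / denom                # delta g_x(x_bar) / delta I
    dIc = -gy_prime * gx_prime / denom   # delta g_x(x_bar) / delta I_c = -g'_y * dI
    return dI, dIc

# Example: unit excitatory gain, inhibitory gain 2, Jo = 0.8 (all assumed).
print(sensitivities(1.0, 2.0, 0.8))  # -> (0.4545..., -0.9090...)
```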
In cases that interest us, the output g_x(x̄) increases continuously with visual input I (see Figure 3A), that is, δg_x(x̄)/δI ≥ 0, and this requires g′_y(ȳ) g′_x(x̄) > J_o g′_x(x̄) − 1. The visual input gain δg_x(x̄)/δI is zero below input
threshold, where g′_x = 0; becomes nonzero above threshold, where g′_x = 1; and can decrease beyond threshold when the inhibitory gain g′_y(ȳ) increases. At high I, the activity g_x(x̄) saturates when g′_x = 0 again. Qualitatively, this sensitivity curve corresponds to physiological observations.¹ Note that the threshold input value I and the edge input response curve in Figure 3A change with I_c. This is because, as shown in equation 3.8, increasing I_c decreases the output g_x(x̄). According to equations 3.7 and 3.8, increasing I and I_c simultaneously increases the output g_x(x̄) if ΔI/ΔI_c > g′_y(ȳ), and decreases it otherwise. This leads to the following consequences. First, the visual input could be directed to both the excitatory (increasing I) and inhibitory (increasing I_c) cells, as experimentally observed (White, 1989). Their net effect will be to increase the edge response g_x(x̄) as long as the visual input partition between the two cell types is appropriate. Second, the effect of input from other edges via horizontal connections can be seen as increasing I and/or I_c. Therefore, in general, the net contextual influence on the edge can be facilitatory or suppressive, depending on the relative recruitment of horizontal fibers, as experimentally observed (Hirsch & Gilbert, 1991). Furthermore, since the gain g′_y(ȳ) increases with input level, such contextual influence is more likely to be inhibitory at higher input levels. This is also experimentally observed (Hirsch & Gilbert, 1991; Sengpiel, Baddeley, Freeman, Harrad, & Blakemore, 1995; Weliky et al., 1995). In our model, for simplicity, the visual input is directed solely to the excitatory cells, but this could easily be generalized. The horizontal connections in the model are specified (in section 3.3) such that the net contextual influence from appropriately aligned edges is facilitatory at all contrast levels, although the contextual influence from less aligned edges (these edges can still prefer similar orientations) can depend on stimulus levels. Hence, in this model, the change in the dominance between facilitatory and suppressive contextual influences occurs only for input patterns that are not contours, and is therefore not discussed further in this article.

3.3 Interactions Between Edge Segments for Contour Integration. Edge element (jθ′) excites or inhibits edge element (iθ) by sending an excitatory-to-excitatory output J_{iθ,jθ′} g_x(x_jθ′) or an excitatory-to-inhibitory output W_{iθ,jθ′} g_x(x_jθ′). The goal for the connection structure J_{iθ,jθ′} and W_{iθ,jθ′} is that edge elements within a smooth contour should enhance each other's
1 When the segment has too strong a self-excitation—large J_o g′_x(x̄)—and not enough inhibition—small g′_y(ȳ), such that g′_y(ȳ) g′_x(x̄) < J_o g′_x(x̄) − 1—the system is unstable, and the output g_x(x̄) jumps discontinuously with input I. Such cases are not considered here, as we restrict ourselves to outputs that are continuous functions of inputs for most input values.
activities and that isolated edge elements caused by noisy inputs should be suppressed, or at least not enhanced. Hence:

• The connection J_{iθ,jθ′} will be large if one can find a smooth or small-curvature contour to connect (iθ) and (jθ′), and it generally decreases with increasing curvature of the contour.

• Edge elements will inhibit each other via W_{iθ,jθ′} when they are alternative choices in the route of a smooth contour.

• Both connection types will decrease with increasing distance between the edge segments and become zero for large distances.

• The connections have translation, rotation, and reflection invariance. This means the following: let i − j be the line connecting the centers of two edges (iθ) and (jθ′), which form angles θ₁ and θ₂ with this connecting line. The connections J_{iθ,jθ′} and W_{iθ,jθ′} depend only on |i − j|, θ₁, and θ₂, and satisfy J_{iθ,jθ′} = J_{jθ′,iθ} and W_{iθ,jθ′} = W_{jθ′,iθ}.

Given these requirements, connections J_{iθ,jθ′} and W_{iθ,jθ′} both link cells that prefer similar orientations, as observed in experiments (Gilbert, 1992; Weliky et al., 1995) (see Figure 4). In addition, when the preferred orientations of two linked cells are aligned with the relative displacement of their RF centers, the postsynaptic cell type is more likely excitatory (the connection J_{iθ,jθ′}); when the preferred orientations are more or less orthogonal to the relative displacement of the RF centers, the postsynaptic cell type is more likely inhibitory (the connection W_{iθ,jθ′}). This provides a computational explanation for the puzzling experimental finding (Gilbert & Wiesel, 1983) that some horizontal connections link cells whose preferred orientations and relative RF center displacement do not align but instead are roughly orthogonal to each other. These connections can serve to establish competition between alternative routes of a single contour by contacting inhibitory postsynaptic cells. This prediction of the model (see the appendix for its derivation) about the correlation between postsynaptic cell types and the degree of alignment between the two linked RFs has not been systematically investigated in experiments; a test is thus desirable. This excitatory and inhibitory edge interaction pattern is qualitatively similar to the edge compatibility function in Zucker et al. (1989). Altogether, an edge in a smooth contour mostly receives facilitatory inputs J_{iθ,jθ′} g_x(x_jθ′) and few, if any, inhibitory inputs W_{iθ,jθ′} g_x(x_jθ′) from other edges in the contour. This helps to enhance the response of edges in a contour at any contrast level. Figure 5 demonstrates the contour enhancement and noise suppression using such connections. The higher salience of the contours emerges rather quickly after the onset of the stimulus. For the example in Figure 5A, where the visual input is weak, it takes a duration of three membrane time constants just for individual edges to integrate enough visual input for nonzero responses, but only another single time constant for the contextual input to drive the saliencies
of the edges in the contour significantly higher than the background noise. The strengths and weaknesses of the model are further demonstrated in Figure 5B, where the model is challenged by the difficulties in a natural image. A toy sketch of a connection rule with the qualitative properties required above is given below.
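The article's quantitative J and W are defined in its appendix, which this excerpt does not include; the following sketch is therefore only an illustrative stand-in that encodes the qualitative requirements listed above (smooth-continuation excitation, competition between alternative routes, a distance cutoff, and dependence only on |i − j|, θ₁, θ₂).

```python
import math

def link_strengths(distance, theta1, theta2, d_max=4.0):
    """Toy J (excitatory) and W (inhibitory) link strengths between two edge
    segments whose preferred orientations make angles theta1 and theta2 (in
    [-pi/2, pi/2]) with the line joining their RF centers. The functional
    forms below are illustrative only, not the article's expressions."""
    if distance <= 0.0 or distance > d_max:
        return 0.0, 0.0  # no links at zero or large separation
    fall_off = math.exp(-distance / d_max)
    misalignment = (abs(theta1) + abs(theta2)) / math.pi  # 0 when co-aligned
    J = fall_off * max(0.0, 1.0 - 2.0 * misalignment)     # smooth continuation
    W = fall_off * max(0.0, misalignment - 0.5)           # competing parallel routes
    return J, W

print(link_strengths(2.0, 0.0, 0.0))                  # aligned: J > 0, W = 0
print(link_strengths(2.0, math.pi / 2, math.pi / 2))  # parallel flankers: J = 0, W > 0
```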
3.4 A Straight Line. We can understand the performance of the model by analyzing some examples. In the first example, the visual input is a horizontal line on the x-axis:

$$\hat{I}_{i\theta} = \begin{cases} I_{line} & \text{if } i \text{ is on the line and } \theta = 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.9)$$

The inputs I_c to the inhibitory cells of all edges are assumed to be the same. Let us consider the simplest case, when edge elements outside the line (i.e., when i is not on the x-axis or θ ≠ 0) are silent due to insufficient excitation. Then we can ignore all edges beyond the line, treat this system as one-dimensional, and omit the index θ. Let i denote the (one-dimensional) location of the line segment; then W_ij = 0 and J_ij ≥ 0 for all i, j ≠ i, and

$$\dot{x}_i = -x_i - g_y(y_i) + J_o\, g_x(x_i) + \sum_{j \neq i} J_{ij}\, g_x(x_j) + I_{line} + I_o, \qquad (3.10)$$
$$\dot{y}_i = -y_i + g_x(x_i) + I_c. \qquad (3.11)$$
If the line is infinite, then by symmetry each neuron pair will have the same equilibrium point² E = (x̄, ȳ), determined by:

$$\dot{\bar{x}} = 0 = -\bar{x} - g_y(\bar{y}) + J_o\, g_x(\bar{x}) + \Bigl(I_{line} + \sum_{j \neq i} J_{ij}\, g_x(\bar{x})\Bigr) + I_o \qquad (3.12)$$
$$\;\; = -\bar{x} - g_y(\bar{y}) + \Bigl(J_o + \sum_{j \neq i} J_{ij}\Bigr)\, g_x(\bar{x}) + I_{line} + I_o \qquad (3.13)$$
$$\dot{\bar{y}} = 0 = -\bar{y} + g_x(\bar{x}) + I_c. \qquad (3.14)$$
(3.14)
PThis can be seen as either a single edge with extra external input 1I = ¯ or a giant single “edge” with a stronger self-excitatory con( j6=i Jij )gx (x) P ¯ for each edge nection (Jo + j6=i Jij ) (see Figure 6). Either way, activity gx (x) element in the line is enhanced (see Figures 5 and 7). The same minimum strength of input is required to excite a segment in a line or an isolated edge (see Figure 6A), if all segments in the line are equally 2 This equilibrium solution may or may not be stable, as studied in section 3.6. However, this does not matter, since this solution, stable or not, roughly gives the cells’ average responses.
Figure 4: Model connections in a hexagonal grid visual space. Top and bottom graphs depict, respectively, the connections $J_{i\theta,j\theta'}$ and $W_{i\theta,j\theta'}$ from the center (thick) horizontal edge to other edges, denoted by their edge locations and orientations in the visual field. The quantitative values of the connection strengths $J_{i\theta,j\theta'}$ and $W_{i\theta,j\theta'}$ are described in the appendix. Each edge has the same connection pattern, suitably translated and rotated.
Figure 5: Contour enhancement and noise reduction. (A) Performance for a synthetic image. (B) Performance for an input obtained by edge detection from a natural photo. The input and output edge strengths are denoted proportionately by the thicknesses of the edges. The same format applies to other figures in this article. The model outputs are the temporal averages of $g_x(x)$ over a period of 24 membrane time constants after input onset. In A, all visible edges have the same strength in the input and are differentially enhanced or suppressed at the output. On average, the line and circle segments are roughly 2.5 times as salient as the "noise" segments. For demonstration, we display the outputs after thresholding out the weaker edges (right). Because of the discrete sampling grid, the apparent gaps in the contours sit on no grid point; hence, no gaps exist for the contours, and no filling in is needed. Similar situations occur in other figures in this article. In B, different edges have different input strengths. The weak contour at the chin, the brim of the hat above the eyes, and the top of the hat (and many noisy edges) are in fact subthreshold in input strength; however, they are more or less preserved at the model output. Since this model has only a single scale, finer edges (e.g., details around the eyes) in the photo cannot be sampled or enhanced; a long gap in a contour (e.g., the long weak contour at the brim of the hat above the eyes at low input contrast) is difficult to fill. Aliasing problems (e.g., the left cheek and hair lines) and artificial contour gaps (e.g., in the brim of the hat above the eyes, where no sampling point exists at the contour location), just like those in the circle in A, happen because of the discrete sampling grid and the lack of a signal interpolation algorithm. In addition, two accidentally aligned but different contours (e.g., the contours for the hat, the hair line, and the cheek line near the right side of the image) can threaten to join each other because of the lack of T, L, and X junction signals that should prevent them. This model for contour enhancement also highlights region boundaries and pops out novelties (Li, 1997). Consequently, the edge segments for the hair pieces near the surrounds of the feather (even though the feather region is not very homogeneous), such as the single feather across the brim, are more likely enhanced than those inside the feather. See the discussion in section 4.
excited. This is because the line segments will have zero output before they reach threshold and so cannot excite each other. Therefore, they behave independently as isolated edges before threshold. However, when some line segments receive subthreshold and others superthreshold visual inputs, the former can give nonzero output under contextual excitation from the latter. This leads to subthreshold activation or filling in for the weaker or missing segments in a line (or contour; see Figure 9C).

3.5 Curvature, Contour Length, and Contour Closure. Enhancement in contours other than lines can be understood based on the special case of a contour of constant nonzero curvature, namely, a circle. It is apparent that the analysis for a straight line is also applicable here, assuming for simplicity that the corresponding circle for the contour has a diameter larger than the longest synaptic connection between cells. The index $i$ again denotes the (one-dimensional) location of the segments along the (one-dimensional) contour, and $|i - j|$ the (one-dimensional) distance between the segments $i$ and $j$. The activities of the elements along the contour are analogously enhanced. It can be shown from Figure 6B, after some geometrical calculation, that the response levels for a segment in a contour and an isolated segment differ by a factor $(s_y - s_1)/(s_y - s_2)$, where $s_1 = J_o - 1$, $s_2 = J_o + \sum_{j \neq i} J_{ij} - 1$, and $s_y$ is the slope of the $\dot{y} = 0$ curve. The quantity $\sum_{j \neq i} J_{ij}$ is the sum of the horizontal connection strengths from one contour segment to all others; hence it is larger for smoother contours by design, and thus the response enhancement is also larger for smooth contours.³ Furthermore, since $s_y$ is usually smaller for smaller input strength, the contour enhancement is stronger for low input contrasts, which is the case in some physiological and psychophysical experiments (e.g., Kapadia et al., 1995; Kovacs & Julesz, 1993).

³ It can be shown that this conclusion still holds if some contour segments exert relatively weak, nonzero suppressive components (via the connection W) on other segments, especially in curves of higher curvature. In such cases, a modification can be derived in the expression above for the amount of overall contour enhancement.
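The enhancement factor just derived is easy to evaluate numerically. A small sketch with illustrative numbers (the actual slopes and the connection sum depend on the operating point and on the contour's geometry):

```python
# Contour/isolated response ratio (s_y - s_1)/(s_y - s_2) from section 3.5,
# with s_1 = Jo - 1 and s_2 = Jo + sum_j J_ij - 1.  Numbers are illustrative.
Jo = 0.8            # self-excitation (appendix value)
sum_Jij = 0.6       # total horizontal input; larger for smoother contours

s1 = Jo - 1.0
s2 = Jo + sum_Jij - 1.0
for s_y in (2.0, 1.2):   # slope of the y' = 0 curve; smaller at low contrast
    print(f"s_y = {s_y}: enhancement factor = {(s_y - s1) / (s_y - s2):.2f}")
# The factor grows as s_y shrinks, matching the stronger contour
# enhancement observed at low input contrasts.
```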
Figure 6: Response changes from an isolated edge to edges in contours. All edges in a contour are assumed to receive the same visual input strength. (A) Changes in response curves. The solid curve is the response of an isolated edge, and the dashed curve is the response of edges in contours. (B) Changes in equilibrium curves and points. The two thick curves are the $\dot{x} = 0$ (solid) and $\dot{y} = 0$ (dashed) curves for an isolated edge. Only the $\dot{x} = 0$ curve changes in going from an isolated edge (thick solid curve) to edges in a contour (thin solid curve). Such changes can be seen as caused by either extra excitation $\sum_{j \neq i} J_{ij}\, g_x(\bar{x})$ from neighboring edge segments to a single edge (left figure) or an increase in the self-excitation or curve slope $J_o - 1 \rightarrow J_o + \sum_{j \neq i} J_{ij} - 1$ in a giant "edge." The equilibrium point moves from the lower black dot to the upper one.
Roughly, this model enhances the saliencies in a smooth contour against the background by up to 200–300% near threshold and around 100% at higher input strengths. In a contour of finite length, translation invariance (symmetry between segments) along the contour breaks down near its two ends. There the enhancement is smaller, since the end segments get comparatively less excitation from other segments. In particular, this predicts that a closed or longer contour will have higher saliency than an open or shorter one, as
seen in Figure 7. These predictions are expected to hold also for contours of nonconstant curvature. Note that the line in Figure 7 is actually a closed line, by the periodic boundary condition used in the model's visual space. Since the line has zero curvature, its salience is higher than that of the circle. Note also that edge segments in an open and a closed contour (see Figures 7b and 7c) have roughly the same saliency except near the ends of the open contour, where saliency decreases. This closure effect is weak compared to what is implied by psychophysical observation (Kovacs & Julesz, 1993; see section 4). Exactly how the saliency decays toward the ends depends on how $J_{i\theta,j\theta'}$ decays with intersegment distance and on the longest connection length. Note that the model uses a discrete hexagonal grid in the visual space; the apparent small gaps in the circle and curves are an artifact of the coarseness of the grid.

3.6 Neural Oscillations and Synchrony Between Contour Segments. So far, we have analyzed only the equilibrium points $(\bar{x}, \bar{y})$, which are also roughly the average responses. Upon visual stimulation, the neurons may either approach the equilibrium after a transient phase or sustain dynamic activities about the equilibrium, such as oscillations. Here we show how variations of neural activities about their averages reflect characteristics of the contour. Each edge segment, a pair of connected excitatory and inhibitory cells, can be modeled as a neural oscillator (Li & Hopfield, 1989) oscillating around the equilibrium point $(\bar{x}, \bar{y})$. With interactions between the segments, the oscillators are coupled and exhibit collective behavior reflecting contour char-
Figure 7: Facing page. Model performance for input contours amid noise. Each row is dedicated to one input condition. The left column shows the visual input, the middle column shows the average neural responses, and the right column shows the outputs of the segments as a function of time. The visual inputs are presented at time zero and stay on. The neural signals in the right column are shown superposed. In the graph, the solid curves plot outputs of the segments along the contour, away from its ends; the dash-dotted curves plot the outputs of segments near the contour’s ends; and the dashed curves, usually the lowest-lying ones, are the output for a single noise segment. Note the decrease in average neural activities, the amplitudes of the oscillations, and the synchrony between the activities of the segments, as the contours become open, shorter, or more curled, or for segments near the contour ends. The line in the top row is actually a closed line by the periodic boundary conditions, and so by having zero curvature, its salience is higher than that of the circle. Also, the discrete hexagonal grid visual space gives apparent small gaps in the circle and curves because these “gaps” fall on no grid points; no filling in is needed. Time is in units of the cell membrane time constant, and so a time constant of 5–10 msec would lead to oscillation frequencies in the figures around 10–40 Hz.
acteristics embedded in the coupling. To analyze the dynamics, denote for simplicity $x_i - \bar{x}_i \rightarrow x_i$, $y_i - \bar{y}_i \rightarrow y_i$, and write as vectors $X = (x_1, x_2, \ldots)^T$ and $Y = (y_1, y_2, \ldots)^T$. For small $X$ and $Y$, we approximate by a linear expansion of equations 3.10 and 3.11 about the equilibrium point $(\bar{x}_i, \bar{y}_i)$:

$$\dot{X} = -X - G'_y Y + JX \tag{3.15}$$

$$\dot{Y} = -Y + G'_x X, \tag{3.16}$$
where⁴ $J$ is a matrix with elements $(J)_{ij} = J_o\, g'_x(\bar{x}_j)$ if $i = j$ and $(J)_{ij} = J_{ij}\, g'_x(\bar{x}_j)$ otherwise; $G'_x$ and $G'_y$ are diagonal matrices with elements $(G'_x)_{ii} = g'_x(\bar{x}_i)$ and $(G'_y)_{ii} = g'_y(\bar{y}_i)$. For a contour with a constant curvature (i.e., a circle), all its segments have the same equilibrium point if they receive the same input strength, $\bar{x}_i = \bar{x}_j \equiv \bar{x}$ and $\bar{y}_i = \bar{y}_j \equiv \bar{y}$. Then $G'_x$ and $G'_y$ are proportional to the identity matrix, and $J$ is symmetric (since we imposed symmetry along contour directions, i.e., $J_{ij} = J_{ji}$). Then $J$ has an orthogonal set of eigenvectors $\{X^k\}$ and real eigenvalues $\lambda^k$ for $k = 1, 2, \ldots$, which we order such that $\lambda^1 \geq \lambda^2 \geq \cdots$. Taking $\{X^k\}$ as the new basis to represent $X$ and $Y$, we have $X = \sum_k x^k X^k$, $Y = \sum_k y^k X^k$, and

$$\dot{x}^k = -x^k - g'_y(\bar{y})\, y^k + \lambda^k x^k \tag{3.17}$$

$$\dot{y}^k = -y^k + g'_x(\bar{x})\, x^k, \tag{3.18}$$
which has the solution

$$x^k(t) = x^k(0)\, e^{-(1 - \lambda^k/2)\,t} \cos(\omega^k t + \phi^k), \tag{3.19}$$

where the oscillation frequency is $\omega^k = \sqrt{g'_y(\bar{y})\, g'_x(\bar{x}) - (\lambda^k)^2/4}$, and the initial
conditions determine the amplitude $x^k(0)$ and the phase $\phi^k$. The exponential in equation 3.19 suggests that the system will be dominated by the first oscillation mode⁵ $X^1$, since $\lambda^1 \geq \lambda^k$ for all $k$. The relative oscillation amplitudes and phases of the segments in a contour are determined by the components of the complex vector $X^1$.

Let us suppose for simplicity that the edge segments concerned are in the linear operating region, where $g'_x(\bar{x}) = 1$.

⁴ Here we take for simplicity that contour segments do not link to each other with the connection $W_{i\theta,j\theta'}$. The analysis here needs a little modification, but the general conclusion still holds when W connections are included.

⁵ From the analysis in the next paragraph, it will be apparent that it is unlikely to have degeneracy in the first two modes (the case when $\lambda^1 = \lambda^2$).
For a contour of constant curvature with uniform inputs to its segments, the matrix $J$ is Toeplitz (i.e., $J_{ij} = J_{(i+a) \bmod N,\, (j+a) \bmod N}$ for all $a$, where $N$ is the contour length or matrix dimension) under the translation and rotation invariance of the model, and it has nonnegative elements. It can then be shown that the eigenvectors are the cosine and sine waves along the contour, that is, $X^k_j \propto e^{i f_k j}$ with spatial frequency $f_k$, and the eigenvalues are the corresponding Fourier coefficients of the row vector of the $J$ matrix. In particular, the eigenvector $X^1$ is the zero-frequency Fourier wave; hence, all components of $X^1$ are equal, $x^1_i = x^1_j$, and thus all segments in a contour oscillate with the same amplitude and phase. The eigenvalue for this mode is the zero-frequency Fourier coefficient, and thus $\lambda^1 = J_o + \sum_{\text{all } j \neq i \text{ on contour}} J_{ij}$. We can therefore relate $\lambda^1$ to the characteristics of the contour, as reflected by the connections $J_{ij}$ along its length. It follows that the strength of the oscillation is largest in a long line, decays with increasing contour curvature (or decreasing contour length for circles), and is weakest for an isolated edge. The isolated edge is a special case with a single oscillator ($X^1$ is a scalar), with $\lambda^1 = J_o$. When $\lambda^1 < 2$, the oscillation is damped and disappears after some transient phase. Otherwise, if segment couplings are sufficiently strong, the oscillations⁶ will grow until the nonlinearity invalidates the linear analysis and constrains the oscillation to a finite amplitude (see the activities for the line segments in Figure 7).

These predictions are expected to hold approximately for general contours with nonconstant curvatures or nonuniform inputs. However, because the translation invariance is compromised in these cases (i.e., $J$ deviates from being Toeplitz and symmetric), some differences in oscillation amplitudes and relative phases are expected. Similarly, nonzero relative phases may emerge between edge segments near the end of a contour, or between such segments and those near the middle of the contour.

The role of $\lambda^1$ in the oscillation frequency $\omega^1 = \sqrt{g'_y(\bar{y})\, g'_x(\bar{x}) - (\lambda^1)^2/4}$ suggests a correlation between stronger contours and lower oscillation frequencies, as is the case when comparing the circle with the line in Figure 7. However, in strong, sustained oscillations, the small-amplitude linear approximation no longer holds. The nonlinearity greatly influences the frequency and makes this prediction imprecise. With strong visual input and contour enhancement, oscillation can be completely suppressed by the nonlinearity near the saturation region. In realistic neural systems, however, saturation can be prevented by gain control or adaptation for large inputs or activity levels.

Synchronization within a contour will happen even when the visual inputs are turned on at different times for different contour segments. On the other hand, synchrony is rare between different contours even when their visual inputs are turned on simultaneously. These are both demonstrated

⁶ For large enough $\lambda^1$, the oscillation frequency from the linear analysis, $\omega^1 = \sqrt{g'_y(\bar{y})\, g'_x(\bar{x}) - (\lambda^1)^2/4}$, is imaginary, so the local dynamics about the equilibrium point is in fact not oscillatory. However, the global nonlinear dynamics can be shown to be oscillatory provided that the system is not in the region of unstable self-amplification, that is, $g'_y(\bar{y})\, g'_x(\bar{x}) > \lambda^1 - 1$.
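The mode analysis can be checked numerically: build a circulant (wraparound Toeplitz) $J$ for a closed contour, confirm that the leading eigenvector is uniform, and read the damping rate and oscillation frequency off equation 3.19. A sketch with placeholder connection strengths and gain slopes:

```python
import numpy as np

N = 36                                # segments in a closed contour
Jo = 0.8                              # self-excitation (appendix value)
row = np.zeros(N)                     # first row of the circulant matrix J
row[0] = Jo
for k in (1, 2, 3):                   # placeholder neighbor couplings
    row[k] = row[-k] = 0.1 / k
J = np.array([np.roll(row, i) for i in range(N)])   # circulant, symmetric

lams, vecs = np.linalg.eigh(J)
lam1, v1 = lams[-1], vecs[:, -1]      # leading eigenvalue and eigenvector
print("lambda1 =", lam1, "= Jo + sum_j J_ij =", row.sum())
print("leading mode uniform?", np.allclose(v1, v1[0]))

gx_p, gy_p = 1.0, 0.5                 # illustrative gain slopes at equilibrium
print("damping rate 1 - lambda1/2 =", 1 - lam1 / 2)      # eq. 3.19 exponent
omega_sq = gy_p * gx_p - lam1 ** 2 / 4
print("oscillation frequency =", np.sqrt(omega_sq) if omega_sq > 0 else None)
```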
These are both demonstrated in Figure 8; segments within a line quickly reach synchrony after asynchronous stimulus onset, but circle segments and line segments fall out of synchrony within two oscillation cycles after synchronous stimulus onset. In fact, when two contours are nearby, they tend to desynchronize, largely because of the normalizing neural interactions (see the appendix) and the mutual suppressive couplings $W_{i\theta,j\theta'}$ between them. The dynamic coupling between two contours also causes frequency shifts. Furthermore, the nature and synchrony of the oscillations of the weaker contour tend to be distorted by the stronger contour, such that some of the contour segments oscillate in a nonsinusoidal manner (see Figure 8B). We do not discuss the details of such dynamic coupling further because they are not used for contour integration in this model. However, synchronization within a contour and desynchronization between contours can be exploited for the purpose of contour segmentation (see section 4).

3.7 Control of the Contour Saliency by Top-Down Feedback: Selective Contour Enhancement/Suppression, Filling In, and Contour Segmentation. This section shows that our model of V1 provides a mechanism by which higher visual areas can selectively influence the response to given contours. We simply assume that the higher centers already know which segments belong to a contour and what feedback signals to send back. In the model, the influence from higher centers is additional to, but not necessary for, the contour-enhancing capabilities of the V1 neural circuit. Higher visual areas are modeled as sending $I_c$ as a feedback signal to the inhibitory cells, to influence the edge outputs according to equation 3.8:

$$\frac{\delta g_x(\bar{x})}{\delta I_c} = -\,\frac{g'_y(\bar{y})\, g'_x(\bar{x})}{1 + g'_y(\bar{y})\, g'_x(\bar{x}) - J_o\, g'_x(\bar{x})}. \tag{3.20}$$
Let $I_c = I_{c,\text{background}} + I_{c,\text{control}}$. The background input $I_{c,\text{background}}$ is the same for all edge segments and can be used to modulate the overall level of visual alertness. Higher areas can selectively enhance or suppress a given contour by providing a negative or positive $I_{c,\text{control}}$ (i.e., decreasing or increasing $I_c$) for the selected contour segments $(i\theta)$. A strong enough $I_{c,\text{control}} > 0$ on a given contour can completely suppress the outputs from that contour, leading to the effective removal or segmentation of this contour from other contours in the visual input.

It is not yet clear from experimental data which cells in V1 should be the target of feedback (Salin & Bullier, 1995; also see section 4). This model chooses the inhibitory interneurons as the targets, for the reason that it is computationally desirable not to mix visual inputs from the external world with internal feedback signals. Given that the excitatory cells are the visual input neurons in this model, directing the feedback $I_{c,\text{control}}$ to the excitatory cells would lead to a confusion between the internal and the external.
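To illustrate the use of equation 3.20, the sketch below evaluates the (negative) output sensitivity to $I_c$ at an equilibrium point and the approximate output changes produced by the control values used for Figure 9A. The gain slopes are illustrative assumptions; the control magnitudes are the ones quoted in the figure caption.

```python
Jo = 0.8          # self-excitation (appendix value)
gx_p = 1.0        # g_x'(x) at equilibrium: linear operating region (assumed)
gy_p = 0.5        # g_y'(y) at equilibrium (assumed)

# Eq. 3.20: sensitivity of the edge output to the inhibitory-cell input Ic.
dgx_dIc = -(gy_p * gx_p) / (1.0 + gy_p * gx_p - Jo * gx_p)
print("d g_x / d Ic =", round(dgx_dIc, 3))   # negative: raising Ic suppresses

Ic_background = 1.0                           # appendix value
for label, ctrl in [("suppress line  ", Ic_background / 3),
                    ("enhance circle ", -Ic_background / 5)]:
    print(label, "approx. output change:", round(dgx_dIc * ctrl, 3))
```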
Figure 8: (A) Synchrony within a contour even with asynchronous visual onset. Neural outputs for line segments as in the top row of Figure 7, except that the visual input for the 17 left-most line segments starts at t = 0 (left arrow), whereas the input for the 18 right-most line segments starts at t = 7 (right arrow). The solid curve and the dashed curve correspond to one of the 17 left-most and one of the 18 right-most segments, respectively; all segments in each group have roughly the same outputs. Note that synchrony is achieved within one oscillation cycle after the onset of the later segments. Compare with the top row of Figure 7. (B) Desynchronization between contours. Neural output based on visual input as in Figure 9A, where a circle and a line overlap and their inputs start simultaneously at t = 0. The edge outputs are plotted superposed, with solid curves for the line segments and dashed curves for the circle segments. Note that the line segments oscillate at lower frequencies and are more synchronized (the solid curves for different segments are almost identical). The circle segments differ somewhat in their oscillation amplitudes and phases, especially after two oscillation periods, much more so than they would if the line were absent (cf. Figure 7). This is because the strong activities in the line segments interfere with the activities in the circle segments close to the line via local neural interactions. The coupling between the two contours also distorts the oscillation frequencies.
Since it targets the interneurons in this model, the feedback cannot completely substitute for visual stimulation. As is evident, $I_c$ is effective only when $g'_y(\bar{y})\, g'_x(\bar{x}) \neq 0$. Without visual input or excitation from other segments, an edge segment's membrane potential $\bar{x}$ is below threshold and $g_x(\bar{x}) = 0$. Decreasing or removing the feedback $I_c$ merely reduces or removes the inhibition onto this segment and is not enough to activate the excitatory cell beyond threshold when the background (nonvisual) input $I_o$ to the excitatory cell is too weak. However, with some, even subthreshold, visual input $I_{i\theta}$ or contextual excitation from neighboring edges, an edge segment can increase its activity or become active if $I_c$ is reduced or removed. Therefore, this model can enhance and complete, or fill in, a weak and incomplete input contour under the control of the feedback (see Figure 9C), but it cannot enhance or "hallucinate" any contour that does not at least partially exist in the visual input $I$ (see Figure 9B). See section 4 for a more detailed discussion of the targets of the feedback, the related experimental findings, and computational considerations.

Figure 9 demonstrates central control. Without central control, $I_{c,\text{control}} = 0$, a visual input consisting of two contours, a circle and a line, and some noise segments results in a salient line and circle and less salient noise segments (see Figure 9A). By adding $I_{c,\text{control}} > 0$ for the line and $I_{c,\text{control}} < 0$ for the circle, the saliency of the line is suppressed, and the circle becomes most salient. If the feedback control $I_{c,\text{control}} > 0$ for the line is strong enough, then its neural activity can be completely eliminated, effectively segmenting it away from the circle (see Figure 9A). Without central control, the gaps in an input line are partially filled in by the excitation from other line segments (see Figure 9C); with reduced $I_c$ on the line, the initially fragmented line becomes almost completely filled in (see Figure 9C).

4 Summary and Discussion

4.1 Summary of the Motivation, Components, and Performance of the Model. Although experimental data suggest that contour enhancement may be first attempted in V1 (Kapadia et al., 1995), it has been difficult to model it with a realistic model of V1, largely because of the inherent dynamic stability problems in recurrent neural networks. The V1 network has to be sensitive enough to amplify all possible smooth contours it may receive, and at the same time it should not have uncontrollable sensitivities that would amplify contours and noise indiscriminately. Such difficulties have until now prevented a realistic V1 model of contour enhancement. It is therefore desirable to find out whether contour enhancement can indeed be accomplished by V1, and thus be modeled by a realistic V1 circuit, or whether all the observed contour sensitivity enhancements in experiments (Kapadia et al., 1995) have to be attributed to higher visual areas and their feedback to V1. It is one of the goals of this study to demonstrate and understand how contour enhancement can indeed be reasonably carried out in a model using only the known elements of V1. Accordingly, our model uses
Figure 9: Central feedback control. (A) Without central control, both the line and the circle in the input (top left) are suitably enhanced against the background (top right). The bottom left graph demonstrates the selective enhancement of the circle by $I_{c,\text{control}} = -I_{c,\text{background}}/5$ and suppression of the line by $I_{c,\text{control}} = I_{c,\text{background}}/10$ on the respective edge segments. Stronger line suppression by $I_{c,\text{control}} = I_{c,\text{background}}/3$ can effectively remove all responses to the line (bottom right). (B) Lack of hallucination: applying central control $I_{c,\text{control}} = -I_{c,\text{background}}/5$ to enhance both a nonexisting line and the existing circle enhances only the circle; the nonexisting line cannot be created. (C) Gaps in the line of the input (left) are partially filled in by the contour enhancement mechanism in V1 (center) and more completely filled in by the central control to enhance the contours with $I_{c,\text{control}} = -I_{c,\text{background}}/3$ (right).
orientation-selective cells, a local neural circuit with recurrent interactions between excitatory and inhibitory cells, and the particular connection patterns suggested by experimental evidence (Gilbert, 1992; Hirsch & Gilbert, 1991; Weliky et al., 1995; White, 1989; Douglas & Martin, 1990; Kapadia et al., 1995). The neural interactions in the model enhance the cell activities for edge segments belonging to smooth contours against a background of random edge segments and induce synchronized oscillatory neural activities between segments within a contour. We show analytically and empirically that the extent of contour enhancement and neural synchrony is stronger for longer, smoother, and closed contours. These behaviors of the model are consistent with experimental observations (Kapadia et al., 1995; Field et al., 1993; Kovacs & Julesz, 1993; Gray & Singer, 1989; Eckhorn et al., 1988). In addition, this model introduces a mechanism that allows higher visual areas to feed back and selectively enhance or suppress activities for given contours, and even to achieve a crude form of contour segmentation.

This model makes the following testable predictions, which have not been systematically investigated experimentally: (1) the horizontal cortical connections from the excitatory cells should more likely contact excitatory or inhibitory postsynaptic cells if the two linked cells have their preferred orientations roughly parallel or orthogonal, respectively, to their relative RF displacement; (2) the strength of neural oscillation, as well as neural synchrony, should increase with contour length, smoothness, and closure.

For analytical tractability and simplicity, the model adopts the following idealizations: a 1:1 ratio between the excitatory and inhibitory cell numbers, the lack of connections between the inhibitory cells, and the lack of direct visual input to the interneurons. Without essential changes to the model performance, these idealizations can be relaxed to give additional complexities in model behavior. For instance, each model cell should be seen as modeling a local group of cells of similar types. Hence, the 1:1 ratio between the excitatory and inhibitory model cell numbers is really a ratio between local cell groups, and the recurrent local connections between them model the recurrent connections in the local cell groups. Also, introducing direct visual inputs to the inhibitory cells can give additional input gain control to allow a larger dynamic range for the system.

The recurrent local interactions between excitatory and inhibitory cells used in this model have long been part of a "basic circuit" for cerebral cortical organization (Shepherd, 1990). They have been used, for instance, in a model of the olfactory bulb for odor recognition and segmentation (Li & Hopfield, 1989; Li, 1990). A closely related version of this circuit is also used in a model of visual cortical RFs and surround influences (Somers, Todorov, Siapas, & Sur, 1995). Our model requires a neural connection structure with both a colinear excitatory component $J_{i\theta,j\theta'}$ and a flanking, or orthogonal, disynaptically inhibitory component $W_{i\theta,j\theta'}$. A similar connection structure, where the inhibitory component is modeled monosynaptically, is proposed by Zucker
et al. (1989). A connection structure like our colinear excitatory connection component $J_{i\theta,j\theta'}$ is termed an "association field" by Field et al. (1993) and is generic to many contour enhancement models (e.g., Zucker et al., 1989; Braun et al., 1994; Yen & Finkel, 1997). Experimentally, however, cortical cells have horizontal axonal fields that extend orthogonally as well as parallel (colinearly) to the preferred orientation of the cells (Gilbert & Wiesel, 1983). The "association field"-like connections can account only for those axons that extend in a roughly parallel direction. Our model suggests that the orthogonally extending axons (see Figure 4) should be found to contact preferentially inhibitory postsynaptic cells (the connection $W_{i\theta,j\theta'}$), for the computational purpose of mediating competition between alternative routes of a contour. The different functions served by the different branch directions of the horizontal axons lead naturally to the anisotropic horizontal axonal fields observed in V1 (Gilbert & Wiesel, 1983; Fitzpatrick, 1996). The quantitative degree of anisotropy in the axonal field is not crucial in this model. The predicted correlation between axon directions and postsynaptic cell types has yet to be tested experimentally.

4.2 Higher Center Feedback. The feedback control mechanism in the model may also relate to the attentional effects observed in V1 cells (Motter, 1993). It has the desirable property that while higher areas can enhance input contours, and even fill in the gaps in an incomplete contour, they cannot create a contour that does not exist in the visual input. This property could be exploited by the higher visual centers to test hypotheses about the visual input and cooperate with V1 to reconstruct a coherent percept.⁷ A complete model of top-down, bottom-up cooperation should include a mechanism by which higher areas can respond to V1 outputs and construct the requisite top-down control signals. This mechanism is left for future work.

Experimental data have suggested both excitatory (Mignard & Malpeli, 1991; Nault, Michaud, Morin, Casanova, & Molotchnikoff, 1990) and inhibitory (Alonso, Cudeiro, Perez, Gonzalez, & Acuna, 1993; Fitzpatrick, personal communication, 1996) effects of feedback signals. Most data on feedback fibers show that feedback terminals synapse onto dendritic spines (Rockland, personal communication, 1996; Johnson & Burkhalter, 1992), which are usually associated with excitatory cells, though earlier observations by Johnson and Burkhalter (1991) suggest that feedback fibers terminate near inhibitory interneurons. Although the evidence is not consistent or clear-cut (Salin & Bullier, 1995), some of it points in the direction opposite to our model construction, in which higher-center feedback is directed to the inhibitory interneurons. Computationally, I believe that it is important not
⁷ In this model, "hallucination" is prevented by setting the background input $I_o$ sufficiently low such that excitation by $I_o$ alone, without visual input, is impossible even with feedback enhancement. Hallucination could occur if $I_o$ were large enough.
to mix bottom-up visual input signals with top-down feedback ones. Hence, the feedback fibers should avoid the input neurons, which in this model are the excitatory cells. Indeed, in the brain, feedback fibers generally avoid cortical layer 4, which is the input layer (Salin & Bullier, 1995). Experimental evidence also suggests that top-down feedback to V1 can modulate V1 activities but cannot substitute for visual input to activate V1 (Salin & Bullier, 1995). If hallucinations are to be avoided in the visual system, then, as shown in section 3, it helps to send the feedback via the inhibitory interneurons, as is done in this model. More realistically, V1 has different layers and groups of excitatory cells. Different excitatory cell groups are likely to serve different functions, and "hallucinations" may be avoided when the feedback fibers target other excitatory cells and avoid the excitatory cells in the input layer. It is also likely that top-down feedback first contacts the noninput excitatory cells, which then transform the signals to the inhibitory interneurons. Such signal transformations may be needed in order to render the visual representations of the higher areas in terms of the representation in V1. It is also likely that the feedback effects are dynamically modulated and can be excitatory or inhibitory, depending on the levels of neural activity and contextual conditions that are beyond the current model. More consistent and informative experimental data are needed to guide our further understanding.

It is more clearly established in the lateral geniculate nucleus (Salin & Bullier, 1995) and the olfactory bulb (Shepherd, 1990) that higher-area feedback terminates mostly onto the inhibitory interneurons. It will be interesting to explore the extent of the universality of computational mechanisms across stages and sensory modalities. An analogous model of higher-center control and olfactory segmentation has been proposed for the olfactory bulb (Li, 1990).

4.3 Limitations and Extensions of the Model. As a first attempt to model contour integration using V1 components, this model has many weaknesses. The closure effect in particular is much weaker than what is implied by psychophysical observations (Kovacs & Julesz, 1993). It is only in the edge segments near the ends of the open contour that saliencies decrease, by up to 20% to 30% (see Figure 7). Saliencies of the segments far from the ends are comparable to those in a closed contour. It is not clear whether a stronger closure effect can be achieved by more sophisticated neurons and their interactions, or by introducing mechanisms beyond V1 and the scope of this model. It is conceivable that the closure effect could be made stronger by introducing the notion of object surfaces and adding surface saliency to a region bounded by a closed curve. Such means are likely to involve higher visual centers. It is desirable to find out experimentally whether the closure effect, so far investigated only psychophysically, is significant in V1 without the involvement of higher-center feedback. Another noticeable fact is that if saliencies are proportional to the temporally averaged neural activities, the quantitative changes in saliency with contour length, curva-
ture, and closure for reasonably smooth contours are not very significant. This could be a weakness or a strength of this model, depending on whether one desires sensitivity or robustness of the model performance. However, if one uses the maximum neural activity over a time window as a measure of saliency, then saliencies change more significantly with contour characteristics. This model addresses only the local saliencies of individual edge segments. One may conceivably use synchrony between segments to obtain the global saliency of a whole contour. That should make the contour saliency more sensitive to the contour characteristics. However, this is beyond the scope of this article.

This model uses an idealized image sampling grid at a single scale, making it difficult to handle cases when contours and their locations are defined at multiple scales, as is the case in many natural images. This difficulty is noticeable and explained in Figure 5 for the photo input. Substantial work, at least on the scale of simulations, will be required to make the model multiscale. Figure 5B also makes it clear that the problems of image sampling and signal interpolation and interpretation, though beyond the contour enhancement model and probably outside V1, should be solved to give better inputs to the model and to interpret the outputs better.

In addition to orientation and spatial location, RFs in V1 are tuned for motion direction, motion speed, disparity, ocularity, scale, and color (Hubel & Wiesel, 1962; Livingstone & Hubel, 1984). Object contours can thus exist in all these dimensions. The current model can be extended to the stereo, time, and color dimensions. The extended model will link edge segments with compatible selectivities in these dimensions, as well as aligned orientations. Indeed, experiments reveal that horizontal connections tend to link cells with similar RF properties in dimensions other than orientation preference (e.g., Gilbert, 1992) and that activities of cells with similar ocular dominance, color selectivity, and other complex RF properties tend to be correlated (Ts'o et al., 1986; Ts'o & Gilbert, 1988; Singer & Gray, 1995). Such an extension has not been carried out yet.

The interactions designed to instantiate contour integration can be explored to see whether they also account for other visual contextual phenomena such as figure-ground and texture segmentation, the tilt illusion, and noncontour perceptual filling-in (e.g., Gilbert, 1992; Field et al., 1993). Given its significant simplifications, I expect that the model will not explain many visual phenomena beyond contour integration. For example, it does not capture the highly flexible gain control and receptive field adaptation observed in the primary visual cortex (Gilbert, 1992). However, some model limitations can be interpreted in other ways. For example, experiments on contextual influences on the orientation selectivity of cat V1 cells (Gilbert & Wiesel, 1990) have found some data consistent and other data inconsistent with the classical perceptual tilt illusion phenomena. This model also displays similarly ambiguous results, depending on how the contextual stimuli are placed in the visual fields.
Recent experimental evidence suggests that V1 cells contribute to figure-ground distinctions (Zipser, Lamme, & Schiller, 1996; Lamme, 1995), a fundamental problem that has been addressed by many models (e.g., Kienker, Sejnowski, Hinton, & Schumacher, 1986). Although our model was originally aimed at grouping edges into contours, my more recent studies find that the model can also signal figure-ground differences and contribute to region grouping and the phenomenon of pop-out. Since boundaries and regions are complementary to each other, it is reasonable to expect or require a model of boundary enhancement to signal region differences. This potential of the model has yet to be fully explored; some early results can be found in Li (1997).

4.4 Relating to Previous Models. Many other models are related to at least parts of our contour integration model. We relate our model to the most representative and relevant ones and acknowledge that many more models exist and can be found, for instance, in the references cited in this article.

One class of models addresses the underlying computation without using biologically plausible model elements, interactions, or algorithms. Shashua and Ullman (1988) modeled the perceptual saliencies of contour elements using a simple iterative network with local interactions between image elements. The model performs well, and Figure 5A of this article is patterned after an example of theirs. Guy and Medioni's model (1993) lets local image features vote on underlying global contours. The contours are then extracted by combining the votes with several methods of combination. Motivated by the existence in V2 of cells responsive to illusory contours (von der Heydt et al., 1984), Heitger and von der Heydt (1993) suggested a model that infers illusory occluding contours from T-junctions, corners, and line ends extracted from the image. This inference requires a highly nonlinear operation to check the consistency of the occluding contour interpretations based on "end-stopped" signals. A recent model by Williams and Jacobs (1996) generates in the image plane a probability distribution over all the contours that can join two separated edge fragments, modeling the linking process as a random walk in position and orientation in the image plane. This model can be implemented in a network with local, albeit nonneural, interactions. However, the network has to know which edges to join before it can complete a contour between them.

Another class of models is closer to neurobiology. For example, the model by Grossberg and Mingolla (1985) aims at boundary or contour completion. The algorithm requires dipole fields in V1 that prevent two perpendicular contours from crossing each other and bipolar cells in V2 that provide cooperative feedback. The neural basis for the dipole fields is not clear, and contour enhancement necessarily requires feedback from V2. The model by Zucker et al. (1989) used a relaxation labeling algorithm (Hummel & Zucker, 1983) to infer contour segments from an initial image measurement resembling that in V1. The edge segments are labeled by their local orientations and curva-
tures. Two segments excite or inhibit each other depending on whether they are compatible with each other, in a way similar to our connection structure J and W. The orientation and curvature labels are updated iteratively by the compatibility interaction. The algorithm performs well after two iterations and should converge to some final configuration (Hummel & Zucker, 1983). Braun et al. (1994) suggested a model for contour integration that maps well onto human contour perception for the same visual displays. This model requires fast-adapting neural synapses and dendritic gating, which have yet to be confirmed by experiments. The model by Yen and Finkel (1996) uses association field-like horizontal connections in its network, and it works well. However, the model algorithm includes nonneural operations, such as a global normalization of unit responses after each network iteration and a rule-based algorithm for interunit synchronization.

Compared with these models, we stress the restriction to V1 elements and operations. Only by doing so can we ascertain whether contour enhancement can really be first attempted in V1 or has to be attributed to top-down feedback. Such considerations also account for our decision not to employ in this model image signals from T-junctions and corners, for there is no evidence for such units in V1. Consequently, this model enhances existing weak contours rather than inferring (invisible) foreground occluding (illusory) contours from visible image signals in the background. Furthermore, we study the model analytically in addition to the empirical simulation study, so as to grasp the relationship among the neural interactions, contour characteristics, and the corresponding model behavior, and to overcome the formidable dynamic stability problem for reliable contour enhancement in a recurrent neural network model of V1.

This model also avoids some of the undesirable features associated with some other models. For instance, this model does not give unreasonably high saliencies to short segments of contours that are attached smoothly to long and smooth contours. This is a problem for Shashua and Ullman's model (1988), for example, partly because an edge is defined in their model as directed; that is, an edge of orientation $\theta$ is different from another edge of orientation $\theta + \pi$ at the same location. Such directed edges also appear in other nonbiological models, such as Williams and Jacobs (1996), chiefly to suit the particular contour enhancement algorithms. Our model is also unique in addressing the V1 response to top-down feedback, not as a requirement for contour integration but as an additional feature that is simple but computationally powerful.

4.5 Neural Oscillations. Oscillations do not seem to be essential for solving the computational problem of contour enhancement and could be just an epiphenomenon. They are intrinsic properties of the recurrently interacting excitatory and inhibitory cell populations used in our model. On the other hand, since the strength of the oscillations and the degree of synchrony do correlate with characteristics of the contours, the extra information carried in
the oscillation could be exploited for other computations such as feature linking (von der Malsburg, 1981). This model predicts weaker and more transient oscillations for shorter or weaker contours or for isolated edges. This may explain the failure to observe oscillatory neural behaviors in some experiments (Singer & Gray, 1995). There are many other models of cortical neural oscillations and their possible roles in feature linking and segmentation (e.g., Baldi & Meir, 1990; Sporns, Tononi, & Edelman, 1991; König & Schillen, 1991; von der Malsburg & Buhmann, 1992; Sompolinsky, Golumb, & Kleinfeld, 1991; Murata & Shimizu, 1993; Wang, 1995). Some model neural oscillators by oscillation phase variables only (e.g., Sompolinsky et al., 1991; Baldi & Meir, 1990), making the actual neural activity levels unavailable for studying feature enhancement. Our oscillator, with its interconnected excitatory and inhibitory cells, is suggested by models of the olfactory bulb (Freeman, 1987; Li & Hopfield, 1989). Such neural oscillator models are also employed by many visual cortical models in which both the phases and the amplitudes of the oscillations can be studied (e.g., Sporns et al., 1991; König & Schillen, 1991; von der Malsburg & Buhmann, 1992; Murata & Shimizu, 1993; Wang, 1995). However, these visual cortical models do not address the contour integration problem.

In summary, we have introduced a biologically plausible model of contour integration in V1. The model exhibits experimentally observed behaviors and makes testable predictions about V1 anatomy and physiology. Extensions of the model to other visual input dimensions can be explored. This model also provides an analytical framework to study the neural dynamics and other visual computations in V1 (e.g., Li, 1997). Some of the properties and mechanisms of the model can be exploited to study other computational problems such as hypothesis testing through top-down control, feature linking, and figure-ground segmentation.

Appendix

This appendix gives the detailed model parameters and a derivation of the model connection structure.

A.1 Model Parameters. The number of orientations or edge elements at each spatial grid point is K = 12. Equations 3.1 and 3.2 determine the model dynamics. The gain functions for the neurons are

$$g_x(x) = \begin{cases} 0 & \text{if } x < T_x \\ x - T_x & \text{if } T_x \leq x \leq T_x + 1 \\ 1 & \text{if } x > T_x + 1 \end{cases} \tag{A.1}$$

$$g_y(y) = \begin{cases} 0 & \text{if } y < 0 \\ g_1 y & \text{if } 0 \leq y \leq L_y \\ g_1 L_y + g_2 (y - L_y) & \text{if } y > L_y, \end{cases} \tag{A.2}$$
where $T_x = 1$, $L_y = 1.2$, $g_1 = 0.21$, and $g_2 = 2.5$. Except for the case of Figure 5B, where variable input strengths are used for different edges, the edge input strength for all visible edge segments in all other cases is the same: $\hat{I}_{i\theta} = 1.02$, and $\hat{I}_{i\theta} = 0$ otherwise. The weighting function $\psi(\theta)$ for the inhibitory cell to the local excitatory cells at the same grid points is

$$\psi(\theta) = \begin{cases} 1 & \text{when } \theta = 0 \\ 0.8 & \text{when } |\theta| = \pi/K = 15^\circ \\ 0.7 & \text{when } |\theta| = 2\pi/K = 30^\circ \\ 0 & \text{otherwise.} \end{cases} \tag{A.3}$$
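Definitions A.1 through A.3 translate directly into code. A minimal transcription (the rounding to the orientation grid in $\psi$ is my own convenience; the model evaluates $\psi$ only at multiples of $\pi/K$):

```python
import numpy as np

K = 12                                   # orientations per grid point
Tx, Ly, g1, g2 = 1.0, 1.2, 0.21, 2.5     # parameter values from the text

def g_x(x):                              # eq. A.1: threshold-linear, saturating
    return np.clip(np.asarray(x, float) - Tx, 0.0, 1.0)

def g_y(y):                              # eq. A.2: rectified, two-slope gain
    y = np.asarray(y, float)
    return np.where(y < 0, 0.0,
                    np.where(y <= Ly, g1 * y, g1 * Ly + g2 * (y - Ly)))

def psi(dtheta):                         # eq. A.3: local inhibitory weighting
    k = round(abs(float(dtheta)) / (np.pi / K))
    return {0: 1.0, 1: 0.8, 2: 0.7}.get(k, 0.0)

print(g_x(1.5), g_y(2.0), psi(np.pi / K))   # -> 0.5  2.252  0.8
```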
The background input to the inhibitory cells is $I_{c,\text{background}} = 1.0$. The central feedback control $I_{c,\text{control}}$ is applied to the inhibitory cells of the edge segments in a way analogous to how the visual inputs are applied to the excitatory cells. If the higher centers intend a control to grid point $i$ with orientation $\beta$, then $I_{c,\text{control}}$ on the edge segment $(i\theta)$ is

$$I_{c,\text{control}}(i\theta) = \hat{I}_{c,\text{control}}(i\beta)\, \psi(\theta - \beta), \tag{A.4}$$

where $\hat{I}_{c,\text{control}}(i\beta)$ serves a function analogous to $\hat{I}_{i\beta}$ in the visual input. The background input $I_o$ to the excitatory cells includes the following:

$$I_o = I_{e,\text{background}} + I_{\text{normalization}}, \tag{A.5}$$

where $I_{e,\text{background}} = 0.85$, and $I_{\text{normalization}}$ is a normalization current that depends on the local edge activities, so its value for the edge element $(i\theta)$ is

$$I_{\text{normalization}}(i\theta) = -2.0 \left[ \frac{\sum_{j \in S_i} \sum_{\theta'} g_x(x_{j\theta'})}{\sum_{j \in S_i} 1} \right]^2, \tag{A.6}$$

where $S_i$ is the neighborhood of all grid points $j$ that are no more than two grid distances away from $i$. This normalization $I_{\text{normalization}}$ follows the model by Heeger (1992) for cortical cells, accounting for non-orientation-specific local cortical activity normalization and nonlinearity. It can be implemented by other inhibitory interneurons of relatively short time constants that receive inputs from local excitatory cell pools and feed back to them. In addition, each neuron receives an input $I_{\text{noise}}$, which is random noise with an average temporal width of 0.1 and an average height of 0.1. Noise inputs to different neurons are independent.

The self-excitatory connection is $J_o = 0.8$. The long-range synaptic connections $J_{i\theta,j\theta'}$ and $W_{i\theta,j\theta'}$ are determined as follows. Let the two edge elements $(i\theta)$ and $(j\theta')$ be separated by a grid distance $d$, and denote the angles between the edge elements and the line connecting the two edge elements by $\theta_1$
and $\theta_2$, where $|\theta_1| \leq |\theta_2| \leq \pi/2$, and $\theta_{1,2}$ are positive or negative depending on whether the edges rotate clockwise or counterclockwise toward the connecting line by no more than a $\pi/2$ angle. Denote $\beta = 2|\theta_1| + 2\sin(|\theta_1 + \theta_2|)$ and $\Delta\theta = \theta - \theta'$ with $|\theta - \theta'| \leq \pi/2$; then

$$J_{i\theta,j\theta'} = \begin{cases} 0.126\, e^{-(\beta/d)^2 - 2(\beta/d)^7 - d^2/90} & \text{if } 0 < d \leq 10.0 \text{ and } \beta < \pi/2.69, \\ & \quad \text{or } 0 < d \leq 10.0 \text{ and } \beta < \pi/1.1 \\ & \quad \text{and } |\theta_1| < \pi/5.9 \text{ and } |\theta_2| < \pi/5.9 \\ 0 & \text{otherwise;} \end{cases}$$

$$W_{i\theta,j\theta'} = \begin{cases} 0 & \text{if } d = 0 \text{ or } d \geq 10 \text{ or } \beta < \pi/1.1 \\ & \quad \text{or } |\Delta\theta| \geq \pi/3 \text{ or } |\theta_1| < \pi/11.999 \\ 0.14\,\big(1 - e^{-0.4(\beta/d)^{1.5}}\big)\, e^{-(\Delta\theta/(\pi/4))^{1.5}} & \text{otherwise.} \end{cases}$$
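These formulas can be transcribed directly. A sketch follows, assuming $d$, $\theta_1$, $\theta_2$, and $\Delta\theta$ have already been computed as defined above (for example, with a geometry helper like the one sketched in section 3.3); the absolute value around $\Delta\theta$ in the exponent of W is my reading of the formula, needed for a real-valued power.

```python
import numpy as np

def beta_fn(theta1, theta2):
    return 2 * abs(theta1) + 2 * np.sin(abs(theta1 + theta2))

def J_conn(d, theta1, theta2):
    """Monosynaptic excitatory connection J (appendix formula)."""
    if not 0 < d <= 10.0:
        return 0.0
    beta = beta_fn(theta1, theta2)
    smooth = beta < np.pi / 2.69 or (beta < np.pi / 1.1 and
              abs(theta1) < np.pi / 5.9 and abs(theta2) < np.pi / 5.9)
    if not smooth:
        return 0.0
    return 0.126 * np.exp(-(beta / d) ** 2 - 2 * (beta / d) ** 7 - d ** 2 / 90)

def W_conn(d, theta1, theta2, dtheta):
    """Disynaptic inhibitory connection W (appendix formula)."""
    beta = beta_fn(theta1, theta2)
    if (d == 0 or d >= 10.0 or beta < np.pi / 1.1
            or abs(dtheta) >= np.pi / 3 or abs(theta1) < np.pi / 11.999):
        return 0.0
    return 0.14 * (1 - np.exp(-0.4 * (beta / d) ** 1.5)) \
                * np.exp(-(abs(dtheta) / (np.pi / 4)) ** 1.5)

# Nearly coaligned edges two grid units apart excite each other; a parallel
# edge displaced orthogonally to its orientation is inhibited instead.
print(J_conn(2.0, 0.05, -0.05), W_conn(2.0, np.pi / 2, np.pi / 2, 0.0))
```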
A.2 Derivation of the Connection Structure. We derive the qualitative structure of the neural connections using the following computational requirements: (1) two nearby edge segments should enhance each other's activities if one could draw a smooth contour passing through both of them; (2) a small enough gap in a smooth contour of sufficient input length and strength should be filled in by the network; (3) a smooth contour of finite length and width in the input should not grow in length or width through the network enhancement.

According to equations 3.7 and 3.8, an edge segment $(j\theta')$ can enhance or suppress an edge segment $(i\theta)$ by sending monosynaptic excitatory input via the connection $J_{i\theta,j\theta'}$ or disynaptic inhibitory input via the connection $W_{i\theta,j\theta'}$. Hence, condition 1 above requires $J_{i\theta,j\theta'} \neq 0$ if $(i\theta)$ and $(j\theta')$ are nearby and roughly coaligned, as illustrated in Figure 4. The bounds on the overall scale of the J connections can be obtained by considering a contour that is a straight line on the x-axis, as in equation 3.9. Omit the $\theta$ variables, let the $i$th segment be at $x = 0$, the origin of the x-axis, and let $j$ index the other contour segments. If only segment $i$ is missing in the input, condition 2 requires $\sum_j J_{ij}\, g_x(x_j) > T_x$, or $\sum_j J_{ij} > T_x$ (for $g_x(x_j) \leq 1$). If segment $i$ and all segments $j$ on the positive x-axis are missing in the input, condition 3 requires $\sum_{j<0} J_{ij} < T_x$ if the strong contour is to be prevented from growing into the positive x-axis. These requirements, together with the contour reflection symmetry $J_{ij} = J_{ji}$, constrain the overall scale of J to within a factor of 2.

However, a segment near and roughly parallel to the x-axis is excited by the coaligned contour segments on the x-axis and is likely to be filled in by condition 2, since it provides an alternative route for the contour. To prevent such a straight contour from thickening, condition 3 requires disynaptic inhibitory connections $W_{i\theta,j\theta'}$ to this segment from the nonaligned horizontal contour segments on the x-axis. A W connection structure as in Figure 4 results; a lower bound on the overall scale of the W connections can then be obtained.
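The two scale constraints can be checked against the actual J values for the straight-line case, using the `J_conn` sketch above (segments at integer grid distances, all colinear so that $\theta_1 = \theta_2 = 0$):

```python
# Condition 2: filling a single missing segment needs sum_j J_ij > Tx.
# Condition 3: not growing past a line's end needs sum_{j<0} J_ij < Tx.
Tx = 1.0
one_side = sum(J_conn(d, 0.0, 0.0) for d in range(1, 11))
both_sides = 2 * one_side                    # contour reflection symmetry
print("sum_j J_ij     =", round(both_sides, 3), "> Tx:", both_sides > Tx)
print("sum_{j<0} J_ij =", round(one_side, 3), "< Tx:", one_side < Tx)
```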
Condition 1 discourages W connections between coaligned segments, and condition 3 discourages J connections between roughly parallel segments displaced roughly orthogonally to their orientations. Our model connections are then derived from all of these requirements, in addition to the space invariance (symmetry) requirements for the computation and the locality and smoothness requirements expected as reasonable for cortical connections (see section 2).

Acknowledgments

I thank Jochen Braun for introducing me to the topic and helping me to appreciate the difficulties of dynamic stability in models of cortical circuits. I also thank Peter Dayan for many helpful conversations, discussions, and comments on the manuscript; and Yair Weiss, Shimon Ullman, Mriganka Sur, Geoffrey Hinton, and William Freeman for discussions and help with references. This work was supported by the Hong Kong Research Grant Council and the Center for Biological and Computational Learning at MIT.

References

Allman, J., Miezin, F., & McGuinness, E. (1985). Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Ann. Rev. Neurosci., 8, 407–430.

Alonso, J. M., Cudeiro, J., Perez, R., Gonzalez, F., & Acuna, C. (1993). Influence of layer 5 of layer 6 of area 19 of the cat visual cortex on responses of cells in layer 5 of area 17 to stimuli of high velocity. Exp. Brain Res., 93, 363–366.

Baldi, P., & Meir, R. (1990). Computing with arrays of coupled oscillators: An application to preattentive texture discrimination. Neural Comp., 2, 458–471.

Braun, J., Niebur, E., Schuster, H. G., & Koch, C. (1994). Perceptual contour completion: A model based on local, anisotropic, fast-adapting interactions between oriented filters. Society for Neuroscience Abstracts, 20, 1665.

Douglas, R. J., & Martin, K. A. (1990). Neocortex. In G. M. Shepherd (Ed.), Synaptic organization of the brain (3rd ed.). New York: Oxford University Press.

Eckhorn, R. (1994). Oscillatory and non-oscillatory synchronizations in the visual cortex and their possible roles in associations of visual features. Progress in Brain Research, 102, 405–426.

Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybern., 60, 121–130.

Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.

Fitzpatrick, D. (1996). The functional organization of local circuits in visual cortex: Insights from the study of tree shrew striate cortex. Cerebral Cortex, 6, 329–341.
Freeman, W. J. (1987). Simulation of chaotic EEG patterns with a dynamic model of the olfactory system. Biol. Cybern., 56(2–3), 139–150.

Gilbert, C. D. (1992). Horizontal integration and cortical dynamics. Neuron, 9(1), 1–13.

Gilbert, C. D., & Wiesel, T. N. (1983). Clustered intrinsic connections in cat visual cortex. J. Neurosci., 3(5), 1116–1133.

Gilbert, C. D., & Wiesel, T. N. (1989). Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci., 9(7), 2432–2442.

Gilbert, C. D., & Wiesel, T. N. (1990). The influence of contextual stimuli on the orientation selectivity of cells in primary visual cortex of the cat. Vision Res., 30(11), 1689–1701.

Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.

Grosof, D. H., Shapley, R. M., & Hawken, M. J. (1993). Macaque V1 neurons can signal "illusory" contours. Nature, 365(6446), 550–552.

Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentations. Percept. Psychophys., 38(2), 141–171.

Guy, G., & Medioni, G. (1993). Inferring global perceptual contours from local features. In Proc. IEEE CVPR (pp. 786–787).

Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neurosci., 9, 181–197.

Heitger, F., & von der Heydt, R. (1993). A computational model of neural contour processing: Figure-ground segregation and illusory contours. In Proc. of 4th Intl. Conf. on Computer Vision (pp. 32–40).

Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. J. Neurosci., 11(6), 1800–1809.

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.

Hummel, R., & Zucker, S. W. (1983). On the foundation of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5, 267–287.

Johnson, R. R., & Burkhalter, A. (1991). Feedback connections in visual cortex contact inhibitory neurons. Soc. Neurosci. Abst., 332, 8.

Johnson, R. R., & Burkhalter, A. (1992). Corticocortical feedback connections in visual cortex synapse selectively with dendritic spines. Soc. Neurosci. Abst., 134, 15.

Kapadia, M. K., Ito, M., Gilbert, C. D., & Westheimer, G. (1995). Improvement in visual sensitivity by changes in local context: Parallel studies in human observers and in V1 of alert monkeys. Neuron, 15(4), 843–856.

Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: Active contour models. Int. J. Computer Vision, 1, 321–332.

Kienker, P. K., Sejnowski, T. J., Hinton, G. E., & Schumacher, L. E. (1986). Separating figure from ground with a parallel network. Perception, 15, 197–216.

Knierim, J. J., & van Essen, D. C. (1992). Neuronal responses to static texture patterns in area V1 of the alert macaque monkeys. J. Neurophysiol., 67, 961–980.
938
Zhaoping Li
Konig, ¨ P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comput., 3, 155–166. Kovacs, I., & Julesz, B. (1993). A closed curve is much more than an incomplete one: Effect of closure in figure-ground segmentation. Proc. Natl. Acad. Sci. USA, 15, 90(16), 7495–7497. Lamme, V. A. (1995). The neurophysiology of figure-ground segregation in primary visual cortex. Journal of Neuroscience, 15(2), 1605–1615. Li, Zhaoping. (1990). A model of olfactory adaptation and sensitivity enhancement in the olfactory bulb. Biological Cybernetics, 62, 349–361. Li, Zhaoping. (1997). Primary cortical dynamics for visual computation. In K-Y. M. Wong, I. King, & D-Y. Yeung (Eds.) Theoretical Aspects of Neural Computation workshop, Hong Kong: Springer-Verlag. Li, Zhaoping, & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern., 61(5), 379–392. Livingstone, M. S., & Hubel, D. H. (1984). Anatomy and physiology of a color system in the primate visual cortex. J. Neurosci., 4, 309–356. Maffei, L., & Fiorentini, A. (1976). The unresponsive regions of visual cortical receptive fields. Vision Res., 16(10), 1131–1139. Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in vivo targeted biocyting injections in primate striate cortex. Proc. Natl. Acad. Sci. USA, 90, 10469–10473. Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman. McGuire, B. A., Gilbert, C. D., Rivlin, P. K., & Wiesel, T. N. (1991). Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305, 370–392. Mignard, M., & Malpeli, J. G. (1991). Paths of information flow through visual cortex. Science, 251, 1249–1251. Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784. Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. J. Neurophysiol., 70(3), 909–919. Murata, T., & Shimizu, H. (1993). Oscillatory binocular system and temporal segmentation of stereoscopic depth surfaces. Biological Cybernetics, 68, 381– 391. Nault, B., Michaud, Y., Morin, C., Casanova, C., & Molotchnikoff, S. (1990). Responsiveness of cells in area 17 after local interception of the descending path from area 18. Soc. Neurosci. Abst., 502, 7. Polat, U., & Sagi, D. (1994). The architecture of perceptual spatial interactions. Vision Res., 34(1), 73–78. Rockland, K. S., & Lund, J. S. (1983). Intrinsic laminar lattice connections in primate visual cortex. J. Comp. Neurol., 216(3), 303–318. Salin, P., & Bullier, J. (1995). Corticocortical connections in the visual system: Structure and function. Physiological Reviews, 75, 107–154.
Neural Model of Contour Integration
939
Sengpiel, R., Baddeley, R., Freeman, T., Harrad, R., & Blakemore, C. (1995). Two different mechanisms underlie suppressive phenomena in cat primary visual cortex. Soc. Neurosci. Abstr., 21, 1629. Shashua, A., & Ullman, S. (1988). Structural saliency. Proceedings of the International Conference on Computer Vision (pp. 482–488). Tampa, FL. Shepherd, G. M. (1990). Synaptic organization of the brain (3rd ed.) New York: Oxford University Press. Sillito, A. M., Grieve, K. L., Jones, H. E., Cudeiro, J., & Davis, J. (1995). Visual cortical mechanisms detecting focal orientation discontinuities. Nature, 378, 492–496. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586. Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. Journal of Neuroscience, 15(8), 5448–5465. Somers, D. C., Todorov, E. V., Siapas, A. G., & Sur, M. (1995). Vector-based integration of local and long-range information in visual cortex (AI Memo. No. 1556). Cambridge, MA: MIT. Sompolinsky, H., Golumb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Phys. Rev. A., 43, 6990–7011. Sporns, O., Tononi, G., & Edelman, G. M. (1991). Modeling perceptual grouping and figure-ground segregation by means of reentrant connections. Proc. Natl. Acad. Sci. USA, 88, 129–133. Ts’o, D., & Gilbert, C. (1988). The organization of chromatic and spatial interactions in the primate striate cortex. J. Neurosci., 8, 1712–1727. Ts’o, D. Y., Gilbert, C. D., & Wiesel, T. N. (1986). Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci., 6(4), 1160–1170. Valverde, F. (1985). The organization principles of the primary visual cortex in the monkey. In A. Peters & E. G. Jones (Eds.), Cerebral cortex (pp. 207–258). New York: Plenum Press. van Essen, D. (1985). Functional organization of primate visual cortex. In A. Peters & E. G. Jones (Eds.), Cerebral cortex (pp. 259–329). New York: Plenum Press. von der Heydt, R., Peterhans, E., & Baumgartner, G. (1984). Illusory contours and cortical neuron responses. Science, 224, 1260–1262. von der Malsburg, C. (1981). The correlation theory of brain function (Report). Gottingen, ¨ West Germany: Max-Planck-Institute for Biophysical Chemistry. von der Malsburg, C., & Buhmann, J. (1992). Sensory segmentation with coupled neural oscillators. Biological Cybernetics, 67, 233–242. Wang, D. (1995). Synchronous oscillations based on lateral connections. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral interactions in the cortex: Structure and function [Electronic book]. Available http://www.cs.utexas.edu/users/nn/web-pubs/htmlbook96/. Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. C. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neurons, 15, 541–552.
940
Zhaoping Li
White, E. L. (1989). Cortical circuits. Boston: Birkhauser. Williams, L. R., & Jacobs, D. W. (1996). Local parallel computation of stochastic completion fields. In Proc. IEEE CVPR (pp. 161–168). Yen, S.-C., & Finkel, L. H. (1997). Salient contour extraction by temporal binding in a cortically-based network. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Neural Information Processing Systems 9. Cambridge, MA: MIT Press. Zipser, K., Lamme, V. A., & Schiller, P. H. (1996). Contextual modulation in primary visual cortex. J. Neurosci., 15, 7376–7389. Zucker, S. W., Dobbins, A., & Iverson, L. (1989). Two stages of curve detection suggest two styles of visual computation. Neural Computation, 1, 68–81. Received August 23, 1996; accepted August 13, 1997.
LETTER
Communicated by Richard Lippmann
Predictive Neural Networks for Learning the Time Course of Blood Glucose Levels from the Complex Interaction of Counterregulatory Hormones Klaus Prank Clemens Jurgens ¨ Alexander von zur Muhlen ¨ Georg Brabant Abteilung Klinische Endokrinologie, Medizinische Hochschule Hannover, D-30623 Hannover, Germany
Diabetes mellitus is a widespread disease associated with an impaired hormonal regulation of normal blood glucose levels. Patients with insulin-dependent diabetes mellitus (IDDM) who practice conventional insulin therapy are at risk of developing hypoglycemia (low levels of blood glucose), which can lead to severe dysfunction of the central nervous system. In large retrospective studies, up to approximately 4% of deaths of patients with IDDM have been attributed to hypoglycemia (Cryer, Fisher, & Shamoon, 1994; Tunbridge, 1981; Deckert, Poulson, & Larsen, 1978). Thus, a better understanding of the complex hormonal interaction preventing hypoglycemia is crucial for treatment. Experimental data from a study on insulin-induced hypoglycemia in healthy subjects are used to demonstrate that feedforward neural networks are capable of predicting the time course of blood glucose levels from the complex interaction of glucose counterregulatory (glucose-raising) hormones and insulin. By simulating the deficiency of single hormonal factors in this regulatory network, we found that the predictive impact of glucagon, epinephrine, and growth hormone secretion, but not of cortisol and norepinephrine, were dominant in restoring normal levels of blood glucose following hypoglycemia. 1 Introduction Glucose is an essential metabolic substrate for the brain that ensures the normal functioning of the central nervous system (Cryer et al., 1994). To maintain constant levels of blood glucose (glucose homeostasis) a redundant system of glucose counterregulatory (glucose-raising) factors has evolved that ensures the continuous supply of glucose. This system consists of the activation of hormones such as glucagon, catecholamines (epinephrine, norepinephrine), growth hormone (GH), and cortisol upon low concentrations of blood glucose (hypoglycemia). As shown in Figure 1 the complex pattern Neural Computation 10, 941–953 (1998)
c 1998 Massachusetts Institute of Technology °
942
Klaus Prank et al.
Figure 1: Time course of counterregulatory hormones, insulin, and glucose. After a 30-minute baseline measurement, a bolus injection of insulin was given to induce hypoglycemia. The measurements of hormonal activation and glucose levels were continued for another 60 minutes. Mean plasma levels (n = 6). One single y-axis is used to allow for an easy display of all counterregulatory hormones, insulin, and glucose. To get the exact scaling of the respective hormone and glucose respectively, multiply with the following factors: Glucose: 0.01 mmol/L, glucagon: 0.5 ng/L, epinephrine: 0.01 nmol/L, cortisol: 2 nmol/L, growth hormone: 0.2 µg/L.
of secretion of these hormones has a temporal hierarchy that begins with the increase of glucagon and epinephrine secretion, followed by an activation of norepinephrine, cortisol, and GH. To date, there have been no methods for exploring the cooperative regulation of this hormonal system. Feedforward neural networks have the advantage of broad linear and nonlinear approximation abilities without making specific assumptions of the system under study. We have demonstrated recently that feedforward neural networks used for time-series prediction are able to separate the secretory pattern of parathyroid hormone
Time Course of Blood Glucose Levels
943
(PTH) in a group of patients with osteoporosis (a bone disease) from that in a group of healthy subjects (Prank et al., 1995). This predictive approach could be extended to separating the temporal pattern of GH secretion in patients with excessive irregular GH release (acromegaly) from healthy subjects (Prank et al., 1996a, b). Nonlinear artificial neural networks are capable of learning relationships between input and output data that are not obviously apparent to human analysis (Weigend, Huberman, & Rumelhart, 1990). Here we used feedforward neural networks for two purposes: (1) to capture the temporal dynamics of single hormones in the regulation of blood glucose homeostasis and evaluate their predictive rank in the counterregulatory mechanism following insulin-induced hypoglycemia and (2) to predict the future time course of blood glucose levels from past values of the counterregulatory hormones and insulin. In contrast to previous studies in which we predicted only the future dynamics of a single hormone from the past (Prank et al., 1995, 1996a, 1996b) we now predict the time course of blood glucose from the future and past dynamics of multiple regulatory hormones. In addition, we predict blood glucose levels using only data from the past. A similar predictive approach for a cardiological problem has been published by Baxt and coworkers (Baxt 1990, 1991, 1992a, 1992b; Baxt & White, 1995; Baxt & Skora, 1996), in which the presence of acute myocardial infarction is predicted from clinical input variables and their predictive impact is evaluated. 2 Data and Methods 2.1 Data and Experimental Protocol. Six healthy young male subjects took part in this study. The study was approved by the local Committee of Medical Ethics, and all subjects gave their written informed consent. To determine blood levels of counterregulatory hormones (glucagon, epinephrine, norepinephrine, growth hormone, cortisol), insulin, and glucose, blood samples were drawn every 2 minutes via a central venous catheter. This catheter was placed 3 hours before the beginning of the sampling procedure. The baseline levels of the hormones were measured over 30 minutes before a bolus of 0.15 input units regular insulin/kg body weight was injected intravenously to induce hypoglycemia (see Figure 1). High-frequency blood sampling was continued for another 60 minutes to evaluate the counterregulatory hormonal activation and the blood glucose temporal profile. Plasma glucose concentrations were determined by the hexokinase-6-phosphate-dehydrogenase method. Plasma epinephrine and norepinephrine concentrations were measured in duplicate by using a single-isotope catecholamine orthomethyl transferase radioenzymatic assay (Peuler & Johnson, 1977) and a separation method by reverse-phase high-frequency liquid chromatography prior to scintillation counting (Ratge, Baumgardt, Knoll, & Wisser, 1983). Glucagon, cortisol, and insulin were determined by commercially available radioimmunoassays (Ensinck, 1983;
944
Klaus Prank et al.
Figure 2: Feedforward neural network (fully connected, but not all connections are shown here) for predicting blood glucose levels from glucose-raising (counterregulatory) hormones and insulin. Measured hormone concentrations from the future and the past were used for making predictions.
Farmer & Pierce, 1974; Kuzuya, Blix, Horwitz, & Steiner, 1977). Growth hormone was analyzed by a highly sensitive chemiluminescence assay (Iranmanesh, Grisso, & Veldhuis, 1994). All samples were analyzed in duplicate, and except for epinephrine and norepinephrine, the samples of one study day were all analyzed in the same run to avoid interassay variation. 2.2 Methods. We explored the predictability of blood glucose levels from the time course of the counterregulatory hormones and insulin, applying a large variety of fully connected feedforward neural networks with linear as well as nonlinear (sigmoidal) activation functions (see Figure 2). 2.2.1 Network Architectures and Prediction of Glucose Levels. The network size ranged from 18 to 54 input units corresponding to a 6-minute- to 18minute-wide window for each of the six hormones used for predicting glucose levels. The number of hidden units varied between 0 and 15, and one output unit was used for the predicted blood glucose level (see Figure 2). Glucose levels glc(ti )∗ were predicted from the actual as well as m succeeding and m preceding measurements of hormone concentrations: glc(ti )∗ = f (ins(ti−m , . . . , ti+m ), glu(ti−m , . . . , ti+m ), epi(ti−m , . . . , ti+m ), nor(ti−m , . . . , ti+m ), hgh(ti−m , . . . , ti+m ), cor(ti−m , . . . , ti+m )),
Time Course of Blood Glucose Levels
945
where the variable names are defined as follows: ins, insulin; glu, glucagon; epi, epinephrine; nor, norepinephrine; hgh, human growth hormone; and cor, cortisol. The input and target data were normalized to a [0, 1] interval prior to training and testing. In addition to this prediction scheme, we predicted the future time course of blood glucose levels up to k time steps into the future using n preceding measurements of hormone concentrations: glc(ti+k )∗ = f (ins(ti , . . . , ti−n+1 ), glu(ti , . . . , ti−n+1 ), epi(ti , . . . , ti−n+1 ), nor(ti , . . . , ti−n+1 ), hgh(ti , . . . , ti−n+1 ), cor(ti , . . . , ti−n+1 )). The predictive performance of each network architecture was evaluated using the normalized mean squared error (NMSE) as the prediction error estimate (Gershenfeld & Weigend, 1994), since it is decorrelated from the variance of the respective time series: NMSE =
h(glci − glc∗i )2 i , σ 2 (glci )
where the angle brackets denote the mean over all predictions of glucose levels and σ 2 (glci ) denotes the variance of the measured glucose concentrations. An NMSE value of 1.0 may be achieved simply by always guessing the mean of the measured glucose profile as the predicted value. Thus, any NMSE value below 1.0 means that the respective network is able to predict the temporal dynamics of blood glucose levels from the time course of the measured hormones. The time step where the prediction error NMSE reached a value of 1.0 was recorded for each of the subjects and defined as the prediction time. 2.2.2 Pruning. To reduce the cost of the network and improve generalization, we explored a variety of pruning techniques, such as a noncontributing-units pruning algorithm (Dow & Sietsma, 1991; Biedermann, 1994), Optimal Brain Damage (Solla, Le Cun, & Denker, 1990), Optimal Brain Surgeon (Stork & Hassibi, 1993), and magnitude-based pruning. 2.2.3 Training and Validation. Weights were updated using resilient propagation (RProp), a local adaptive learning algorithm performing supervised batch learning in multilayer perceptrons (Riedmiller & Braun, 1993). The basic principle of RProp is to eliminate the harmful influence of the size of the partial derivative on the weight step. As a consequence, only the sign of the derivative is considered to indicate the direction of the weight update. The RProp algorithm takes three parameters: the inital update value 10 , which determines the size of the first weight step (10 = 0.1); the maximum weight step 1max (1max = 50.0); and the weight-decay term α, which determines the relationship between the output error and the reduction in the size of the weights (α = 10−4 ). Because of the limited number of cases
946
Klaus Prank et al.
available in this study, we made use of a “leave-one-out” or “jackknife” method for training. The data sets of one subject were held out of the training procedure for testing, and another one was left for cross-validation to terminate training and avoid overfitting. The results obtained using RProp were compared to results applying standard backpropagation (“vanilla” backpropagation) using a learning factor of η = 0.2. The simulations in this study were performed on a Sun SPARCstation 20 using the Stuttgart Neural Network Simulator (SNNS). 2.2.4 Testing and Simulation of Deficient Hormonal Activation. The testing NMSE for the prediction of the blood glucose levels was evaluated for each of the subjects using the normalized data sets from all six hormones measured in this study. We simulated a deficiency in the activation of the counterregulatory hormones following hypoglycemia (following the insulin bolus injection). This was done by normalizing the values in the activation phase (from t = 0 minute to t = 60 minutes) to the maximum value of the respective hormone concentration during the baseline phase (from t = −30 minutes to t = 0 minute). This simulation of deficient activation was done separately for each hormone. 3 Results We found that a feedforward network with sigmoidal activation functions, 5 input units per hormone (8-minute input window), 10 hidden units, and 1 output unit performed best using the NMSE as the performance criterion for the prediction using past and future hormonal data as well as for the prediction using only data from the past. This network architecture was used in all of the simulations presented in the remainder of this article. Linear predictors were not capable of predicting the dynamics of blood glucose from the counterregulatory hormones. Using a linear feedforward neural network with 5 input units per hormone, we found a mean prediction error (NMSE) of 2.62 ± 1.42 for prediction blood glucose from future and past values. This sort of predictor is equivalent to an autoregressive (AR) model. Using all measured hormones, the time course of blood glucose levels demonstrated the best predictability using RProp as the training algorithm combined with a noncontributing-units pruning technique (see Table 1 and Figure 3). The other pruning techniques explored in this study did not improve the predictive performance compared to the noncontributing-units technique (see Table 2). Using standard backpropagation resulted in a decreased generalization compared to RProp (see Table 1). Simulating a deficiency in the activation of epinephrine to restore normal blood glucose levels resulted in the largest prediction error found in the simulations (see Table 1 and Figure 3), followed by growth hormone and glucagon (see Table 1). In contrast to these hormones, the omission of the secretory activation of norepinephrine (see Table 1) or cortisol (see Table 1 and Figure 3) has no major impact on the
Time Course of Blood Glucose Levels
947
Table 1: Prediction Error vs. Deficient Activation of Single Counterregulatory Hormones. Deficient Hormonal Activation None Epinephrine Norepinephrine Glucagon Cortisol Growth hormone
RProp
Backprop
0.22 ± 0.12 1.18 ± 0.23 0.23 ± 0.09 0.38 ± 0.14 0.26 ± 0.14 0.50 ± 0.36
0.92 ± 0.65 1.46 ± 0.85 0.89 ± 0.22 1.19 ± 0.71 0.89 ± 0.22 1.07 ± 0.55
Note: Hormone concentration data from the future and the past were used for these predictions (see Figure 2). The data are given as mean ± S.D. in six subjects.
Table 2: Prediction Error Using Hormone Concentration Values from the Future and the Past. Pruning Method None Noncontributing units Magnitude pruning Optimal Brain Surgeon Optimal Brain Damage
NMSE 0.30 ± 0.09 0.22 ± 0.12 0.32 ± 0.13 0.41 ± 0.23 0.50 ± 0.16
Note: Weights were updated using RProp. Data are given as mean ± S.D. in six subjects.
predictability of the blood glucose profile. By using only hormone concentrations from the past, blood glucose levels were predictable up to 11 ± 7 minutes applying an NMSE value of 1.0 as the cutoff (see Figure 4). The prediction time ranged from 4 minutes to 24 minutes within the six subjects investigated for glucose predictability in this study. 4 Discussion The neural network approach presented in this study is to our knowledge the first that statistically evaluates the cooperative regulation in a complex hormonal system. This is achieved by learning a nonlinear mapping of only small input data sets of each counterregulatory hormone and insulin to the blood glucose levels as the target values. In contrast to nonlinear feedforward networks, simple linear predictors that are equivalent to autoregressive models were not capable of forecasting the time course of blood glucose from the complex interaction of the counterregulatory hormones. Using a noncontributing-units pruning technique, we slightly improved the predictive performance of the best predictor. This might be due to the fact that
948
Klaus Prank et al.
Figure 3: Measured and predicted time course of mean blood glucose levels (N = 6). RProp and a “noncontributing-units” pruning technique were used. The neural network architecture as shown in Figure 2 was used for making predictions.
Time Course of Blood Glucose Levels
949
Figure 4: Representative measured and predicted (4 to 12 minutes ahead) time course of blood glucose levels in one subject. Only hormonal data from the past 8 minutes were used for these predictions. No glucose data were used as input values.
950
Klaus Prank et al.
our extensive evaluation of different neural network architectures led to an almost optimal network for the specific task of blood glucose prediction. In contrast to our approches presented in this study, Tresp, Moody, and Delong (1994) used a combination of a compartment module and a neural network module to predict the long-term behavior of blood glucose levels in a diabetic patient. In the preliminary study of Tresp et al. (1994), the data used for predicting blood glucose were measured only in one diabetic patient over almost 6 months and consisted of the times and dosages of insulin injections, the times and amounts of food intake, and the times and durations of exercise. Blood glucose levels were measured only a few times a day. No quantitative results of the predictive performance were given. Applying a bootstrap procedure by simulating the deficiency in the activation of single hormonal factors following insulin-induced hypoglycemia, we could specifically identify the predictive impact of each input variable on the restoration of normal blood glucose levels. We found that epinephrine, growth hormone, and glucagon have the highest predictive impact on the time course of blood glucose levels, whereas deficient activation of norepinephrine or cortisol had only little impact on the predictability of blood glucose. This ranking for the impact of counterregulatory hormones holds true for both the RProp and the backprop learning algorithm, although RProp generalized better than backprop. The approach presented in this article allows for the simulation of a complete or partial deficiency of a single and a combination of regulatory factors. In patients with insulin-dependent diabetes mellitus who suffer from recurrent hypoglycemia, not only the catecholaminergic activation but also the response of other counterregulatory hormones such as glucagon to hypoglycemia is attenuated (Mitrakou et al., 1991; Cryer, 1992). Such an impaired response to hypoglycemia also plays an important role in patients after pancreaticoduodenectomy (standard Whipple’s operation) who are lacking a normal counterregulatory response of glucagon or patients after pituitary surgery associated with a deficiency of growth hormone and cortisol release (Cryer, 1992). The approach presented in this study allows for a systematic evaluation of such situations. Furthermore, we have demonstrated that the current methodology is applicable to predicting blood glucose levels using only past knowledge of the temporal profiles of hormonal regulators, although our approach is not aimed as a one-to-one correspondence of the physiological counterregulation. Measuring hormone concentrations in the bloodstream on a highly frequent basis, as done in this study, is still in an experimental status because of the enormous study set-up and the costs for the hormone concentration measurements. To date, biosensors that would be capable of assessing fluctuations of hormone concentrations in the bloodstream on an ”on-line” continuous basis are not available. However, biosensors containing biological receptors such as the nicotinic acetylcholine receptor (Gotoh, Tamiya, Momoi, Kagawa, & Karube, 1987) and the L-glutamate receptor (Uto, Michaelis, Hu, Umezawa, & Kuwana, 1990) have been developed. The methods presented in this arti-
Time Course of Blood Glucose Levels
951
cle offer an avenue to explore the complex physiological regulation of other biological target systems by interacting hormonal systems. Acknowledgments This study was supported by the Deutsche Forschungsgemeinschaft under grant Br 915/4-2. References Baxt, W. G. (1990). Use of an artificial neural network for data analysis in clinical decision-making: The diagnosis of acute coronary occlusion. Neural Comp., 2, 480–489. Baxt, W. G. (1991). Use of an artificial neural network for the diagnosis of myocardial infarction. Ann. Intern. Med., 155, 843–848. Baxt, W. G. (1992a). Improving the accuracy of an artificial neural network using multiple differently trained networks. Neural Comp., 4, 772–780. Baxt, W. G. (1992b). Analysis of the clinical variables driving decision in an artificial neural network trained to identify the presence of myocardial infarction. Ann. Emerg. Med., 21, 1439–1444. Baxt, W. G., & Skora, J. (1996). Prospective validation of artificial neural networks trained to identify acute myocardial infarction. Lancet, 347, 12–15. Baxt, W. G., & White, H. (1995). Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction. Neural Comp., 7, 624–638. Biedermann, J. (1994). Anwendungen Neuronaler Netze beim VLSI-CAD. Diploma thesis, Institut fur ¨ Numerische und Angewandte Methematik, GeorgAugust-Universit¨at Gottingen. ¨ Cryer, P. E. (1992). Glucose homeostasis and hypoglycemia. In J. D. Wilson & D. W. Foster (Eds.), Williams textbook of endocrinology (pp. 1223–1253). Philadelphia: W. B. Saunders. Cryer, P. E., Fisher, J. N., & Shamoon, H. (1994). Hypoglycemia. Diabetes Care, 17, 734–755. Deckert, T., Poulsen, J. E., & Larsen, M. (1978). Prognosis of diabetics with diabetes before the age of thirty-one. I. Survival, cause of deaths and complications. Diabetologia, 14, 363–370. Dow, R., & Sietsma, J. (1991). Creating artificial neural networks that generalize. Neural Networks, 4, 67–79. Ensinck, J. W. (1983). Immunoassays for glucagon. In P. LeFebvre (Ed.), Glucagon: Handbook of experimental pharmacology (pp. 203–221). Berlin: Springer-Verlag. Farmer, R. W., & Pierce, C. E. (1974). Plasma cortisol determination: Radioimmunoassay and competitive protein binding compared. Clin. Chem., 20, 411– 414. Gershenfeld, N. A., & Weigend, A. S. (1994). The future of time series: Learning and understanding. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series
952
Klaus Prank et al.
prediction: Forecasting the future and understanding the past (pp. 1–70). Reading, MA: Addison-Wesley. Gotoh, M., Tamiya, E., Momoi, M., Kagawa, Y., & Karube, I. (1987). Acetylcholine sensor based on ion sensitive field effect transistor and acetylcholine receptor. Anal. Lett., 20, 857–870. Iranmanesh, A., Grisso, B., & Veldhuis, J. D. (1994). Low basal and persistent pulsatile growth hormone secretion are revealed in normal and hyposomatotropic men studied with a new ultrasensitive chemiluminescence assay. J. Clin. Endocrinol. Metab., 78, 526–535. Kuzuya, H., Blix, P. M., Horwitz, D. L., & Steiner, D. F. (1977). Determination of free and total insulin and c-peptide in insulin-treated diabetics. Diabetes, 26, 22–29. Mitrakou, A., Ryan, C., Veneman, T., Mokan, M., Jenssen, T., Kiss, I., Durrant, J., Cryer, P., & Gerich, J. (1991). Hierarchy of glycemic thresholds for counterregulatory hormone secretion, symptoms, and cerebral dysfunction. Am. J. Physiol., 260, E67–E74. Peuler, J. D., & Johnson, G. A. (1977). Simultaneous single isotope radioenzymatic assay of norepinephrine and dopamine. Life Sci., 21, 625–636. Prank, K., Kloppstech, M., Nowlan, S. J., Sejnowski, T. J., & Brabant, G. (1996a). Self-organized segmentation of time series: Separating growth hormone secretion in acromegaly from normal controls. Biophys. J., 70, 2540–2547. Prank, K., Kloppstech, M., Nowlan, S. J., Sejnowski, T. J., & Brabant, G. (1996b). Random secretion of growth hormone in humans. Phys. Rev. Lett., 77, 1909– 1911. Prank, K., Nowlan, S. J., Harms, H. M., Kloppstech, M., Brabant, G., Hesch, R. D., & Sejnowksi, T. J. (1995). Time series prediction of plasma hormone concentration: Evidence for differences in predictability of parathyroid hormone secretion between osteoporotic patients and normal controls. J. Clin. Invest., 95, 2910–2919. Ratge, D., Baumgardt, G., Knoll, E., & Wisser, H. (1983). Plasma free and conjugated catecholamines in diagnosis and localization of pheochromocytoma. Clin. Chim. Acta, 132, 229–235. Riedmiller, M., & Braun, H. (1993). A direct adaptive method for faster backpropagation learning: The RPROP algorithm. In Proceedings of the IEEE International Conference on Neural Networks 1993. Solla, S., Le Cun, Y., & Denker, J. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann. Stork, D., & Hassibi, B. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In T. J. Sejnowski, G. E. Hinton, & D. S. Touretzky (Eds.), Advances in neural information processing systems (pp. 164–171). San Mateo, CA: Morgan Kaufmann. Tresp, V., Moody, J., & Delong, W. R. (1994). Neural network modeling of physiological processes. In S. J. Hanson, T. Petsche, M. Kearns, & R. L. Rivest (Eds.), Computational learning theory and natural learning systems 2. Cambridge, MA: MIT Press. Tunbridge, W. M. G. (1981). Factors contributing to deaths of diabetics under 50 years of age. Lancet, 2, 569–572.
Time Course of Blood Glucose Levels
953
Uto, M., Michaelis, E. K., Hu, I. F., Umezawa, Y., & Kuwana, T. (1990). Biosensor development with a glutamate receptor ion-channel reconstituted in a lipid bilayer. Anal. Sci., 6, 221–225. Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1990). Predicting the future: A connectionist approach. Int. J. Neural Syst., 1, 193–209. Received May 2, 1996; accepted September 17, 1997.
LETTER
Communicated by Frederico Girosi
Properties of Support Vector Machines Massimiliano Pontil INFM, Dipartimento di Fisica dell’Universit`a di Genova, 16146 Genova, Italy
Alessandro Verri INFM, D151 Universit`a di Genova, 16146 Genova, Italy
Support vector machines (SVMs) perform pattern recognition between two point classes by finding a decision surface determined by certain points of the training set, termed support vectors (SV). This surface, which in some feature space of possibly infinite dimension can be regarded as a hyperplane, is obtained from the solution of a problem of quadratic programming that depends on a regularization parameter. In this article, we study some mathematical properties of support vectors and show that the decision surface can be written as the sum of two orthogonal terms, the first depending on only the margin vectors (which are SVs lying on the margin), the second proportional to the regularization parameter. For almost all values of the parameter, this enables us to predict how the decision surface varies for small parameter changes. In the special but important case of feature space of finite dimension m, we also show that there are at most m + 1 margin vectors and observe that m + 1 SVs are usually sufficient to determine the decision surface fully. For relatively small m, this latter result leads to a consistent reduction of the SV number. 1 Introduction Support vector machines (SVMs) have been recently introduced as a new technique for solving pattern recognition problems (Cortes & Vapnik, 1995; Blanz et al., 1996; Scholkopf ¨ et al., 1996; Osuna, Freund, & Girosi, 1997). According to the theory of SVMs (Vapnik, 1982, 1995), while traditional techniques for pattern recognition are based on the minimization of the empirical risk—that is, on the attempt to optimize the performance on the training set—SVMs minimize the structural risk—that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but unknown probability distribution of the data. This new induction principle, which is equivalent to minimizing an upper bound on the generalization error, relies on the theory of uniform convergence in probability (Vapnik, 1982). What makes SVMs attractive is the ability to condense the information contained in the training set and the use of families of decision surfaces of relatively low VC dimension (Vapnik & Chervonenkis, 1971). Neural Computation 10, 955–974 (1998)
c 1998 Massachusetts Institute of Technology °
956
Massimiliano Pontil and Alessandro Verri
In the linear, separable case, the key idea of an SVM can be explained simply. Given a training set S that contains points of either of two classes, an SVM separates the classes through a hyperplane determined by certain points of S, termed support vectors. In the separable case, this hyperplane maximizes the margin, or twice the minimum distance of either class from the hyperplane, and all the support vectors lie at the same minimum distance from the hyperplane (and are thus termed margin vectors). In real cases, the two classes may not be separable, and both the hyperplane and the support vectors are obtained from the solution of a problem of constrained optimization. The solution is a trade-off between the largest margin and the lowest number of errors, with trade-off controlled by a regularization parameter. This article explores the nature of support vectors and explains how the regularization parameter determines the decision surface, in both the linear and nonlinear case. We thus investigate some mathematical properties of support vectors and characterize the dependence of the decision surface on the changes of the regularization parameter. The analysis is first carried out in the simpler linear case and then extended to include nonlinear decision surfaces. We review the theory of SVMs in section 2 and present our analysis in section 3. We summarize the conclusions of our work in section 4. 2 Theoretical Overview In this section, we recall the basics of the theory of SVM (Vapnik, 1995; Cortes & Vapnik, 1995) in both the linear and nonlinear cases. We start with the simple case of linearly separable sets. 2.1 Optimal Separating Hyperplane. In what follows we assume we are given a set S of points xi ∈ Rn with i = 1, 2, . . . , N. Each point xi belongs to either of two classes and thus is given a label yi ∈ {−1, 1}. The goal is to establish the equation of a hyperplane that divides S, leaving all the points of the same class on the same side while maximizing the minimum distance between either of the two classes and the hyperplane. For this purpose, we need some preliminary definitions. Definition 1. such that
The set S is linearly separable if there exist w ∈ Rn and b ∈ R
w · xi + b ≥ 1 w · xi + b ≤ −1
if yi = 1, if yi = −1.
(2.1)
In more compact notation, these two inequalities can be rewritten as yi (w · xi + b) ≥ 1, for i = 1, 2, . . . , N. The pair (w, b) defines a hyperplane of equation w · x + b = 0,
(2.2)
Properties of Support Vector Machines
(a)
957
(b)
Figure 1: Separating hyperplane and optimal separating hyperplane. The solid lines in (a) and (b) separate the two identical sets of open circles and triangles, but the solid line in (b) leaves the closest points (the filled circles and triangle) at the maximum distance. The dashed lines in (b) identify the margin.
named separating hyperplane (see Figure 1a). If we denote with w the norm of w, the signed distance di of a point xi from the separating hyperplane (w, b) is given by di =
w · xi + b . w
(2.3)
Combining equations 2.2 and 2.3 for all xi ∈ S, we have yi di ≥
1 . w
(2.4)
Therefore, 1/w is the lower bound on the distance between the points xi and the separating hyperplane (w, b). One might ask, Why not simply rewrite equation 2.2 as yi (w · xi + b) ≥ 0? The purpose of the 1 on the right-hand side of equation 2.2 is to establish a one-to-one correspondence between separating hyperplanes and their parametric representation. This is done through the notion of canonical representation of a separating hyperplane.1 1 This intermediate step toward the derivation of optimal separating hyperplanes is slightly different from the derivation originally developed in Cortes and Vapnik (1995).
958
Massimiliano Pontil and Alessandro Verri
Definition 2. Given a separating hyperplane (w, b) for the linearly separable set S, the canonical representation of the separating hyperplane is obtained by rescaling the pair (w, b) into the pair (w0 , b0 ) in such a way that the distance of the closest point equals 1/w0 . Through this definition we have, minxi ∈S {yi (w0 · xi + b0 )} = 1. Consequently, for a separating hyperplane in the canonical representation, the bound in equation 2.4 is tight. In what follows we assume that a separating hyperplane is always given the canonical representation and thus write (w, b) instead of (w0 , b0 ). We are now in a position to define the notion of optimal separating hyperplane. Definition 3. Given a linearly separable set S, the optimal separating hyperplane (OSH) is the separating hyperplane that maximizes the distance of the closest point of S. Since the distance of the closest point equals 1/w, the OSH can be regarded as the solution of the problem of maximizing 1/w subject to the in equation 2.2 or, Problem P1 Minimize subject to
1 2w · w yi (w · xi
+ b) ≥ 1,
i = 1, 2, . . . , N.
Two comments are in order. First, if the pair (w, b) solves P1, then for at least one xi ∈ S, we have yi (w · xi + b) = 1. In particular, this implies that the solution of P1 is always a separating hyperplane in the canonical representation. Second, the parameter b enters in the constraints but not in the function to be minimized. The quantity 2/w, which measures the distance between the two classes in the direction of w, is named margin. Hence, the OSH can also be seen as a separating hyperplane, which maximizes the margin (see Figure 1b). We now study the properties of the solution of the problem P1. 2.2 Support Vectors. Problem P1 can be solved by means of the classical method of Lagrange multipliers (Bazaraa & Shetty, 1979). If we denote with α = (α1 , α2 , . . . , αN ) the N nonnegative Lagrange multipliers associated with the constraints in equation 2.2, the solution to problem P1 is equivalent to determining the saddle point of the function L=
N X 1 w·w− αi {yi (w · xi + b) − 1}, 2 i=1
(2.5)
Properties of Support Vector Machines
959
¯ and with L = L(w, b, α). At the saddle point, L has a minimum for w = w ¯ , and thus we can write, b = b¯ and a maximum for α = α N X ∂L = yi αi = 0, ∂b i=1
(2.6)
N X ∂L = w− αi yi xi = 0 ∂w i=1
with ∂L = ∂w
µ
∂L ∂L ∂L , ,..., ∂w1 ∂w2 ∂wn
(2.7)
¶ .
Substituting equations 2.6 and 2.7 into the right-hand side of equation 2.5, we see that problem P1 reduces to the maximization of the function
L(α) =
N X
αi −
i=1
N 1X αi αj yi yj xi · xj , 2 i,j=1
subject to the constraint of equation 2.6 with α ≥ 0.2 This new problem, called dual problem, can be formulated as: Problem P2 Maximize subject to
1 − P2 α · Dα + yi αi = 0 α ≥ 0,
P
αi
where both sums are for i = 1, 2, . . . , N, and D is an N × N matrix such that Dij = yi yj xi · xj .
(2.8)
¯ from equation 2.7, it follows that ¯ b), As for the pair (w, ¯ = w
N X
α¯ i yi xi ,
(2.9)
i=1
while b¯ can be determined from the Kuhn-Tucker conditions ¯ − 1) = 0, ¯ · xi + b) α¯ i (yi (w
i = 1, 2, . . . , N.
(2.10)
Note that the only α¯ i that can be nonzero in equation 2.10 are those for which the constraints (in equation 2.2) are satisfied with the equality sign. 2
In what follows, α ≥ 0 means αi ≥ 0 for every component αi of any vector α.
960
Massimiliano Pontil and Alessandro Verri
The corresponding points xi , termed support vectors, are the points of S closest to the OSH (see Figure 1b). Given a support vector xj , the parameter b¯ can be obtained from the corresponding Kuhn-Tucker condition as ¯ · xj . b¯ = yj − w The problem of classifying a new data point x is now simply solved by computing ¯ ¯ · x + b). sign(w
(2.11)
In conclusion, the support vectors condense all the information contained in the training set S, which is needed to classify new data points. 2.3 Linearly Nonseparable Case. If the set S is not linearly separable or one simply ignores whether the set S is linearly separable, the problem of searching for an OSH is meaningless (there may be no separating hyperplane to start with). Fortunately, the previous analysis can be generalized by introducing N nonnegative variables ξ = (ξ1 , ξ2 , . . . , ξN ) such that yi (w · xi + b) ≥ 1 − ξi ,
i = 1, 2, . . . , N.
(2.12)
If the point xi satisfies inequality 2.2, then ξi is null and equation 2.12 reduces to equation 2.2. Instead, if the point xi does not satisfy inequality 2.2, the term −ξi is added to the right-hand side of it to obtain inequality 2.12. The generalized OSH is then regarded as the solution to Problem P3 Minimize subject to
P 1 ξi 2w · w + C yi (w · xi + b) ≥ 1
ξ ≥ 0.
− ξi
i = 1, 2, . . . , N
P The term C ξi , where the sum is for i = 1, 2, . . . , N, can be thought of as some measure of the amount of misclassification. Note that this term leads to a more robust solution, in the statistical sense, than P the intuitively more P appealing term C ξi2 . In other words, the term C ξi makes the OSH less sensitive to the presence of outliers in the training set. The parameter C can be regarded as a regularization parameter. The OSH tends to maximize the minimum distance 1/w for small C and minimize the number of misclassified points for large C. For intermediate values of C, the solution of problem P3 trades errors for a larger margin. The behavior of the OSH as a function of C will be studied in detail in the next section.
Properties of Support Vector Machines
961
In analogy to what was done for the separable case, problem P3 can be transformed into the dual Problem P4 Maximize subject to
− 12 α · Dα + P yi αi = 0 0 ≤ αi ≤ C,
P
αi i = 1, 2, . . . , N,
with D the same N ×N matrix of the separable case. Note that the dimension of P4 is given by the size of the training set, while the dimension of the input space gives the rank of D. From the constraints of problem P4, it follows that if C is sufficiently large and the set S linearly separable, problem P4 reduces to P2. ¯ it is easy to find that ¯ b), As for the pair (w, ¯ = w
N X
α¯ i yi xi ,
i=1
¯ , solution of the dual problem P4, while b¯ can again be determined from α and from the new Kuhn-Tucker conditions, ¯ − 1 + ξ¯i ) = 0 ¯ · xi + b) α¯ i (yi (w (C − α¯ i )ξ¯i = 0,
(2.13) (2.14)
where the ξ¯i are the values of the ξi at the saddle point. Similar to the separable case, the points xi for which α¯ i > 0 are termed support vectors. The main difference is that here we have to distinguish between the support vectors for which α¯ i < C and those for which α¯ i = C. In the first case, from condition 2.14, it follows that ξ¯i = 0, and hence, from condition 2.13, that the support vectors lie at a distance 1/w¯ from the OSH. These support vectors are termed margin vectors. The support vectors for which α¯ i = C, instead, are misclassified points (if ξi > 1), points correctly classified but closer than 1/w¯ from the OSH (if 0 < ξ ≤ 1), or, in some degenerate cases, even points lying on the margin (if ξi = 0). In any event, we refer to all the support vectors for which αi = C as errors. An example of generalized OSH with the relative margin vectors and errors is shown in Figure 2. All the points that are not support vectors are correctly classified and lie outside the margin strip. We conclude this section by discussing the extension of the theory to the nonlinear case. 2.4 Nonlinear Kernels. In most cases, linear separation in input space is too restrictive a hypothesis to be of practical use. Fortunately, the theory can be extended to nonlinear separating surfaces by mapping the input points into feature points and looking for the OSH in the corresponding feature space (Cortes & Vapnik, 1995).
962
Massimiliano Pontil and Alessandro Verri
Figure 2: Generalized optimal separating hyperplane. The two sets of circles and triangles are not linearly separable. The solid line is the optimal separating hyperplane; the filled circles and triangles are the support vectors (the margin vectors are shown in black, the errors in gray).
If x ∈ Rn is an input point, we let ϕ(x) be the corresponding feature point with ϕ a mapping from Rn to a certain space Z (typically a Hilbert space of finite or infinite dimension). In both cases, we denote with ϕi the components of ϕ. Clearly, to an OSH in Z corresponds a nonlinear separating surface in input space. At first sight, it might seem that this nonlinear surface cannot be determined unless the mapping ϕ is completely known. However, from the formulation of problem P4 and the classification stage of equation 2.11, it follows that ϕ enters only in the dot product between feature points, since Dij = yi yj ϕ(xi ) · ϕ(xj ), and ¯ · ϕ(x) + b¯ = w
X
¯ α¯ i yi ϕ(xi ) · ϕ(x) + b.
Consequently, if we find an expression for the dot product in feature space that uses the points in input space only, that is
ϕ(xi ) · ϕ(xj ) = K(xi , xj ),
(2.15)
full knowledge of ϕ is not necessary. The symmetric function K in equation 2.15 is called kernel. The nonlinear separating surface can be found as the solution of problem P4 with Dij = yi yj K(xi , xj ), while the classification stage reduces to computing ´ ³X sign α¯ i yi K(xi , x) + b¯ .
Properties of Support Vector Machines
963
Therefore, the extension of the theory to the nonlinear case is reduced to finding kernels that identify certain families of decision surfaces and can be written as in equation 2.15. A useful criterion for deciding whether a kernel can be written as in equation 2.15 is given by Mercer’s theorem (Courant & Hilbert, 1959; Cortes & Vapnik, 1995): a kernel K(x, y), with x, y ∈ Rn , is a dot product in some feature space, or K(x, y) = ϕ(x) · ϕ(y), if and only if ZZ K(x, y) = K(y, x) and K(x, y) f (x) f (y) dx dy ≥ 0, ∀ f ∈ L2 . Given such a kernel K, a possible set of functions ϕ = (ϕ1 , ϕ2 , . . .) satisfying equation 2.15 can be determined from the eigenfunctions ϕˆi solution of the eigenvalue problem, Z (2.16) K(x, y)ϕˆ i (x) dx = λi ϕˆi (y), √ ˆ is finite, the kernel K is said with ϕi = λi ϕˆi . If the set of eigenfunctions ϕ to be finite and can be rewritten as X (2.17) K(x, y) = λi ϕˆi (x)ϕˆi (y), where the sum ranges over the set of eigenfunctions. In the general case, the set ϕ is infinite, the kernel is said to be infinite, and the sum in equation 2.17 becomes a series or an integral. We now give two simple examples of kernels. The first is the polynomial kernel, K(x, y) = (1 + x · y)d ,
x, y ∈ [−a, a]d .
It can easily be verified that the polynomial kernel satisfies Mercer’s theorem and is finite. The separating surface in input space is a polynomial surface of degree d. In this case, a mapping ϕ can be determined directly from the definition of K. In the particular case, n = 2 and d = 2; for example, if x = (x1 , x2 ), we can write √ √ √ ϕ(x) = (1, 2x1 , 2x2 , x21 , x22 , 2x1 x2 ). The second example is the gaussian kernel, ¶ µ −kx − yk2 , K(x, y) = exp 2σ 2 for some σ ∈ R. The gaussian kernel clearly satisfies Mercer’s theorem but is infinite because equation 2.16 has a continuum of eigenvalues. It is easy to verify that in this case, the eigenvalues are given by the normalized √ Fourier transform of the gaussian, 2π σ exp(−ksk2 σ 2 /2), with exp(ix · s) as corresponding eigenfunctions. The separating surface in input space is a weighted sum of gaussians centered on the support vectors. We are now fully equipped to discuss some mathematical properties of the solution of problem P4.
964
Massimiliano Pontil and Alessandro Verri
3 Mathematical Properties The goal is to study the dependence of the OSH on the parameter C. We first deal with the linear case and then extend the analysis to nonlinear kernels. 3.1 Lagrange Multiplier of a Margin Vector. We start by establishing a simple but important result on the Lagrange multipliers of the margin vectors. We want to show that the Lagrange multiplier associated with a margin vector is a step-wise linear function of the regularization parameter C. To prove it, we need a few preliminary definitions. Since there is no risk ¯ and w. ¯ ¯ , b, of confusion, we now write α, b, and w instead of α We introduce the sets of support vector indexes, I = {i: 0 < αi < C}
and
J = {i: αi = C},
and let M + 1 and E be the number of indexes in I and J, respectively. The set I identifies the M + 1 margin vectors, and J the E errors. Although E can also be equal to 0, we suppose that there are at least two margin vectors (that is, M > 0). This last hypothesis may not be satisfied for highly degenerate configurations of points and small values of C, but does not appear to be restrictive in cases of interest. Finally, and with no further loss of generality, we assume that all the points are support vectors3 and, hence, that M + 1 + E = N. We start by sorting the support vectors so that [ {N} and J = {M + 1, M + 2, . . . , N − 1}, I = I∗ with I∗ = {1, 2, . . . , M}, and labeling the points so that yN = −1. The KuhnTucker conditions (see equation 2.13) for i ∈ I tell us that yi (w · xi + b) = 1.
(3.1)
Equation 3.1, by means of equations 2.8 and 2.9, can be rewritten as N X
αj Dji + yi b = 1.
(3.2)
j=1
From the equality constraint have αN =
N−1 X
αi yi .
P
yi αi = 0, instead, and since yN = −1, we
(3.3)
i=1
3 This follows from the fact that if the points with α = 0 are discarded, problem P4 i has still the same solution.
Properties of Support Vector Machines
965
At the same time, from equation 3.2 with i = N, we get b=
N X
αj DjN − 1.
(3.4)
j=1
Plugging equations 3.3 and 3.4 into 3.2, we obtain N−1 X
αj Hji = 1 + yi ,
i ∈ I∗ ,
(3.5)
j=1
where H is the (N − 1) × (N − 1) matrix Hij = yi yj (xi − xN ) · (xj − xN ).
(3.6)
Notice that H can be written as ¶ µ HM HME , H= > HME HE HM being the M×M submatrix between margin vectors, HE the E×E submatrix between errors, and HME the M × E submatrix between margin vectors and errors. Separating the sum on margin vectors and errors in equation 3.5, we find: X X αj Hji + C Hji = 1 + yi , i ∈ I∗ . (3.7) j∈I
j∈J
In vector notation, equation 3.7 rewrites as HM αM + CHME 1E = 1M + yM , with αM = (α1 , α2 , . . . , αM ), yM = (y1 , y2 , . . . , yM ), and 1M and 1E the M- and E-vectors with all the components equal to unit. Assuming that the matrix HM is invertible (see the appendix for a proof of this fact), we have ¡ ¢ αM = HM−1 1M + yM − CHM−1 HME 1E . (3.8) From equation 3.8, we infer that the Lagrange multiplier associated with a margin vector can always be written as the sum of two terms. As made clear by the subscript M, the first term depends on only the margin vectors, while the second is proportional to C and depends on both the margin vectors and errors. An important consequence of the existence of HM−1 is that the vectors xi − xN , = 1, 2, . . . , M are linearly independent. As a corollary, the number of margin vectors cannot exceed n + 1, that is, M ≤ n. Notice that this does not mean that the number of points lying on the margin cannot exceed n+1. In degenerate cases, there may be points lying on the margin with α = 0, or even support vectors lying on the margin with α = C.
966
Massimiliano Pontil and Alessandro Verri
3.2 Dependence on the Regularization Parameter. We are now in a position to study the dependence of the OSH on the parameter C. We first show that the normal to the OSH can be written as the sum of two orthogonal vectors. 3.2.1 Orthogonal Decomposition. In components equation 3.8 can be rewritten as αi = ri + gi C
i ∈ I∗ ,
(3.9)
with rM = HM−1 (1M + yM )
(3.10)
gM = −HM−1 HME 1E .
(3.11)
and
Notice that the ri and gi are not necessarily positive (although they cannot be both negative). If we define rN =
X
ri yi
i∈I∗
gN =
X
gi yi +
(3.12) X
i∈I∗
yi ,
(3.13)
i∈J
then equation 3.9 is also true for the margin vector of index N as X X X X X ri yi + gi yi C + yi C = yi αi + C yi = αN , rN + gN C = i∈I∗
i∈I∗
i∈J
i∈I∗
i∈J
where the last equality is due to the constraint in equation 2.6 and the fact that αi = C for all i ∈ J. Plugging equation 3.9 into 2.9 and separating the constant and linear term, we obtain w = w1 + Cw2 ,
(3.14)
with w1 =
X
ri yi xi ,
i∈I
w2 =
X i∈J
yi xi +
(3.15) X
gi yi xi .
(3.16)
i∈I
It can easily be seen that w1 and w2 are orthogonal. Substituting equa-
Properties of Support Vector Machines
967
tions 3.12 and 3.13 into 3.15 and 3.16, respectively, we obtain w1 =
X
ri yi (xi − xN ),
i∈I∗
w2 =
X
yi (xi − xN ) +
X
gi yi (xi − xN ).
i∈I∗
i∈J
Then, through the definition of HM and HME , we have w1 · w2 = rM HME 1E + rM HM gM .
(3.17)
Plugging equation 3.11 into 3.17, it follows immediately that w_1 · w_2 = 0.

3.2.2 Changing the Regularization Parameter. We now study the effect of small changes of the regularization parameter C on the OSH. Since C is the only free parameter of SVMs, this study is relevant from both the theoretical and practical viewpoints. In what follows, we let C take on values over the positive real axis R+. First, we notice that the possible choices of support vectors for all possible values of C (distinguishing between margin vectors and errors) are finite. If we neglect degenerate configurations of support vectors, this implies that R+ can be partitioned into a finite number of disjoint intervals, each characterized by a fixed set of support vectors. Notice that the rightmost interval is necessarily unbounded. After this preliminary observation, we can already conclude that, with the exception of the C values corresponding to the interval ends, the set of support vectors does not vary for small changes of C. But through the previous analysis, we can also study the dependence of the normal vector w on the parameter C. From equation 3.14, it follows that if C changes by δC and the margin vectors and errors remain the same, the normal vector w changes by δC w_2 along the direction of w_2. We can make this statement more precise by distinguishing between two cases. In the first case, we let M reach the maximum value n. Since H_M always has maximum rank, we have n + 1 independent Kuhn-Tucker conditions like equation 3.1, and the OSH is completely determined by the n + 1 margin vectors. Consequently, since for almost all C the set of support vectors remains the same for small changes of C, w_2 must vanish, and we have

w = Σ_{i∈I} r_i y_i x_i.   (3.18)
Equation 3.18 tells us that if M = n, the OSH is fixed and unambiguously identified by the n + 1 margin vectors. The fact that the OSH is fixed makes it possible to determine the maximum interval around C, say (C_1, C_2], in which the OSH is given by equation 3.18. To this end, it is sufficient to compute the r_i and g_i from equations 3.10 and 3.11 and find C_1 and C_2 as
the minimum and maximum C for which the α_i associated with the margin vector x_i satisfy the constraint 0 < α_i ≤ C. In the second case, we have M < n. The OSH is now given by equation 3.14 with w_2 ≠ 0. Thus, for a small change δC, the new OSH w′ can be written as

w′ = w + δC w_2.   (3.19)

Equation 3.19 tells us that if M < n, the OSH changes by an amount δC w_2. Here again there exists a maximum interval (C_1, C_2] around C in which the OSH is given by equation 3.19. Similar to the previous case, one could determine the minimum and maximum C for which the α_i associated with the margin vectors satisfy the constraint 0 < α_i ≤ C. However, since a change in the OSH may bring with it a new set of support vectors, these minimum and maximum values are only a lower and an upper bound for C_1 and C_2, respectively. Finally, we observe that even if M < n, the OSH can always be written as a linear combination of n + 1 support vectors, for example, by adding n + 1 − M errors.

3.2.3 A Numerical Example. We now illustrate both cases by means of the numerical example with n = 2 shown in Figure 3. Figure 3a shows the OSH found for the displayed training set with C = 4.0. The support vectors are denoted by the filled circles and triangles (the margin vectors in black, the errors in gray). In accordance with equation 3.18, since there are three margin vectors, the OSH is fixed. Straightforward computations predict that the OSH must remain the same for 2.7 < C ≤ 4.5. This prediction has been verified numerically. Figure 3b shows the new OSH obtained for C just outside the interval (2.7, 4.5] (C = 4.8). Notice that the errors are the same as in Figure 3a, while there are only two margin vectors. The OSH should now change for small variations of C, as predicted by equation 3.19. This has been verified numerically, and Figure 3c displays the OSHs obtained from equation 3.19 and from direct solution of problem P4 for C = 6.7. The two OSHs coincide within numerical precision. For a larger variation of C (C > 7.0; see Figure 3d), the number of margin vectors goes back to three, and the solution is again fixed. Notice that in this last transition, one of the errors became a margin vector (the error in the upper part of the margin strip of Figure 3c is a margin vector in Figure 3d). It is worth noticing that the solutions with smaller C (see Figures 3a and 3b) have a larger margin, while the solutions with larger C (see Figures 3c and 3d) have a smaller number of errors.
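The piecewise dependence of the OSH on C can also be observed with off-the-shelf solvers. The following is a minimal sketch, assuming scikit-learn and NumPy are available; the toy data set and the particular C values are illustrative assumptions, not the training set of Figure 3. Within each interval of C that keeps the same set of support vectors, the printed hyperplane should change little or not at all.

```python
# Sweep C and watch the separating hyperplane change only at a finite
# set of C values, as predicted in section 3.2.2.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.2, (30, 2)),    # class -1
               rng.normal(+1.0, 1.2, (30, 2))])   # class +1
y = np.array([-1] * 30 + [1] * 30)

for C in [2.8, 3.5, 4.4, 4.8, 6.7, 7.5]:
    svm = SVC(kernel="linear", C=C).fit(X, y)
    alpha = np.abs(svm.dual_coef_).ravel()        # alpha_i of the support vectors
    n_margin = int(np.sum(alpha < C - 1e-8))      # margin vectors: 0 < alpha_i < C
    n_error = len(alpha) - n_margin               # errors: alpha_i = C
    print(f"C={C:3.1f}  w={np.round(svm.coef_.ravel(), 3)}  "
          f"b={svm.intercept_[0]:+.3f}  margin={n_margin}  errors={n_error}")
```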
3.3 Extension to Nonlinear Kernels. We now extend the presented analysis to the case of nonlinear kernels.
Figure 3: Optimal separating hyperplane. (a) C = 4.0, (b) C = 4.8, (c) C = 6.7, (d) C = 7.5, respectively. Legend as in Figure 2.
3.3.1 Lagrange Multiplier of a Margin Vector. We start by observing that the same decomposition of the Lagrange multiplier of a margin vector derived in the linear case holds true for nonlinear kernels. Note that the matrix H of equation 3.6 rewrites as

H_ij = y_i y_j (K(x_i, x_j) − K(x_j, x_N) − K(x_i, x_N) + K(x_N, x_N)),   (3.20)

while equations 3.8 to 3.13 remain unchanged.
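As an illustration of equation 3.20, the following sketch (assuming NumPy; the RBF kernel and the toy points are hypothetical choices, not taken from the article) builds the matrix H for a nonlinear kernel from kernel evaluations alone.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def margin_matrix(X, y, K):
    """H_ij = y_i y_j (K(x_i,x_j) - K(x_j,x_N) - K(x_i,x_N) + K(x_N,x_N)),
    with x_N the last margin vector and i, j ranging over the others."""
    xN, M = X[-1], len(X) - 1
    H = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            H[i, j] = y[i] * y[j] * (K(X[i], X[j]) - K(X[j], xN)
                                     - K(X[i], xN) + K(xN, xN))
    return H

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([1, -1, 1])
print(margin_matrix(X, y, rbf))   # 2 x 2 and symmetric
```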
3.3.2 Orthogonal Decomposition. More care is needed for the extension of the orthogonal decomposition of w and the study of the behavior of the separating surface as a function of C. This is because, in the nonlinear case, it may not be possible to recover an explicit expression for w. However, this does not pose major problems, because all the expressions involving w are effectively dot products between feature points and can be computed by means of the kernel K. Indeed, if we take the dot product between w and ϕ(x), we obtain

w · ϕ(x) = Σ_{i=1}^{N} α_i y_i K(x_i, x),

which can be written as

Σ_{i=1}^{N} α_i y_i K(x_i, x) = Σ_{i∈I} r_i y_i K(x_i, x) + C ( Σ_{j∈J} y_j K(x_j, x) + Σ_{i∈I} g_i y_i K(x_i, x) ).   (3.21)
The two terms on the right-hand side (r.h.s.) of equation 3.21 are the counterparts of equations 3.15 and 3.16 defining w_1 and w_2, respectively. Note that even if the explicit expressions for w_1 and w_2 cannot be given, the orthogonality relation (see equation 3.17) remains true. This can be seen from the fact that the r.h.s. of equation 3.17 depends on the matrix H, which, in the nonlinear case, is rewritten as in equation 3.20. In this respect, the two terms on the r.h.s. of equation 3.21 can be regarded as orthogonal.

3.3.3 Changing the Regularization Parameter. So far, all the results derived in the linear case carry over to the case of nonlinear kernels. For the dependence of the separating surface on the parameter C, instead, it is convenient to distinguish between finite and infinite kernels. For finite kernels, all the results obtained in the linear case are still valid and can be rederived simply by replacing n, the dimension of input space, with m, the dimension of feature space. For example, if M = m, the OSH in feature space does not change for small changes of C, and the second term on the r.h.s. of equation 3.21 vanishes for all x. Furthermore, the interval (C_1, C_2], within which the OSH is fixed, can be determined exactly as in the linear case. For kernels of infinite dimension, instead, a finite number of margin vectors is not sufficient to determine the OSH fully. Consequently, and differently from the finite case, the OSH is never fixed, and the second term of equation 3.21 does not vanish. For a small change δC, the dot product w · ϕ(x) changes by the amount

δC ( Σ_{j∈J} y_j K(x_j, x) + Σ_{i∈I} g_i y_i K(x_i, x) ).
In summary, all the results derived in the linear case can be extended without major changes to the nonlinear case, with the exception of the properties that depend on the finiteness of the dimension, like the upper bound on the number of margin vectors; these properties remain true for finite kernels only.
4 Conclusion

In the case of pattern recognition, SVMs depend on only one free parameter, the regularization parameter C. In this article, we have discussed some mathematical properties of support vectors useful for characterizing the behavior of the decision surface with respect to C. We have identified a special subset of support vectors, the margin vectors, whose Lagrange multipliers are strictly smaller than the regularization parameter C. We have shown that the margin vectors are always linearly independent and that the decision surface can be written as the sum of two orthogonal terms, the first depending on only the margin vectors and the second proportional to the regularization parameter. For almost all values of the parameter, this enabled us to predict how the decision surface varies for small parameter changes. In general, we found that the solution is usually stable with respect to small changes of C.

The obtained results can be more conveniently summarized by distinguishing between finite and infinite kernels. For kernels of finite dimension m, it turned out that m + 1 is the least upper bound for the number of margin vectors (M + 1), and the behavior of the OSH as a function of C depends on whether M = m or M < m. If M = m, the M + 1 margin vectors are sufficient to determine fully the equation of the OSH in feature space, and for almost all values of C the OSH does not vary for small changes of C. If M < m, instead, the OSH varies by an amount proportional to the change δC in a direction identified by both the margin vectors and errors. In both cases, it is worthwhile observing that the number of support vectors effectively needed to identify the decision surface is never greater than m + 1. This latter result may be useful for reducing the number of support vectors effectively needed to perform recognition. For infinite kernels, the margin vectors are still linearly independent, but there is no upper bound on their number. For small changes of C, the OSH is not fixed and varies as in the case M < m of finite kernels.
Appendix

In this appendix we sketch the proof of the existence of H_M^{−1}. First, we need to transform the original dual problem P4 into a linear complementarity problem (LCP) and derive the explicit expression for the matrix G, which defines the polyhedral set on which the solution of the LCP lies. Let us define α = (α_1, α_2, ..., α_{N−1}) and recall that α_N = Σ y_i α_i, where the sum ranges over i = 1, 2, ..., N − 1. We let N_1 and N_2 be the
number of points with positive and negative labels, respectively. We start by rewriting problem P4 without the equality constraint:

Problem P5. Minimize

(1/2) α · Hα − 2 Σ_{i∈I+} α_i

subject to

−Σ_{i=1}^{N−1} y_i α_i ≤ 0,
Σ_{i=1}^{N−1} y_i α_i ≤ C,
α_i ≤ C,   i = 1, 2, ..., N − 1,
α_i ≥ 0,   i = 1, 2, ..., N − 1,
with I+ the set of indexes corresponding to the α_i for which y_i = 1. Then we let u_+, u_−, u = (u_1, u_2, ..., u_{N−1}), and v = (v_1, v_2, ..., v_{N−1}) be the 2N Lagrange multipliers associated with the constraints of problem P5, respectively. The LCP associated with problem P5 is obtained as follows:

1. Set equal to 0 the gradient of the Lagrangian associated with problem P5, or

Σ_{j=1}^{N−1} α_j H_ji − 1 + y_i (u_+ − u_−) − y_i + u_i − v_i = 0.

2. Introduce the N + 1 slack variables s_+, s_−, and s = (s_1, s_2, ..., s_{N−1}), satisfying

s_+ + Σ_{i=1}^{N−1} α_i y_i = 0,
s_− − Σ_{i=1}^{N−1} α_i y_i = C,

and s_i + α_i = C, along with the associated complementarity conditions,

s_− u_− = s_+ u_+ = 0,   s_i u_i = 0,

(In the constrained optimization jargon, a slack variable is a nonnegative variable that turns an inequality into an equality constraint.)
and α_i v_i = 0, for each i = 1, 2, ..., N − 1. The solution of problem P5 can be obtained as the solution of the LCP,

Problem P6. Solve

t − Mz = q

subject to

t, z ≥ 0   and   t_i z_i = 0,   i = 1, 2, ..., 2N,

with t = (s_−, s_+, s, v), z = (u_−, u_+, u, α),

M = ( 0    −A
      Aᵀ    H ),

where A is the (N + 1) × (N − 1) matrix

A = ( −y_1  · · ·  −y_{N−1}
       y_1  · · ·   y_{N−1}
            I_{N−1}        ),

q = (b, k), b = (0, C, ..., C) with N + 1 components, and k = (−2, ..., −2, 0, ..., 0) with N_1 components equal to −2 followed by N_2 − 1 zeros (the positively labeled points being indexed first).
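The construction of M, A, and q can be made concrete with a short sketch (assuming NumPy; the ordering of the positively labeled points first, and the toy values of H, y, and C, are assumptions for illustration):

```python
import numpy as np

def lcp_data(H, y, C):
    n = len(y)                                   # n = N - 1
    A = np.vstack([-y, y, np.eye(n)])            # (N + 1) x (N - 1)
    M = np.block([[np.zeros((n + 2, n + 2)), -A],
                  [A.T, H]])                     # 2N x 2N
    b = np.concatenate([[0.0], np.full(n + 1, C)])   # N + 1 components
    N1 = int(np.sum(y == 1))                     # assumes positives come first
    k = np.concatenate([np.full(N1, -2.0), np.zeros(n - N1)])
    return M, np.concatenate([b, k])             # q = (b, k)

M, q = lcp_data(H=np.eye(3), y=np.array([1.0, 1.0, -1.0]), C=1.0)
print(M.shape, q.shape)                          # (8, 8) (8,)
```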
Similar to the case of linear programming, a solution to problem P6 is a vertex of a polyhedral set. In addition, the solution must also satisfy the complementarity conditions. In the case of problem P6, a solution vector p = (t, z) is a vertex of the polyhedral set S = {p : Gp = q, p ≥ 0}, with G = [I_{2N}, −M], p = (p_B, p_N), p_B = B^{−1} q, p_N = 0, and B the 2N × 2N matrix defined by the columns of G corresponding to the 2N active variables. Through simple but lengthy calculations, it can be seen that the matrix H_M is a submatrix of B and H_M^{−1} a submatrix of B^{−1}. The existence of H_M^{−1} is thus ensured by the existence of B^{−1}.

Acknowledgments

Edgar Osuna read the manuscript and made useful remarks. This work has been partially supported by a grant from the Agenzia Spaziale Italiana.
LETTER
Communicated by Michael Perrone
Validation of Voting Committees Eric Bax Computer Science Department, California Institute of Technology 256-80, Pasadena, California, 91125, U.S.A.
This article contains a method to bound the test errors of voting committees with members chosen from a pool of trained classifiers. There are so many prospective committees that validating them directly does not achieve useful error bounds. Because there are fewer classifiers than prospective committees, it is better to validate the classifiers individually, then use linear programming to infer committee error bounds. We test the method using credit card data. Also, we extend the method to infer bounds for classifiers in general.

1 Introduction

Consider the following machine learning problem. There is an unknown boolean-valued target function and a distribution over the input space of the function. For example, the input distribution could consist of images encountered by commuters in a city, and the target function could be 1 if the input image contains a yellow bus and 0 otherwise. We have a set of in-sample data examples with inputs drawn according to the input distribution and outputs determined by the target function. We also have a set of test example inputs drawn according to the input distribution. Our goal is to find a classifier function with a low error rate on the test inputs. (The error rate is the fraction of examples for which the classifier and the target function disagree.) We use a portion of the in-sample data to train a pool of candidate classifiers. We use the remaining in-sample data, called the validation data, to select a voting committee of candidate classifiers. For each input, the committee returns the result shared by the majority of its members. (To avoid confusion, we restrict our attention to committees with odd numbers of members.) In this article, we develop a new method to bound the test error of the selected committee. In the next section, which focuses on validation, we derive VC-type (Vapnik & Chervonenkis, 1971) uniform bounds on test error. To achieve useful error bounds through validation, the number of classifiers must be small. Since the classifier pool is much smaller than the number of prospective committees, the classifier errors can be bounded more precisely than the committee errors.

Neural Computation 10, 975–986 (1998)
© 1998 Massachusetts Institute of Technology
Next, we show how to infer committee error bounds from member error bounds and test inputs. We use linear programming to find the maximum possible committee error given constraints imposed by member error bounds and the distribution of agreements among members. In the following section, we present some small numerical examples to build intuition regarding the new method. Then we analyze the committee error bound. We compare the linear programming method to direct validation, exploring differences in asymptotic behavior and discussing trade-offs between the methods. In the next section, we extend the linear program to infer error bounds for classifiers in general, not just voting committees. This makes the new method a valuable tool to derive error bounds for stacking classifiers (Wolpert, 1992; Sridhar, Seagrave, & Bartlett, 1996; Kim & Bartlett, 1995; Breiman, 1992; LeBlanc & Tibshirani, 1996). In stacking, the outputs from trained classifiers are fed into higher levels of classifiers. Generalization is improved through fusion and by estimating bias through resampling. Our method allows developers of stacked classifiers to derive test error bounds by the following process: Train the initial layer of classifiers, withholding some validation data. Use these data to compute uniform test error bounds for the classifiers. Use all in-sample data to train subsequent layers. Finally, infer a test error bound for the stacked classifier through linear programming. In the following section, we present tests to compare linear programming and direct validation as methods to compute committee test error bounds. Each example in the data set corresponds to a credit card applicant, and the task is to predict whether the applicant defaults. We find that linear programming produces superior bounds to direct validation.

2 Uniform Test Error Bounds and Validation

We develop two uniform upper bounds on the test errors of several classifiers. The first bound is weaker, but it is smooth. We will use it for analysis. The second bound is stronger. We will use it for tests on real data. For the first bound, we modify a simplified treatment of VC error bounds (Vapnik & Chervonenkis, 1971; Abu-Mostafa, 1996). Suppose we have M classifiers. Let d be the number of validation examples and d′ the number of test examples. Let ν_m be the validation error of classifier m and ν′_m the test error. Let π_m be the error rate over the entire input distribution, that is, the expected error rate for a random data set. For a single classifier selected without reference to the validation or test data, Hoeffding's inequality (Hoeffding, 1963; Vapnik, 1982) implies

Pr{π_m ≥ ν_m + ε} ≤ e^{−2ε²d}   for ε > 0   (2.1)

and

Pr{ν′_m ≥ π_m + ε} ≤ e^{−2ε²d′}   for ε > 0.   (2.2)
Using both bounds,

Pr{ν′_m ≥ ν_m + ε} ≤ Pr{ν′_m ≥ π_m + ε/2} + Pr{π_m ≥ ν_m + ε/2} ≤ 2e^{−ε²D/2},   (2.3)

where D = min(d, d′). For a single classifier, the test error rate is less than ε greater than the validation error rate with probability at least 1 − 2e^{−ε²D/2}. For uniform bounds over M classifiers selected without reference to the validation or test data, bound the probability union by the sum of probabilities,

Pr{ν′_1 ≥ ν_1 + ε or ... or ν′_M ≥ ν_M + ε} ≤ 2Me^{−ε²D/2}.   (2.4)
If validation data are used to choose a classifier, then the single classifier bound (see equation 2.3) does not apply. However, if the set from which the classifier is chosen is developed without reference to the validation data, then the uniform bound, in equation 2.4, applies to the chosen classifier, since uniform bounding implies bounding of the chosen classifier. This is exactly the case when we train a pool of n classifiers, withholding validation data, then use the data to select a committee of k classifiers. There are M = C(n, k) prospective committees, where C(n, k) is the binomial coefficient. The probability that no committee test error is more than ε greater than its validation error is at least 1 − 2C(n, k)e^{−ε²D/2}. Hence, we have confidence 1 − 2C(n, k)e^{−ε²D/2} in the error bound for the chosen committee.

Using uniform bounds for a single classifier and bounding the probability union by the sum of probabilities in equation 2.4 seems quite wasteful. Our bound will reduce the waste by using the uniform bound (see equation 2.4) only over the pool of classifiers, then using linear programming to infer uniform bounds over the prospective committees.

Now we develop the second error bound, which is a generalization of the bound at the heart of the original VC bound proof (Vapnik & Chervonenkis, 1971). For a single classifier, condition the bound on a given multiset of b = d + d′ inputs composing the validation and test inputs. Since the inputs are drawn independently and identically distributed (i.i.d.), each partition of the inputs into validation and test sets is equally likely. Let w be the number of inputs for which the classifier produces the incorrect output. The probability that the validation error is r/d is

C(w, r) C(b − w, d − r) / C(b, d).   (2.5)

If the validation error is r/d, then the test error is (w − r)/d′. So

Pr{ν′_m ≥ ν_m + ε | w} = Σ_{r : (w−r)/d′ ≥ r/d + ε} C(w, r) C(b − w, d − r) / C(b, d).   (2.6)
Bound by maximizing over w,

Pr{ν′_m ≥ ν_m + ε} ≤ max_{w∈{0,...,b}} Pr{ν′_m ≥ ν_m + ε | w}.   (2.7)

Refer to the bound as B(ε). Note that the bound is constant over all multisets of b = d + d′ inputs. Hence, integration over all possible input multisets removes the conditional nature of the bound. For a single classifier chosen without reference to validation data,

Pr{ν′_m ≥ ν_m + ε} ≤ B(ε).   (2.8)

The uniform bound for M classifiers is

Pr{ν′_1 ≥ ν_1 + ε or ... or ν′_M ≥ ν_M + ε} ≤ MB(ε).   (2.9)
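The bound B(ε) of equations 2.5 to 2.7 reduces to a maximization of hypergeometric tail probabilities, so it is straightforward to compute. The following is a minimal sketch assuming NumPy and SciPy are available; the helper name is illustrative, and the values d = d′ = 111 are those used later in section 6.

```python
import numpy as np
from scipy.stats import hypergeom

def partition_bound(eps, d, d_test):
    b = d + d_test
    worst = 0.0
    for w in range(b + 1):                       # w misclassified inputs overall
        rv = hypergeom(b, w, d)                  # population b, w "errors", d draws
        r = np.arange(max(0, w - d_test), min(w, d) + 1)
        bad = r[(w - r) / d_test >= r / d + eps] # validation errors low enough
        worst = max(worst, rv.pmf(bad).sum())    # that the test error exceeds by eps
    return worst

# With d = d' = 111 as in section 6, 15 * partition_bound(0.172, 111, 111)
# should come out just under 0.1, matching the 90% confidence level there.
print(partition_bound(0.172, 111, 111))
```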
3 Committee Error Bound

We will use the statistics of the voting patterns among committee members to bound committee error. For each subset S of {1, ..., k}, define a_S to be the fraction of test examples for which the classifiers indexed by S return 1 and the other classifiers return 0. For example, a_∅ is the fraction of examples for which every committee member returns 0. Likewise, if k = 5, then a_{1,3,4} is the fraction of examples for which members 1, 3, and 4 return 1 and members 2 and 5 return 0. Given error bounds on the members and the distribution of agreements, we will use linear programming to find an upper bound for committee error. Let e_1, ..., e_k be upper bounds for the error rates of members. For each subset S of {1, ..., k}, define x_S to be the probability that the committee vote returns the incorrect value, given that the committee members indexed by S return 1 and the other committee members return 0. The committee error rate is Σ_S a_S x_S, the weighted average of errors over voting patterns. For each member i, define W_i to be the set of sets S for which member i votes with the majority on the examples counted by a_S,

W_i = {S | i ∈ S and |S| > k/2} ∪ {S | i ∉ S and |S| < k/2}.   (3.1)

The error rate of member i is

Σ_{S∈W_i} a_S x_S + Σ_{S∉W_i} a_S (1 − x_S).   (3.2)

Hence, an upper bound for the committee error rate can be found by solving the following linear program:

maximize   Σ_S a_S x_S   (3.3)
over x_∅, ..., x_{1,...,k} such that

Σ_{S∈W_i} a_S x_S + Σ_{S∉W_i} a_S (1 − x_S) ≤ e_i   ∀ i ∈ {1, ..., k},   (3.4)

0 ≤ x_S ≤ 1   ∀ S ⊆ {1, ..., k}.   (3.5)
To use the member bounds derived from validation, set e_i = ν_i + ε for each member, and solve the linear program to error bound the committee. With confidence 1 − 2ne^{−ε²D/2}, the linear program solution is an upper bound for the committee error.

The linear program, as written, has many variables. In many cases, most of them are unnecessary. For each S, if there is no test example for which each classifier indexed by S returns 1, and each other classifier returns 0, then a_S = 0. For each such S, x_S plays no role in the linear program. Rewrite the linear program to use only the variables x_S for which a_S ≠ 0. Change all indices of summation to the intersection of the set {S | a_S ≠ 0} and the present indices. Now the linear program has no more variables than the number of test data examples. To reduce the number of variables further, note that a_S and a_S̄ count votes with the same members in the majority. Hence, coefficients and variables with complementary subscripts play the same role in the linear program. Rewrite to keep one of each pair: for each S with 1 ∈ S, let a_S = a_S + a_S̄ and x_S = x_S + x_S̄.

We have derived the following procedure to train classifiers, choose a committee using validation data, and compute a test error bound for the committee:

1. Partition the in-sample data into training and validation sets.

2. Train a pool of n classifiers using the training data.

3. Compute test error bounds for the classifiers using the validation data. Choose ε to give the desired confidence (1 − 2ne^{−ε²D/2} for Hoeffding-style bounds or 1 − nB(ε) for partition-based bounds). Then compute validation errors. For each classifier, the test error bound is the sum of its validation error and ε.

4. Use the validation data to select a committee of k classifiers from the pool. (The training data may also be used.) If feasible, simply evaluate the validation errors of all prospective committees and choose the committee with minimum validation error. If there are too many prospective committees, select one by some other search method.

5. For the committee, let e_1, ..., e_k be the member test error bounds ν_1 + ε, ..., ν_k + ε from step 3. Compute the distribution of agreements a_∅, ..., a_{1,...,k} over the test set inputs. Solve linear program 3.3, 3.4, and 3.5 for x_∅, ..., x_{1,...,k}. With confidence 1 − 2ne^{−ε²D/2} (or 1 − nB(ε) for partition-based bounds), the committee test error is no greater than Σ_S a_S x_S.
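Linear program 3.3 to 3.5 is small enough to hand to a generic solver. The following is a minimal sketch assuming NumPy and SciPy; the helper name and example values are illustrative. The usage example reproduces the three-member case analyzed in section 4, where the bound works out to 3b − 1 = 0.2 for member bounds b = 0.4.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def committee_bound(k, a, e):
    # a maps each voting pattern S (frozenset of members returning 1) to a_S.
    patterns = [frozenset(c) for r in range(k + 1)
                for c in combinations(range(1, k + 1), r)]
    a_vec = np.array([a.get(S, 0.0) for S in patterns])
    A_ub, b_ub = [], []
    for i in range(1, k + 1):
        # S is in W_i when member i votes with the majority on pattern S.
        in_Wi = np.array([(i in S) == (len(S) > k / 2) for S in patterns])
        # Constraint 3.4 rearranged: sum_{W_i} a x - sum_{not W_i} a x
        #                            <= e_i - sum_{not W_i} a.
        A_ub.append(np.where(in_Wi, a_vec, -a_vec))
        b_ub.append(e[i - 1] - a_vec[~in_Wi].sum())
    res = linprog(-a_vec, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * len(patterns))
    return -res.fun      # maximum committee error consistent with the bounds

# Three members, votes spread evenly over two-to-one splits, e_i = 0.4:
a = {frozenset(S): 1 / 6 for S in [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3)]}
print(committee_bound(3, a, [0.4, 0.4, 0.4]))    # 3 * 0.4 - 1 = 0.2
```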
4 Intuition and Analysis

The error bound for a prospective committee is a function of member error bounds and the distribution of agreements. Lower member error bounds produce a lower committee error bound. Also, as the frequency of unanimous and near-unanimous votes decreases, the committee error bound decreases, so linear programming gives low bounds for committees of intelligent and contentious classifiers.

To build intuition, examine the linear program error bound in a few cases. First, suppose all members agree on every test input (a_∅ + a_{1,...,k} = 1). Then the committee error bound is the minimum member error bound. This makes sense; we assume every member bound is correct, so when the committee agrees with all members, we use the tightest member bound for the committee. Now focus on the role of disagreement among classifiers. Suppose there are three committee members with equal error bounds. Suppose the votes are evenly distributed over voting patterns with a two-to-one split (a_1 = a_2 = a_3 = a_{1,2} = a_{1,3} = a_{2,3} = 1/6 and a_∅ = a_{1,2,3} = 0). When the member error bounds are one-third or less, the committee error bound is zero. Each vote has one-third of the members in dissent, so the errors of each member are "consumed" by votes in which the member is overruled. For member error bounds b ∈ (1/3, 1/2), the committee error bound is 3b − 1. Over this domain, the committee error bound is lower than the member error bounds. When the member error bounds are one-half, the committee error bound is one-half; random members form a random committee.

As problem size grows, there is increasing advantage in bounding committee error by linear programming. Recall

Pr{ν′_1 ≥ ν_1 + ε or ... or ν′_M ≥ ν_M + ε} ≤ 2Me^{−ε²D/2}.   (4.1)

For fixed failure probability c = 2Me^{−ε²D/2},

ε = √( 2 (ln 2M − ln c) / D ).   (4.2)

So the tightness ε of the error bound is O(√(ln M)). Validating individual classifiers to use the linear program gives O(√(ln n)). Validating committees directly gives O(√(ln C(n, k))). If k = n/2 + 1, for example, then C(n, k) ∼ √(2/(πn)) 2^n (see Feller, 1968, p. 180), and ε grows as O(√n).
If the committee size is fixed, then C(n, k) ∼ n^k / k!. In this case, the ratio of ε for direct committee evaluation to ε for linear programming goes to √k as n → ∞. As a concrete example, suppose we wish to select a committee of k = 21 classifiers from a pool of n = 100. We have 1000 validation examples and want to bound committee errors with no more than c = 5% chance of failure. Then the linear programming error bounds have tightness ε = .13, and the direct committee error bounds have tightness ε = .32.

For small problems, the linear programming method will not always give lower or more accurate error bounds than the direct method. The linear program uses relatively tight member bounds and produces the worst-case error given the distribution of agreements over the test data. If the worst-case error is very different from the actual test error, then the linear program bound gives poor information about the test error. The direct method uses the error over the validation data to estimate the error over the test data, and the estimate comes with a relatively loose bound. If the bound is too loose, then the direct method gives poor information about the actual test error.

If the test set inputs are not known, the distribution of agreements must be estimated. The in-sample data can be used for this purpose. Also, if the underlying input distribution is known, then the distribution of agreements can be estimated by random sampling of inputs. (Note that example outputs are not needed to compute the distribution of agreements.) Estimation errors will affect the solution of the linear program; the relationship between the errors and the solution can be calculated by perturbation methods (Franklin, 1980).
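The concrete numbers above follow directly from equation 4.2. A minimal sketch (assuming NumPy and SciPy; the helper name is illustrative, and taking D = 1000 from the 1000 validation examples is an assumption):

```python
import numpy as np
from scipy.special import gammaln

def tightness(log_M, c, D):
    # eps = sqrt(2 (ln 2M - ln c) / D), equation 4.2
    return np.sqrt(2 * (np.log(2) + log_M - np.log(c)) / D)

n, k, D, c = 100, 21, 1000, 0.05
log_choose = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)  # ln C(n, k)
print("linear programming:", tightness(np.log(n), c, D))     # about 0.13
print("direct validation:", tightness(log_choose, c, D))     # about 0.32
```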
5 General Classifier Bound

The linear programming method can be extended to compute a test error bound for any classifier, not just a voting committee. Let g_1, ..., g_n be classifiers trained without reference to some validation data. Select a confidence level, and hence ε. Then use the validation data to compute uniform test error bounds e_1, ..., e_n for g_1, ..., g_n using equation 2.9. Let g_0 be the classifier for which we will compute a test error bound. Classifier g_0 may be selected with reference to the validation data. Let variables x_1, ..., x_{d′} represent error "rates" of g_0 on test examples:

x_j = 0 if g_0 produces the correct output for example j,
x_j = 1 if g_0 produces the incorrect output for example j,
for all j ∈ {1, ..., d′}.   (5.1)

The test error of g_0 is (1/d′) Σ_{j=1}^{d′} x_j.
Let A_i ⊆ {1, ..., d′} indicate the set of test inputs for which g_i and g_0 agree. The test error of g_i is

(1/d′) Σ_{j∈A_i} x_j + (1/d′) Σ_{j∉A_i} (1 − x_j).   (5.2)

With confidence 1 − nB(ε), all test error bounds e_1, ..., e_n hold. Hence, the following program bounds the test error of g_0 with the same confidence:

maximize   (1/d′) Σ_{j=1}^{d′} x_j   (5.3)

over x_1, ..., x_{d′} such that

(1/d′) Σ_{j∈A_i} x_j + (1/d′) Σ_{j∉A_i} (1 − x_j) ≤ e_i   ∀ i ∈ {1, ..., n}   (5.4)

and

x_j ∈ {0, 1}   ∀ j ∈ {1, ..., d′}.   (5.5)

This is an integer linear program (ILP). It may be difficult to solve for large problems, since ILP is NP-hard (Garey & Johnson, 1979; Karp, 1972). Relaxing the integer constraints on the variables to

0 ≤ x_j ≤ 1   ∀ j ∈ {1, ..., d′}   (5.6)
produces a linear program. The feasible set of the linear program is a superset of the feasible set of the integer linear program, so the linear program solution is an upper bound. Hence, the linear program returns a valid upper bound for g_0 with confidence 1 − nB(ε). For voting committees, this linear program is equivalent to the one presented earlier, except that this program uses constraints based on all classifiers in the pool, while the earlier program uses constraints based only on committee members. Hence, this program returns a bound at least as strong as the one returned by the earlier program.
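A minimal sketch of the relaxed program 5.3 to 5.6 (assuming NumPy and SciPy; the prediction matrix, helper name, and toy bounds are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import linprog

def general_bound(preds, e):
    # preds[i, j] in {0, 1}: classifier g_i's output on test input j,
    # with row 0 holding g_0, the classifier being bounded.
    d_test = preds.shape[1]
    A_ub, b_ub = [], []
    for i, e_i in enumerate(e, start=1):
        agree = preds[i] == preds[0]             # the agreement set A_i
        # (1/d') sum_{A_i} x_j + (1/d') sum_{not A_i} (1 - x_j) <= e_i
        A_ub.append(np.where(agree, 1.0, -1.0) / d_test)
        b_ub.append(e_i - (~agree).sum() / d_test)
    res = linprog(-np.ones(d_test) / d_test,     # maximize the mean of the x_j
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0.0, 1.0)] * d_test)
    return -res.fun

preds = np.array([[1, 0, 1, 1, 0],               # g_0
                  [1, 0, 1, 0, 0],               # g_1
                  [1, 1, 1, 1, 0]])              # g_2
print(general_bound(preds, e=[0.3, 0.3]))        # 0.5 for this toy setup
```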
6 Tests

This section outlines the results of tests on a set of credit card data. Each example corresponds to a credit card user. There are six inputs that correspond to user traits. The traits are unknown because the data provider has chosen to keep them secret. There is a single output that indicates whether the
credit card user defaulted. The data were obtained from a machine-learning database site at the University of California at Irvine. The discrete-valued traits were removed, leaving the six continuous-valued traits. Of the 690 examples in the original database, 24 examples had at least one trait missing. These examples were removed, leaving 666 examples. The data were cleaned by Joseph Sill. (For further information, see Sill & Abu-Mostafa, 1997.)

In each test, the 666 examples were randomly partitioned into 444 training examples, d = 111 validation examples, and d′ = 111 test examples. In each test, a pool of classifiers was trained by early stopping. For each classifier, the training data were randomly partitioned into 400 examples used for actual training and 44 examples used for early stopping. The classifiers are artificial neural networks with six input units, six hidden units, and one output unit. The hidden and output units have tanh activation functions. The initial weights were selected independently and uniformly at random from [−0.1, 0.1]. The networks were trained by gradient descent on mean squared error over training examples, using sequential-mode weight updates with random order of example presentation in each epoch. A snapshot of the weights was recorded after each epoch. The snapshot with minimum error on the 44 early-stopping examples was returned as the trained classifier.

Partition-based uniform error bounds (see equation 2.9) with 90% confidence were used in all tests. Committees were selected by evaluating validation error over all prospective committees. The linear programs used constraints based on error bounds for all classifiers in the pool, not just committee members, as outlined in the previous section.

Table 1 shows the results of 10 tests with pools of 15 classifiers and voting committees of 7 members. Since ε = 0.172 is the minimum value for which 1 − 15B(ε) ≥ 0.90, this value was used in the uniform error bounds for pool classifiers in the linear program bound; that is, the linear program constraints used error bounds e_1 = ν_1 + 0.172, ..., e_15 = ν_15 + 0.172. For the direct bound, ε = 0.280 is the minimum value with 1 − C(15, 7)B(ε) ≥ 0.90. Hence, the direct bound is the sum of committee validation error and 0.280. Note that the linear programming bound is superior to the direct bound for every test.

Table 2 shows the results of 10 tests with pools of 10 classifiers. In each test, the odd-sized committee with minimum validation error was selected. For the linear programming bound, ε = 0.163 is the minimum value for which 1 − 10B(ε) ≥ 0.90. This value is used in the uniform bounds over classifiers in the pool. For the direct bound, ε = 0.244 is the minimum value with

1 − [ C(10, 1) + C(10, 3) + C(10, 5) + C(10, 7) + C(10, 9) ] B(ε) ≥ 0.90.   (6.1)
So this value is added to the validation error to produce the direct test error bound.
Table 1: Test Results for n = 15 Classifiers in the Pool and Committees with k = 7 Members.

Test      Committee Validation Error    LP Test Error Bound    Direct Test Error Bound
1         0.297                         0.487                  0.577
2         0.135                         0.343                  0.415
3         0.198                         0.388                  0.478
4         0.297                         0.478                  0.577
5         0.234                         0.397                  0.514
6         0.297                         0.478                  0.577
7         0.225                         0.433                  0.514
8         0.234                         0.433                  0.514
9         0.189                         0.388                  0.505
10        0.261                         0.469                  0.514
Average   0.245                         0.429                  0.525
Table 2: Test Results for n = 10 Classifiers in the Pool and Committees with Odd Numbers of Members.

Test      Committee Validation Error    LP Test Error Bound    Direct Test Error Bound
1         0.225                         0.388                  0.469
2         0.198                         0.379                  0.442
3         0.216                         0.379                  0.460
4         0.234                         0.433                  0.478
5         0.225                         0.388                  0.469
6         0.198                         0.361                  0.442
7         0.180                         0.343                  0.424
8         0.216                         0.406                  0.460
9         0.180                         0.352                  0.424
10        0.207                         0.397                  0.451
Average   0.208                         0.383                  0.452
Once again, note that the linear programming bound is superior to the direct bound for every test.

7 Discussion

We have developed an algorithm to compute test error bounds for voting committees through linear programming. We extended the method to compute error bounds for any classifier, using constraints based on uniformly bounded classifiers. The extended method applies to classifiers constructed by stacking, or fusion of underlying classifiers. The extended method also applies to the classifier chosen by early stopping, through the following process. Partition the in-sample data into training and validation sets. Choose an initial classifier at random, and use an iterative method to fit the classifier
to the training data. As the classifier evolves during training, record a sequence of snapshots. Select a subset of the snapshots without reference to the validation data, and uniformly bound their test errors using the validation data and equation 2.9. Now, identify the snapshot with minimum validation error. It will be delivered as the result of training. To bound its test error, use the linear program with constraints provided by the uniformly bounded classifiers. For suggested methods of choosing the constraint classifiers, see Bax, Cataltepe, and Sill (1997).

The linear program yields strong bounds when the constraint classifiers have low error rates, so the method works best with trained constraint classifiers. Disagreement among constraint classifiers encourages strong bounds. Disagreement among classifiers is also a condition for improving performance through stacking, or fusion, so the linear programming bound is a natural fit for these classifiers. Also, having one or more constraint classifiers with high rates of agreement with the classifier being error bounded yields strong bounds. This occurs when the constraint classifiers are drawn from the training sequence (Bax et al., 1997) to bound the classifier chosen by early stopping. It also occurs for stacking and fusion, both through data fitting and by design (Breiman, 1992).

This article focused on binary classification problems. It would be interesting to extend the error bounding method to other problem types, for example, regression problems.

Acknowledgments

Thanks to Joel Franklin for advice, teaching, and encouragement. Thanks to Yaser Abu-Mostafa for teaching, and to Amir Atiya, Zehra Cataltepe, Malik Magdon-Ismail, Alexander Nicholson, Sam Roweis, Joseph Sill, and Xubo Song for useful pointers and discussions.

References

Abu-Mostafa, Y. (1996). What you need to know about the VC inequality. Class notes from CS156, California Institute of Technology.
Bax, E., Cataltepe, Z., & Sill, J. (1997). A new error bound for the classifier chosen by early stopping. In Proceedings of the IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (pp. 811–814). Victoria, B.C., Canada.
Breiman, L. (1992). Stacked regressions (Tech. Rep. No. 367). Berkeley: Statistics Department, University of California at Berkeley.
Feller, W. (1968). An introduction to probability theory and its applications. New York: Wiley.
Franklin, J. (1980). Methods of mathematical economics. New York: Springer-Verlag.
Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: A guide to the theory of NP-completeness. New York: W. H. Freeman.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Am. Stat. Assoc. J., 58, 13–30.
Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller & J. W. Thatcher (Eds.), Complexity of computer computations (pp. 85–103). New York: Plenum Press.
Kim, K., & Bartlett, E. B. (1995). Error estimation by series association for neural network systems. Neural Computation, 7, 799–808.
LeBlanc, M., & Tibshirani, R. (1996). Combining estimates in regression and classification. Journal of the American Statistical Association, 1641–1650.
Sill, J., & Abu-Mostafa, Y. (1997). Monotonicity hints. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 634–660). Cambridge, MA: MIT Press.
Sridhar, D. V., Seagrave, R. C., & Bartlett, E. B. (1996). Process modeling using stacked neural networks. AIChE Journal, 42(9), 2529–2539.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Vapnik, V. N., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl., 16, 264–280.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241–259.

Received January 27, 1997; accepted August 19, 1997.
LETTER
Communicated by Thorsteinn Rognvaldsson
Toward Optimally Distributed Computation Peter J. Edwards Alan F. Murray Department of Electrical Engineering, Edinburgh University, EH9 3JL, UK
This article introduces the concept of optimally distributed computation in feedforward neural networks via regularization of weight saliency. By constraining the relative importance of the parameters, computation can be distributed thinly and evenly throughout the network. We propose that this will have beneficial effects on fault-tolerance performance and generalization ability in large network architectures. These theoretical predictions are verified by simulation experiments on two problems: one artificial and the other a real-world task. In summary, this article presents regularization terms for distributing neural computation optimally.

1 Introduction

Optimally distributed computation leads to fault tolerance and follows the spirit of Occam's razor even in large networks to yield good generalization ability. In neural network models, computation is more or less distributed in the weighted connections. However, the "distributedness" is in no way optimal, and poor performance, in terms of fault tolerance, is not unusual. We have shown that poor generalization ability can arise in a similar manner. Fault tolerance is optimized if the computation is spread thinly and evenly throughout the network (Murray & Edwards, 1994; Edwards & Murray, 1996a). Generalization ability requires that the computational complexity of a network be matched with that required by the task at hand. These are two of the key questions in neural computation that remain open, despite advances in both theoretical (see MacKay, 1992, for example) and empirical methods. In addition, some complex real-world tasks will require custom-designed hardware. In that context, fault tolerance and optimal utilization of resources are key to a robust and usable solution. This article presents regularization terms for distributing neural computation optimally.

Fault tolerance in a distributed system such as a neural network model is limited by the critical path through a network. Conventional learning algorithms configure the network weights to be distributed, although nonoptimally, to produce a solution that is not inherently fault tolerant (Carter, 1988; Bolt, 1992). In a network where the computation is optimally distributed, weight saliency is spread thinly and evenly throughout the network, such that, in the ideal case, no one path through the network is more critical than any other.

Neural Computation 10, 987–1005 (1998)
© 1998 Massachusetts Institute of Technology
In this article, we present penalty terms for the regularization of weight saliency and hence achieve fault tolerance. A number of fault-tolerance enhancement schemes have been proposed (see Edwards & Murray, 1996a, for a full discussion). Of the more interesting work, Sequin and Clay (1990) train with injected "stuck-at" faults during learning to imbue the network with an ability to withstand such faults that occur during operation. Similar in spirit to the work presented here, Neti, Schneider, and Young (1992) describe an optimization technique with an incorporated cost function penalty to achieve fault tolerance for single stuck-at faults. We have also studied training with weight-noise (Murray & Edwards, 1994; Edwards & Murray, 1996a) and showed that by incorporating weight-noise into the learning process as a fault-tolerance hint, the resulting solution is more likely to have a smooth solution locus. In other words, solutions are preferred where a small change in the weights produces only a small change in the output and, therefore, the error. This can be thought of as a general extension of Sequin and Clay's work, which aims to obtain fault tolerance for small-perturbation faults that we have found to be a more realistic model of hardware errors (Edwards & Murray, 1995). We will return to the important issue of what constitutes a realistic fault model in the following section.

We are particularly interested in large networks, much larger than conventional architectures, for the following reasons (discussed also by Neal, 1996):

1. The universal approximation capability of neural models is exhibited only if the network has adequate degrees of freedom or parameters (see Funahashi, 1989, for example). Therefore, a larger network is more likely than a small one to be capable of representing the true underlying function that has generated a real data set. This inevitably raises the specter of "overfitting" and generalization ability. We will return to these, noting that a large, constrained network does not necessarily have the ability to overfit. It is a particular form of constraint that forms the substance of this article.

2. Large networks have a greater potential for fault tolerance. If it can be arranged that computation is spread thinly and evenly, then we can infer that larger networks have a greater potential for fault tolerance.

3. Generally, hardware implementations have a fixed number of synapses, and all of these must be assigned values, which may be zero. Conventionally, "surplus" weights are set to zero and therefore contribute nothing except noise to the output. Here we suggest that all the available synapses should be incorporated into a large, nonminimal network and this used to provide the greater potential for fault tolerance.

Optimal generalization occurs when the available complexity in the network
is matched to that required to model the true function. This balancing act has been described in various ways, such as "the bias-variance dilemma" (Geman, Bienenstock, & Doursat, 1992) and Occam's razor (Bishop, 1995a), where the most common interpretation is to choose the simplest architecture that will provide a solution. One practical approach to this dilemma is to favor parsimony by restricting the complexity of the model adopted by the network. This group of techniques makes use of regularization constraints to control model complexity, starting with a network that has the potential for being more complex than required. The class of functions that it can model is then limited by regularization constraints. The most common form of constraint is weight decay (Hinton & Sejnowski, 1986), which reappears within the Bayesian framework as the effect of a gaussian prior (MacKay, 1992; Neal, 1996). Roughness penalties, used to smooth the network function, are an alternative example. This approach achieves a parsimonious model, reducing the scope of the possible functions that a network can model (Bishop, 1990). Again, there is an equivalent within the probabilistic regime (Buntine & Weigend, 1991), where Buntine and Weigend show how a quadratic smoothness penalty equates to a prior distribution in the gamma family. Perhaps more directly related to the idea of complexity regularization is the approach of Barron and Cover (1991) in their work with complexity terms applied to minimum encoding methods. Here we incorporate a complexity measure as a penalty term that regularizes weight saliency. By limiting the individual and collective importance of the weights, we achieve improved generalization in very large networks.

In a related article, Hochreiter and Schmidhuber (1997) propose an algorithm for finding neural networks that have their solution in a flat area of the solution space, with the express aim of lowering overall complexity. Building on the theory of a minimum description length-based Bayesian approach, they argue that "flat" solutions will correspond to low-complexity networks with good generalization ability. A subsequent empirical investigation shows that their algorithm does indeed produce networks with a flat solution locus and good generalization ability. Their algorithm builds on a number of approaches, including our earlier work (Murray & Edwards, 1994), by introducing two extra cost function penalties, or flatness conditions. The first is equivalent to one of the terms we implement and investigate in this article and is effectively a fault-tolerance term. Their second penalty is one that regularizes the variance in flatness in different directions at the solution. There are therefore several areas of commonality with our earlier work and that reported in this article. The difference between the approach in Hochreiter and Schmidhuber (1997) and in this article is that we interpret the fault-tolerance penalty in terms of distributed computation. We show that computation can be spread over a large number of parameters in expanded networks; while Hochreiter and Schmidhuber include pruning stages in their algorithm, we expand the network and then regularize the overcomplexity
to achieve extra fault-tolerance potential and generalization ability. In addition, we investigate a second-order penalty in the same family and show that it can be even more effective as a regularizer of weight saliency. While introducing a variance penalty is justified by Hochreiter and Schmidhuber, and intrinsically interesting, we have taken a more pragmatic approach. We show that we can achieve good improvement without the added complexity, and computational overhead, associated with a variance term.

This article describes a technique for constraining weight saliency to enhance fault tolerance and generalization performance. A fault model is presented that is realistic for implementations of neural network hardware. Following this, we consider network augmentation, a technique for obtaining weights in very large network architectures. Based on these ideas, penalty terms are introduced and simulation experiments carried out on two classification problems, one artificial, the other a real-world task. Therefore this article presents constraints regularizing weight saliency to achieve optimally distributed computation.

2 A Fault Model for Hardware Errors

In much of the work previously carried out in the area of fault tolerance in neural networks, performance has been assessed in terms of "single-stuck-at" faults (Sequin & Clay, 1990; Neti et al., 1992). A stuck-at fault simply fixes a point in a circuit or network at a binary 1 or 0 and assesses the effect of this fault on the circuit or network's performance. This crude form of fault model has its roots in the field of digital chip testing, where test vectors derived to expose stuck-at faults have been found historically to be satisfactory in testing against real physical circuit and fabrication faults. The unrealistic nature of the fault model does not, surprisingly, seem to matter in digital systems, where cross-coupling faults, multiple faults, and other complicated fault mechanisms are common. However, it is clear that such a fault model is unrealistic for most current implementations of neural networks, particularly analog implementations. While such extreme faults may occasionally occur, it is clear that within normal working conditions, such faults will be uncommon and thus form a poor criterion against which to assess fault tolerance. Here we consider some of the more common areas of hardware error and substantiate the above statements, proposing our own small-perturbation fault model. For a more detailed discussion on analog VLSI implementations see Lehmann (1994) and Edwards and Murray (in press), or for the digital case, Edwards and Murray (1996a).

Errors occur at many levels of hardware implementation. In analog VLSI, it has been suggested (Lehmann, 1994) that computational hardware is inherently limited to a relative precision of about 1%. This will result from a combination of factors, such as process variation. This leads to component mismatch and changes in threshold voltages and other parameters between
chips or on the same chip. At a higher level, decisions must be made about circuit and system-level errors. For digital networks, a bit resolution must be chosen. For analog VLSI, the reliability of the weight storage mechanism is one of the most important issues. In addition, there are many questions that apply to all implementations, such as the accuracy required for functionality. Many hardware functions can only approximate a required operation, and a trade-off may be necessary between functional accuracy and, for example, silicon area. It is clear, therefore, that there is a level of unavoidable hardware error. The highly parallel nature and the need for cell replication (synaptic multipliers, for example) in neural network implementations exacerbate the problem. All the errors discussed here will inevitably occur and will be of the form of small signal variation (small, but potentially deleterious, weight errors, or output offsets in neural network parlance). This article therefore concentrates on this form of error: small-perturbation fault tolerance rather than stuck-at faults.

3 Fault Tolerance via Augmentation

One technique that has been proposed to enhance fault tolerance is known as network augmentation (Emmerson, 1992; Phatak & Koren, 1995). This process involves expanding the hidden layer, posttraining, replicating hidden-input weights N times and scaling the output-hidden weights by a hidden-layer expansion factor of 1/N. In so doing, the augmented network has been shown to have a greater fault tolerance than the original network for a stuck-at-zero fault model. The reasons are intuitively obvious. Augmentation is shown diagrammatically in Figure 1.

This technique is based on two premises: (1) by increasing the size of a network, its potential for fault tolerance is also increased, and (2) weight saliency is proportional to weight magnitude (Emmerson, 1992; Phatak & Koren, 1995). The first premise is correct, but we believe that the second is misguided, as it assumes that small weights are intrinsically less important than large weights, without taking into account any information about input and output variables. This simple approximation performs poorly in practice, as we will show later. Bishop (1995a) describes a better, if approximate, measure of weight saliency as

δE = (1/2) Σ_i H_ii δw_i²,   (3.1)
where H_ii is the diagonal component of the error Hessian matrix. In other words, weight saliency is proportional to the curvature of the solution locus with respect to the weight in question. Although this ignores perhaps significant off-diagonal terms in the Hessian (Hassibi & Stork, 1993), it is more accurate than a magnitude-based saliency measure while still being feasible to implement in the framework we explain in the following section.
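As a sketch of how this saliency measure might be computed in practice (assuming NumPy; the finite-difference estimate of the Hessian diagonal and the helper names are illustrative choices, not the authors' implementation):

```python
import numpy as np

def hessian_diagonal(error_fn, w, h=1e-4):
    # Central second differences: H_aa ~= (E(w+h) - 2 E(w) + E(w-h)) / h^2.
    E0 = error_fn(w)
    diag = np.empty_like(w)
    for a in range(len(w)):
        dw = np.zeros_like(w)
        dw[a] = h
        diag[a] = (error_fn(w + dw) - 2.0 * E0 + error_fn(w - dw)) / h ** 2
    return diag

def saliency(error_fn, w, delta_w):
    # delta_E ~= 0.5 * H_aa * delta_w_a^2, per weight, as in equation 3.1.
    return 0.5 * hessian_diagonal(error_fn, w) * delta_w ** 2
```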
Figure 1: Augmentation of a network with two inputs, four hidden units, and two outputs. Hidden units are replicated, and connecting weights are adjusted accordingly.
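A minimal sketch of the augmentation step itself (assuming NumPy; the bias-free units and the random values are illustrative simplifications, with the 2-4-2 shapes taken from Figure 1). Replication leaves the network function unchanged while spreading it over more synapses.

```python
import numpy as np

def augment(W_hi, W_oh, N):
    """W_hi: (n_hidden, n_in) hidden-input weights;
    W_oh: (n_out, n_hidden) output-hidden weights.
    Replicate hidden units N times; scale output weights by 1/N."""
    return np.tile(W_hi, (N, 1)), np.tile(W_oh, (1, N)) / N

rng = np.random.default_rng(1)
W_hi, W_oh = rng.normal(size=(4, 2)), rng.normal(size=(2, 4))
W_hi3, W_oh3 = augment(W_hi, W_oh, N=3)          # 2-12-2 network
x = rng.normal(size=2)
same = np.allclose(W_oh @ np.tanh(W_hi @ x), W_oh3 @ np.tanh(W_hi3 @ x))
print(same)                                      # True: the function is unchanged
```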
Augmentation also highlights the limits of the architectural expansion. In the introduction to this article, the use of large, nonminimal networks was justified by the simple intuition that large networks have the greatest potential for fault tolerance. This is strictly true only for an ideal task that can be spread over an infinite array of computing elements. In practice, there will be a limit imposed by the amenability of the task to distribution. This limit, perhaps inherent in the nature of the task or its implementation, may become evident as the network grows. In our previous work on noisy
training, the limit became evident when using the small perturbation fault model, where the increasing number of noisy components in the model had an additive effect as the size of the model increased. At some point, a balance was reached between the effect of an ever-increasing level of noise and the potential fault-tolerance gains due to architectural expansion (Edwards & Murray, in press). The use of large networks is therefore a useful technique by which the potential for fault-tolerance performance can be increased. However, it is a nontrivial task to find the optimally fault-tolerant weight configuration for expanded networks, and conventional techniques perform poorly in practice. In the following section, we propose regularization terms that incorporate a correct measure of weight saliency and show how they produce optimally distributed computation in large network architectures.

4 Smoothing the Solution Locus

Distributed computation can be achieved by smoothing the solution locus. Since the error Hessian measures weight saliency, its regularization will achieve distributed computation by smoothing the curvature of the solution locus. In addition, by viewing the smoothness of the solution locus as a route to distributed computation, we can introduce smoothness constraints that are amenable to implementation to distribute the computation optimally. In this section, therefore, we build on our previous work and show how weight-noise can be viewed as a smoothness constraint. We then propose two deterministic approaches to smoothing, one equivalent to using weight-noise under ideal conditions and the other a more statistically feasible solution. When analyzing weight-noise in multilayer perceptron (MLP) models, it is useful to use a Taylor expansion. For the general (additive weight-noise)¹ case we expand the output error to second order,

$$\epsilon(\mathbf{w}, \boldsymbol{\xi}) = \epsilon(\mathbf{w}) + \sum_{a} \frac{\partial \epsilon(\mathbf{w})}{\partial w_a}\, \xi_a + \frac{1}{2} \sum_{a,b} \frac{\partial^2 \epsilon(\mathbf{w})}{\partial w_a\, \partial w_b}\, \xi_a \xi_b, \qquad (4.1)$$
where ε(w, ξ) is the error for a network whose weight vector, w, is perturbed by some noise vector, ξ, and ξ_a is the perturbation on parameter w_a, that is, any weight. Let the noise source be stationary, have zero mean, and be uncorrelated with equal variance across the weights, that is,

$$\langle \xi_a \rangle = 0, \quad \text{and} \quad \langle \xi_a \xi_b \rangle = \xi^2 \delta_{ab}. \qquad (4.2)$$
¹ For the case of multiplicative weight-noise, the equation is similar but with extra terms denoting the weights associated with each noise component in each part of the expansion. See Edwards and Murray (1996a) for the full derivation.
Taking the time average, therefore, the noise-augmented error function becomes

$$\left\langle \epsilon(\mathbf{w}, \boldsymbol{\xi}) \right\rangle = \langle \epsilon(\mathbf{w}) \rangle + \frac{\xi^2}{2} \sum_{a} \left\langle \frac{\partial^2 \epsilon(\mathbf{w})}{\partial w_a^2} \right\rangle; \qquad (4.3)$$

that is, the increase in the cost function error due to weight-noise is equal to a factor proportional to the error Hessian. We now consider the particular case of least-squares training, where the error Hessian becomes

$$\frac{1}{2} \sum_{k,p} \frac{\partial^2}{\partial w_a^2} \left( y_{kp} - t_{kp} \right)^2 = \sum_{k,p} \left\{ \left( \frac{\partial y_{kp}}{\partial w_a} \right)^2 + \left( y_{kp} - t_{kp} \right) \frac{\partial^2 y_{kp}}{\partial w_a^2} \right\}, \qquad (4.4)$$

where y_kp is component k of the K components of the output vector y for pattern p of the data, and t_kp is the respective target output value. Training with least squares will lead the network to a solution at the conditional average of the target data (Ruck, Rogers, Kabrisky, Oxley, & Suter, 1990; Edwards & Murray, 1996b). Given sufficient training examples, at that point the residual errors in the second term of equation 4.4 average to zero, and the error function becomes

$$\left\langle \epsilon(\mathbf{w}, \boldsymbol{\xi}) \right\rangle = \langle \epsilon(\mathbf{w}) \rangle + \frac{\xi^2}{2} \sum_{a,k,p} \left( \frac{\partial y_{kp}}{\partial w_a} \right)^2. \qquad (4.5)$$
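As a hedged numeric illustration of equation 4.3 (our own construction, not from the article), the following sketch uses a single linear unit, for which the least-squares error is exactly quadratic in the weights, and compares the Monte Carlo increase in error under weight-noise with the Hessian-trace prediction.

```python
# A numeric check of equation 4.3 (our own illustration): for zero-mean,
# uncorrelated weight noise of variance xi**2, the mean increase in the
# error is (xi**2 / 2) times the trace of the error Hessian. A single
# linear unit makes the least-squares error exactly quadratic in w.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                # toy inputs, bias folded in
t = rng.normal(size=50)                     # toy targets
w = rng.normal(size=3)

def err(w):
    return 0.5 * np.mean((X @ w - t) ** 2)  # least-squares error (eq. 4.4 setting)

# Diagonal of the error Hessian by central finite differences.
eps = 1e-4
H_diag = np.array([(err(w + eps * e) - 2 * err(w) + err(w - eps * e)) / eps**2
                   for e in np.eye(3)])

xi = 0.05                                   # noise standard deviation
W_noisy = w + xi * rng.normal(size=(200_000, 3))
errs = 0.5 * np.mean((W_noisy @ X.T - t) ** 2, axis=1)
print(errs.mean() - err(w))                 # empirical mean increase
print(0.5 * xi**2 * H_diag.sum())           # prediction of equation 4.3
```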
This equation shows that training with weight-noise in effect changes the error function to include this extra penalty term. The term has been discussed in detail in our earlier work (Edwards & Murray, 1996a) and can be thought of as a mechanism for obtaining distributed computation. It has also been implemented as one of a pair of terms integral to the "flat minima search" algorithm (Hochreiter & Schmidhuber, 1997). It is important to note at this stage that the weight-noise analyzed above could equally have been input-noise, and a similar error function produced (Bishop, 1995b). The extra term in the input-noise error function is one of a series of terms known as Tikhonov regularizers. Such terms are commonly incorporated directly into the learning to smooth the network function and are of the form

$$\Omega = \sum_{i,k,p} \left( \frac{\partial^m y_{kp}}{\partial x_i^m} \right)^2, \qquad (4.6)$$

where the Tikhonov term, Ω, is of order m and x_i is component i of the I components of the input vector x. These terms are then incorporated into the learning cost function,

$$\tilde{\epsilon} = \epsilon + \lambda \Omega, \qquad (4.7)$$
where λ controls the influence of the regularization term in the overall error reduction. Comparing our noise-induced penalty term in equation 4.5 with this generalized Tikhonov regularizer, it is clear that equation 4.5 is also a regularization term (with m = 1), although of the weight-output relationship rather than the input-output relationship. It is therefore a smoother of the solution locus as opposed to the network function. In addition, statistical theory suggests that for a Tikhonov regularizer, m = 2 is the lowest order at which any visual smoothness will be seen (Seber & Wild, 1989). In this article, we consider two penalty terms for fault tolerance that are smoothness regularizers of this kind, although smoothers of the solution locus rather than the network function. They are of the form

$$\Omega_{FT} = \sum_{a,k,p} \left( \frac{\partial^m y_{kp}}{\partial w_a^m} \right)^2 \qquad (4.8)$$
and of order m = 1 and m = 2. We shall subsequently refer to these roughness penalties as M1 and M2, respectively. In the following sections, two application problems, one artificial and the other a real-world task, are incorporated into simulation experiments using MLPs trained with the roughness penalties.

5 Simulation Experiments

The experiments described here assess the regularization constraints defined above and are based on two data sets. The first problem is an artificial, two-class classification task in two dimensions. Samples for the two classes are drawn from circular normal distributions with means of 0.2 and 0.3 and variances of 0.05 and 0.15, respectively. The task is depicted in Figure 2. In these experiments, 200 patterns are used as the training set and a further 9800 for testing. Initially, the network architecture had an I:J:K structure of 2:4:2, that is, four sigmoidal hidden units, with the outputs encoded to classify the two classes as one out of two. This task requires the network to approximate the Bayesian optimal decision boundary, which in this case is a circle (see Figure 2). This is a difficult task, especially for the MLP networks with sigmoidal neurons (perhaps inappropriately) used here. The initial four hidden units were expanded to 16 to draw out the effects within large networks. The second test problem, link admission control, is a real task using real data. Link admission control is a function of an asynchronous transfer mode (ATM) communications network router. The task is to decide whether a particular link in the network can accept a new connection. An estimate of the probability of losing a data packet is used to make the decision. This estimate is based on the current load present on the link and traffic parameters characterizing the new connection. The neural solution has been developed by Nordström
[Figure 2 plot: Dimension 1 versus Dimension 2, showing samples of class1 and class2 and the Bayes boundary.]
Figure 2: An artificial two-dimensional classification problem with two normally distributed classes and the optimal Bayes decision boundary.
and Gällmo at Uppsala University in Sweden (Nordström, Gällmo, Asplund, Gustafsson, & Eriksson, 1992; Gällmo, Nordström, Gustafsson, & Asplund, 1993). Here 9 input fields are used to classify a link as admissible or not on the basis of the probability of loss. We use a one-out-of-two encoding of the two output classes and, initially, six hidden units in an MLP network with sigmoidal neurons. The hidden layer was later increased to 16 units for experiments with expanded networks. The data include 500 training vectors and an additional 4000 test vectors. For training, a simple steepest-descent technique with an incorporated line search was used and was applied until a minimum error was reached on a validation set of 2000 patterns for the two-class problem and 1000 patterns for the link admission task. For the experiments carried out for this article, the mean square error (MSE) was used as our cost function error, making the (generally naive) assumption that the MSE minimum will be equivalent to the classification minimum.
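For concreteness, the following is a minimal sketch of least-squares training with an M1-style penalty (equation 4.8 with m = 1) on a single sigmoidal unit, where the penalty and its gradient stay analytic. The paper's experiments use full MLPs with a line search; the data, constants, and names below are illustrative assumptions.

```python
# Minimal sketch of least-squares training with an M1-style penalty
# (equation 4.8, m = 1) on a single sigmoidal unit, so that the penalty
# gradient remains analytic. Data, lambda, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                   # toy patterns, bias folded in
t = (X[:, 0] + X[:, 1] > 0).astype(float)       # toy binary targets

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def loss_and_grad(w, lam):
    y = sigmoid(X @ w)
    d1 = y * (1.0 - y)                          # dy/du
    mse = 0.5 * np.mean((y - t) ** 2)
    mse_grad = X.T @ ((y - t) * d1) / len(X)
    # M1 penalty: sum over patterns and weights of (dy/dw_a)**2,
    # which for one unit collapses to sum_p d1_p**2 * ||x_p||**2.
    x2 = (X ** 2).sum(axis=1)
    omega = np.sum(d1 ** 2 * x2)
    d2 = d1 * (1.0 - 2.0 * y)                   # d(d1)/du
    omega_grad = X.T @ (2.0 * d1 * d2 * x2)
    return mse + lam * omega, mse_grad + lam * omega_grad

w = rng.normal(size=3)
for _ in range(2000):                           # plain steepest descent
    loss, g = loss_and_grad(w, lam=1e-4)
    w -= 0.5 * g
print("final penalized loss:", loss)
```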
6 Results

This section presents the results of experiments carried out on the two tasks. For each, the fault tolerance and generalization ability are assessed without the penalty terms M1 and M2 and then including them with increasing values for λ₁ and λ₂. Fault tolerance is measured in terms of the mean error Hessian (H), where

$$H = \frac{1}{PK} \sum_{a,k,p} \frac{\partial^2 \epsilon(\mathbf{w})}{\partial w_a^2}. \qquad (6.1)$$

In addition, all the results presented are averaged over 30 independently trained networks, all launched from a random position in weight space.

6.1 Experiment 1: Assessment of Augmentation. The first experiment carried out was to augment an ensemble of 30 networks trained to solve the artificial two-class problem and the link admission task by expanding the hidden layer as described in section 3. The augmented networks were then assessed for fault-tolerance performance² using the mean error Hessian measure—that is, a small perturbation fault model. The results are given in Figures 3 and 4, respectively. As augmentation increases, the networks exhibit a smoother solution locus and, hence, better fault tolerance. However, at higher levels, the mean error Hessian measure begins to rise, indicating that the arguments behind the definition of weight saliency that underlie this technique begin to break down. Although this method reportedly performs well for stuck-at fault models, here it is less successful, indicating that this form of simple augmentation is unlikely to improve the tolerance of real VLSI neural networks to probable circuit faults. The following two sections discuss the experiments carried out with the M1 and M2 regularization terms for the two applications described above.

6.2 Experiment 2: Artificial Two-Class Problem. In this section we discuss experiments carried out on the artificial, difficult two-class problem to assess the performance of the M1 and M2 regularization terms. The results are presented in Figures 5 and 6, showing the effect of these terms on fault tolerance and generalization ability, respectively. It is clear that the M1 and M2 weight saliency regularizers perform well in reducing the mean curvature of the solution locus and can be seen to increase fault tolerance. Figure 5 shows that for this task, there is a natural limit of perhaps 0.04 in terms of the mean error Hessian measure for networks with even very large (J = 16) numbers of hidden units. This figure for the lowest achievable mean Hessian (and therefore the maximally distributed solution) compares favorably with even the "best" augmented network in Figure 3, indicating that true weight saliency regularization is far more successful in the presence of a realistic fault model.

² Note that the augmentation technique does not change the functionality of the network, but rather the parameterization, and hence generalization ability remains constant.
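A sketch of how the mean error-Hessian measure of equation 6.1 can be evaluated numerically is given below. It uses the least-squares form of equation 4.4 with the residual term dropped, takes the output derivatives by finite differences, and assumes a toy two-layer network; none of this is the authors' code.

```python
# Hypothetical numeric evaluation of the mean error-Hessian measure of
# equation 6.1, using the least-squares form of equation 4.4 with the
# residual term dropped: H = (1/PK) * sum over a, k, p of (dy_kp/dw_a)**2.
# Output derivatives are taken by central finite differences for brevity.
import numpy as np

rng = np.random.default_rng(2)
P, N, J, K = 100, 3, 4, 2                  # patterns, inputs, hidden, outputs
X = rng.normal(size=(P, N))
params = [rng.normal(size=(N, J)) * 0.5,   # hidden-input weights
          rng.normal(size=(J, K)) * 0.5]   # output-hidden weights

def forward(params, X):
    W1, W2 = params
    return np.tanh(np.tanh(X @ W1) @ W2)   # (P, K) network outputs

def mean_hessian(params, X, eps=1e-5):
    total = 0.0
    for W in params:                       # loop over every weight w_a
        for idx in np.ndindex(W.shape):
            W[idx] += eps
            y_plus = forward(params, X)
            W[idx] -= 2 * eps
            y_minus = forward(params, X)
            W[idx] += eps                  # restore the weight
            dy_dw = (y_plus - y_minus) / (2 * eps)
            total += np.sum(dy_dw ** 2)    # sum over outputs and patterns
    return total / (P * K)

print("mean error-Hessian measure:", mean_hessian(params, X))
```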
[Figure 3 plot: mean Hessian measure versus number of hidden units (J).]
Figure 3: Variation in the mean error-Hessian (see equation 6.1) for augmented networks trained on the artificial two-class problem. The results are collated for 30 different networks, and the mean is shown along with error bars at two times the standard deviation.
It is interesting to note that increased numbers of hidden units do not produce ever smoother solutions and ever more distributed computation for this task. Sixteen hidden units offer a potential improvement over the original four, but the results show that the potential cannot be realized. Clearly, no realistic task can ever yield to a totally smooth solution; some curvature must exist. It is believed that the figure of 0.04 represents the minimal level of roughness that is consistent with a true solution to the classification problem. Despite this, it is encouraging to note that for both network ensembles (J = 4 and J = 16), a significant improvement is seen over a range of regularization coefficients. The generalization ability of trained networks with the M1 and M2 regularizers included is compared in Figure 6. These results are more equivocal, although a number of important observations can be made: 1. The effects of regularization can be viewed as occurring in three phases, most clearly illustrated by the J = 16 curve for the M1 regularizer (see Figure 6). At zero regularization, some overfitting is certain—minimized but not eradicated by the validation scheme used to stop training. As regularization increases, the resulting network models
[Figure 4 plot: mean Hessian measure versus number of hidden units (J).]
Figure 4: Variation in the mean error-Hessian (see equation 6.1) for augmented networks trained on the link admission task. The results are collated for 30 different networks, and the mean is shown along with error bars at two times the standard deviation.
[Figure 5 plots: mean Hessian measure versus M1 coefficient (left) and M2 coefficient (right), for J = 4 and J = 16.]
Figure 5: The effect of the M1 and M2 smoothness constraints on fault tolerance (inversely proportional to the error Hessian measure [see equation 6.1]) for networks trained on the artificial two-class problem. Mean results for two network architectures are shown, where the hidden layer size J = 4 and J = 16, along with error bars at two times the standard deviation.
[Figure 6 plots: generalization error (%) versus M1 coefficient (left) and M2 coefficient (right), for J = 4 and J = 16.]
Figure 6: The effect of the M1 and M2 smoothness constraints on generalization ability for networks trained on the artificial two-class problem. Mean results for two network architectures are shown, where the hidden layer size J = 4 and J = 16, along with error bars at two times the standard deviation.
will achieve some optimal complexity, matching that required by the task, and generalization error reaches a minimum. Overregularization then takes place and the network is no longer sufficiently flexible to meet the needs of the problem. Overregularization occurs when the network has the potential to provide the complexity required by the task but is prevented from demonstrating it by too-strict regularization. The three-stage process is generally corroborated by the results and indicates that it is difficult to choose optimal levels for the M1 and M2 coefficients. This is especially the case where there are multiple minima. 2. The introduction of more hidden units, and hence network complexity, is desirable for this problem, which benefits from a larger number of hidden-unit decision boundaries when approximating the optimal (circular) Bayes boundary. Even with no complexity constraint, the benefit of a larger number of hidden units is clear. As regularization is introduced, the generalization ability improves further as the regularizer removes some residual overfitting that escapes the validation process. 3. We can compare the M1 and M2 constraints to assess the trade-off between fault-tolerance enhancement and generalization ability. Clearly the choice of regularization coefficient is more critical in terms of generalization performance. For the M2 term, however, particularly at J = 16, the choice appears less critical, and an optimal choice is easier to make.
[Figure 7 plots: mean Hessian measure versus M1 coefficient (left) and M2 coefficient (right), for J = 6 and J = 16.]
Figure 7: The effect of the M1 and M2 smoothness constraints on fault tolerance (inversely proportional to the error Hessian measure [see equation 6.1]) for networks trained on the link admission task. The mean results for two network architectures are shown, where the hidden layer size J = 6 and J = 16, along with error bars at two times the standard deviation.
Therefore, it is clear that for this problem, both fault tolerance and generalization ability are enhanced by incorporating the M1 and particularly the M2 complexity constraints. In addition, the introduction of extra hidden units is beneficial, and the gains can be maximized using these regularization terms. 6.3 Experiment 3: The ATM Link Admission Task. This final experiment was carried out using the link admission task and real data. Network ensembles were trained as before and results for the fault tolerance and generalization ability of the solutions are shown in Figures 7 and 8, respectively. It is clear that a trade-off is also taking place here. The regularization terms produce significant improvements in fault-tolerance performance (see Figure 7). As before, the introduction of more hidden units does not produce a dramatically different result in terms of curvature of the solution locus. There again seems to be a fundamental limit to improvements in the distributedness of the solution. However, the fault-tolerance enhancement for both network architectures is both striking and significant. Figure 8 also shows that the three-stage process defined above again appears to take place: performance is enhanced, before it is degraded as the network loses the ability to capture the inherent curvature in the task. It is important that we highlight here the effect of an increased number of hidden units in the network architecture. With no regularization, the extra complexity has a deleterious effect on both performance metrics. However, when the M1 and M2 complexity constraints are introduced, the peak performances of network ensembles (J = 6 and J = 16) are approximately
[Figure 8 plots: generalization error (%) versus M1 coefficient (left) and M2 coefficient (right), for J = 6 and J = 16.]
Figure 8: The effect of the M1 and M2 smoothness constraints on generalization ability for networks trained on the link admission task. The mean results for two network architectures are shown, where the hidden layer size J = 6 and J = 16, along with error bars at two times the standard deviation.
equal. This is encouraging, as it shows that the introduction of a high level of available complexity can be controlled by the regularization. In the previous artificial and highly nonlinear task, the introduction of more hidden units had a positive effect on generalization ability. In the ATM task, the difference is slight, and we can infer that six hidden units are sufficient to solve this problem. We can also see that the M2 penalty provides a greater degree of control and better generalization ability than does the M1 term over a wider range of coefficients, although the difference is less marked than before. Therefore, for the two experiments discussed here and, we believe, in many others, the M1 and M2 penalties can be used to increase fault tolerance and generalization ability in overcapacity networks. 7 Discussion The results presented show that the inclusion of penalty terms in the learning process can control weight saliency directly, across the entire network. This has a deterministic and direct influence on fault tolerance, maximizing the distributedness of the network computation. In addition, weight saliency penalties act as a complexity constraint, reducing overfitting, in even large network architectures. It is therefore possible to increase the available complexity of a network prior to training in the knowledge that its ability to generalize can be restored by this form of regularization. In addition, a trade-off can be seen to take place between fault tolerance and generalization ability. In the experiments described, the results show that
Toward Optimally Distributed Computation
1003
increasing regularization causes the generalization error to pass through an optimal (minimum) value and then increase. The results also offer a hint that for some problems, there may be more than one minimum. This means that it is not trivially easy to optimize generalization ability by adjusting the level of weight-saliency regularization. However, the results show that the M2 regularization term provides a greater reduction in generalization error and a wider region over which fault tolerance and generalization ability are both enhanced. The M2 regularizer is therefore to be preferred, although there is a hidden cost in computational overhead for this more awkward term. In experiment 1, the augmentation technique was assessed. The underlying premise of a magnitude-based weight saliency measure has been shown to be flawed when tested using our realistic fault model. The gains in fault tolerance due to this method are not as great as those obtainable through weight saliency regularization using our penalty terms. However, there is no trade-off with generalization, and the augmentation process is computationally less expensive. We have shown that weight-saliency regularization terms are highly useful mechanisms for improving network performance, but further work must be done to facilitate the choice of the coefficients. One approach is to consider this work within a probabilistic framework. In this way, the regularization terms can be thought of as priors and their coefficients calculated dynamically during the Bayesian inference process, providing a natural way of including these and other regularizers. In addition, this would allow a comparison to be carried out between regularizers of functional smoothness as opposed to network complexity. We will return to these issues in a later article.

8 Conclusions

This article has introduced optimally distributed computation. By including weight saliency constraints during learning (as penalty terms), computation can be spread thinly and evenly throughout a network. In this way, the average curvature of the solution locus is reduced, leading to dramatic improvements in fault tolerance for a realistic "small perturbation" fault model. In addition, we have shown that such a constraint can also be used to regularize network complexity even for very large networks, resulting in better generalization ability. Our algorithm can therefore be used to constrain large, overcapacity networks to achieve distributed computation, which also leads to a flattening of the solution locus. A flat minima search is the explicit aim of Hochreiter and Schmidhuber (1997). Although any quantitative comparison of the two approaches is impossible without intensive simulation, qualitatively both perform some kind of smoothing of the solution locus. Further work is required to explore the importance of Hochreiter and Schmidhuber's variance term discussed earlier, and the relative advantages of their architecturally minimal networks, achieved via pruning, versus the constrained overcapacity networks we consider here. In
addition, we note the value of using a second-order penalty (M2) and that there is some evidence that even higher-order terms may be beneficial (Seber & Wild, 1989). These results are important if neural networks are ever to become truly parallel architectures in hardware implementations and to be able to solve complex real-world problems with limited data. Optimally distributed computation allows us to maximize the fault-tolerance performance of a given architecture for realistic hardware faults. In addition, the complexity constraints allow us to produce parsimonious models even in large network architectures.

Acknowledgments

This work was supported by EPSRC grant number GR/J 44539. In addition, we thank the anonymous reviewers of this article for their many helpful and insightful comments and suggestions.

References

Barron, A., & Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Information Theory, 37(4), 1034–1054.
Bishop, C. (1990). Curvature-driven smoothing in backpropagation neural networks. Proc. International Joint Conference on Neural Networks, 2, 749–752.
Bishop, C. (1995a). Neural networks for pattern recognition. New York: Oxford University Press.
Bishop, C. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
Bolt, G. (1992). Fault tolerance in artificial neural networks. Unpublished doctoral dissertation, University of York.
Buntine, W., & Weigend, A. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.
Carter, M. (1988). The "illusion" of fault tolerance in neural networks for pattern recognition and signal processing. In Proc. Technical Session on Fault-Tolerant Integrated Systems. Durham, NH: University of New Hampshire.
Edwards, P., & Murray, A. (1995). Can deterministic penalty terms model the effects of synaptic weight noise on network fault-tolerance? International Journal of Neural Systems, 6(4), 401–416.
Edwards, P., & Murray, A. (1996a). Analogue imprecision in MLP training. Singapore: World Scientific.
Edwards, P., & Murray, A. (1996b). Modelling weight- and input-noise in MLP learning. In Proc. International Conference on Neural Networks (Vol. 1, pp. 78–83). Washington, D.C.
Edwards, P., & Murray, A. (in press). Fault-tolerance via weight-noise in analogue VLSI implementations—a case study with EPSILON. In IEEE Proceedings on Circuits and Systems II: Analog and Digital Signal Processing.
Emmerson, M. (1992). Fault tolerance and redundancy in neural networks. Unpublished doctoral dissertation, University of Southampton.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2, 183–192.
Gällmo, O., Nordström, E., Gustafsson, M., & Asplund, L. (1993). Neural networks for preventive traffic control in broadband ATM networks. In International Workshop on Mechatronical Computer Systems for Perception and Action (MCPA-93) (pp. 139–145). Halmstad, Sweden.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Hassibi, B., & Stork, D. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In Proc. Neural Information Processing Systems (NIPS) Conference (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines (pp. 282–317). Cambridge, MA: MIT Press.
Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computation, 9, 1–42.
Lehmann, T. (1994). Hardware learning in analogue VLSI neural networks. Unpublished doctoral dissertation, Technical University of Denmark.
MacKay, D. (1992). Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472.
Murray, A., & Edwards, P. (1994). Synaptic weight noise during MLP training: Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Trans. Neural Networks, 5(5), 792–802.
Neal, R. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Neti, C., Schneider, M., & Young, E. (1992). Maximally fault tolerant neural networks. IEEE Trans. Neural Networks, 3(1), 14–23.
Nordström, E., Gällmo, O., Asplund, L., Gustafsson, M., & Eriksson, B. (1992). Neural networks for admission control in an ATM network. In L. Niklasson & M. Bodén (Eds.), Connectionism in a broad perspective: Selected papers from the Swedish Conference on Connectionism—1992 (pp. 239–250). Ellis Horwood.
Phatak, D., & Koren, I. (1995). Complete and partial fault tolerance of feedforward neural nets. IEEE Trans. Neural Networks, 6(2), 446–456.
Ruck, D., Rogers, S., Kabrisky, M., Oxley, M., & Suter, B. (1990). The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. Neural Networks, 1(4), 296–298.
Seber, G., & Wild, C. (1989). Nonlinear regression. New York: Wiley.
Séquin, C., & Clay, R. (1990). Fault tolerance in artificial neural networks. Proc. International Joint Conference on Neural Networks (Vol. 1, pp. 703–708). San Diego.

Received March 27, 1997; accepted September 3, 1997.
LETTER
Communicated by Scott Fahlman
Efficient Adaptive Learning for Classification Tasks with Binary Units J. Manuel Torres Moreno Mirta B. Gordon Département de Recherche Fondamentale sur la Matière Condensée, CEA Grenoble, 38054 Grenoble Cedex 9, France
This article presents a new incremental learning algorithm for classification tasks, called NetLines, which is well adapted for both binary and real-valued input patterns. It generates small, compact feedforward neural networks with one hidden layer of binary units and binary output units. A convergence theorem ensures that solutions with a finite number of hidden units exist for both binary and real-valued input patterns. An implementation for problems with more than two classes, valid for any binary classifier, is proposed. The generalization error and the size of the resulting networks are compared to the best published results on well-known classification benchmarks. Early stopping is shown to decrease overfitting, without improving the generalization performance.
1 Introduction

Feedforward neural networks have been successfully applied to the problem of learning pattern classification from examples. The relationship of the number of weights to the learning capacity and the network's generalization ability is well understood only for the simple perceptron, a single binary unit whose output is a sigmoidal function of the weighted sum of its inputs. In this case, efficient learning algorithms based on theoretical results allow the determination of the optimal weights. However, simple perceptrons can generalize only those (very few) problems in which the input patterns are linearly separable (LS). In many actual classification tasks, multilayered perceptrons with hidden units are needed. However, neither the architecture (number of units, number of layers) nor the functions that hidden units have to learn are known a priori, and the theoretical understanding of these networks is not enough to provide useful hints. Although pattern classification is an intrinsically discrete task, it may be cast as a problem of function approximation or regression by assigning real values to the targets. This is the approach used by backpropagation and

Neural Computation 10, 1007–1030 (1998) © 1998 Massachusetts Institute of Technology
related algorithms, which minimize the squared training error of the output units. The approximating function must be highly nonlinear because it has to fit a constant value inside the domains of each class and present a large variation at the boundaries between classes. For example, in a binary classification task in which the two classes are coded as +1 and −1, the approximating function must be constant and positive in the input space regions or domains corresponding to class +1 and constant and negative for those of class −1. The network's weights are trained to fit this function everywhere—in particular, inside the class domains—instead of concentrating on the relevant problem of the determination of the frontiers between classes. Because the number of parameters needed for the fit is not known a priori, it is tempting to train a large number of weights that can span, at least in principle, a large set of functions expected to contain the "true" one. This introduces a small bias (Geman, Bienenstock, & Doursat, 1992), but leaves us with the difficult problem of minimizing a cost function in a high-dimensional space, with the risk that the algorithm gets stuck in spurious local minima, whose number grows with the number of weights. In practice, the best generalizer is determined through a trial-and-error process in which both the numbers of neurons and weights are varied. An alternative approach is provided by incremental, adaptive, or growth algorithms, in which the hidden units are successively added to the network. One advantage is fast learning, not only because the problem is reduced to training simple perceptrons but also because adaptive procedures do not need the trial-and-error search for the most convenient architecture. Growth algorithms allow the use of binary hidden neurons, well suited for building hardware-dedicated devices. Each binary unit determines a domain boundary in input space. Patterns lying on either side of the boundary are given different hidden states. Thus, all the patterns inside a domain in input space are mapped to the same internal representation (IR). This binary encoding is different for each domain. The output unit performs a logic (binary) function of these IRs, a feature that may be useful for rule extraction. Because there is not a unique way of associating IRs to the input patterns, different incremental learning algorithms propose different targets to be learned by the appended hidden neurons. This is not the only difference. Several heuristics exist that generate fully connected feedforward networks with one or more layers, and treelike architectures with different types of neurons (linear, radial basis functions). Most of these algorithms are not optimal with respect to the number of weights or hidden units. Indeed, growth algorithms have often been criticized because they may generate networks that are too large, generally believed to be poor generalizers because of overfitting. This article presents a new incremental learning algorithm for binary classification tasks that generates small feedforward networks. These networks have a single hidden layer of binary neurons fully connected to the inputs and a single output neuron connected to the hidden units. We call it NetLines, for Neural Encoder Through Linear Separations. During the
learning process, the targets that each appended hidden unit has to learn help to decrease the number of classification errors of the output neuron. The crucial test for any learning algorithm is the generalization ability of the resulting network. It turns out that the networks built with NetLines are generally smaller and generalize better than the best networks found so far on well-known benchmarks. Thus, large networks do not necessarily follow from growth heuristics. On the other hand, although smaller networks may be generated with NetLines through early stopping, we found that they do not generalize better than the networks that were trained until the number of training errors vanished. Thus, overfitting does not necessarily spoil the network’s performance. This surprising result is in good agreement with recent work on the bias-variance dilemma (Friedman, 1996) showing that, unlike in regression problems where bias and variance compete in the determination of the optimal generalizer, in the case of classification they combine in a highly nonlinear way. Although NetLines creates networks for two-class problems, multiclass problems may be solved using any strategy that combines binary classifiers, like winner-takes-all. We propose a more involved approach, through the construction of a tree of networks, that may be coupled with any binary classifier. NetLines is an efficient approach for creating small, compact classifiers for problems with binary or continuous inputs. It is best suited for problems requiring a discrete classification decision. Although it may estimate posterior probabilities, as discussed in section 2.6, this requires more information than the bare network’s output. Another weakness of NetLines is that it is not simple to retrain the network when new patterns are available or class priors change over time. In section 2, we give the basic definitions and present a simple example of our strategy, followed by the formal presentation of the growth heuristics and the perceptron learning algorithm used to train the individual units. In section 3 we compare NetLines to other growth strategies. The construction of trees of networks for multiclass problems is presented in section 4. A comparison of the generalization error and the network’s size, with results obtained with other learning procedures, is presented in section 5. The conclusions are set out in section 6. 2 The Incremental Learning Strategy 2.1 Definitions. We are given a training set of P input-output examples µ µ µ {ξE µ , τ µ }, where µ = 1, 2, . . . , P. The inputs ξE µ = (1, ξ1 , ξ2 , . . . , ξN ) may be µ binary or real valued N+1 dimensional vectors. The first component ξ0 ≡ 1, the same for all the patterns, allows us to treat the bias as a supplementary weight. The outputs are binary, τ µ = ±1. These patterns are used to learn the classification task with the growth algorithm. Assume that, at a given stage of the learning process, the network already has h binary neurons
in the hidden layer. These neurons are connected to the N + 1 input units through synaptic weights w_k = (w_k0, w_k1, …, w_kN), 1 ≤ k ≤ h, w_k0 being the bias. Then, given an input pattern ξ, the states σ_k of the hidden neurons (1 ≤ k ≤ h), given by

$$\sigma_k = \mathrm{sign}\left( \sum_{i=0}^{N} w_{ki}\,\xi_i \right) \equiv \mathrm{sign}(\vec w_k \cdot \vec\xi\,), \qquad (2.1)$$
define the pattern's h-dimensional IR, σ(h) = (1, σ_1, …, σ_h). The network's output ζ(h) is

$$\zeta(h) = \mathrm{sign}\left( \sum_{k=0}^{h} W_k\,\sigma_k \right) \equiv \mathrm{sign}\left( \vec W(h) \cdot \vec\sigma(h) \right), \qquad (2.2)$$
where W(h) = (W_0, W_1, …, W_h) are the output unit weights. Hereafter, σ^µ(h) = (1, σ_1^µ, …, σ_h^µ) is the h-dimensional IR associated by the network of h hidden units to pattern ξ^µ. During the training process, h increases through the addition of hidden neurons, and we denote the final number of hidden units as H.

2.2 Example. We first describe the general strategy on a schematic example (see Figure 1). Patterns in the gray region belong to class τ = +1, the others to τ = −1. The algorithm proceeds as follows. A first hidden unit is trained to separate the input patterns at best and finds one solution, say w_1, represented in Figure 1 by the line labeled 1, with the arrow pointing into the positive half-space. Because training errors remain, a second hidden neuron is introduced. It is trained to learn targets τ_2 = +1 for patterns well classified by the first neuron and τ_2 = −1 for the others (the opposite convention could be adopted, both being strictly equivalent), and suppose that solution w_2 is found. Then an output unit is connected to the two hidden neurons and is trained with the original targets. Clearly it will fail to separate all the patterns correctly, because the IRs (− +) and (+ −) are not faithful: patterns of both classes are mapped onto them. The output neuron is dropped, and a third hidden unit is appended and trained with targets τ_3 = +1 for patterns that were correctly classified by the output neuron and τ_3 = −1 for the others. Solution w_3 is found, and it is easy to see that now the IRs are faithful; that is, patterns belonging to different classes are given different IRs. The algorithm converged with three hidden units that define three domain boundaries determining six regions or domains in the input space. It is straightforward to verify that the IRs corresponding to each domain in Figure 1 are linearly separable. Thus, the output unit will find the correct solution to the training problem. If the faithful IRs were not linearly separable, the output unit would not find a solution without training errors, and the algorithm would go on appending hidden units that should learn
[Figure 1 diagram: three arrowed lines, labeled 1, 2, and 3, dividing the plane into six domains, each annotated with its three-bit IR.]
Figure 1: Patterns inside the gray region belong to one class, those in the white region to the other. The lines (labeled 1, 2, and 3) represent the hyperplanes found with the NetLines strategy. The arrows point into the corresponding positive half-spaces. The IRs of each domain are indicated (the first component, σ_0 = 1, is omitted for clarity).
targets τ = 1 for well-learned patterns and τ = −1 for the others. A proof that a solution to this strategy with a finite number of hidden units exists is left to the appendix.

2.3 The Algorithm NetLines. Like most other adaptive learning algorithms, NetLines combines a growth heuristics with a particular learning algorithm for training the individual units, which are simple perceptrons. In this section, we present the growth heuristics first, followed by the description of Minimerror, our perceptron learning algorithm. We first introduce the following useful remark: if a neuron has to learn a target τ, and the learned state turns out to be σ, then the product στ = 1 if the target has been correctly learned, and στ = −1 otherwise. Given a maximal accepted number of hidden units, H_max, and a maximal number of tolerated training errors, E_max, the NetLines algorithm may be summarized as follows:

Algorithm.

• Initialize h = 0; set the targets τ_{h+1}^µ = τ^µ for µ = 1, …, P;
• Repeat
  1. /* train the hidden units */
     h = h + 1; /* connect hidden unit h to the inputs */
     learn the training set {ξ^µ, τ_h^µ}, µ = 1, …, P;
     after learning, σ_h^µ = sign(w_h · ξ^µ), µ = 1, …, P;
     if h = 1 /* for the first hidden neuron */
        if σ_1^µ = τ_1^µ ∀µ, then stop; /* the training set is LS */
        else set τ_{h+1}^µ = σ_h^µ τ^µ for µ = 1, …, P; go to 1;
     end if
  2. /* learn the mapping between the IRs and the outputs */
     connect the output neuron to the h trained hidden units;
     learn the training set {σ^µ(h), τ^µ}, µ = 1, …, P;
     after learning, ζ^µ(h) = sign(W(h) · σ^µ(h)), µ = 1, …, P;
     set τ_{h+1}^µ = ζ^µ τ^µ for µ = 1, …, P;
     count the number of training errors e = Σ_µ (1 − τ_{h+1}^µ)/2;
• Until (h = H_max or e ≤ E_max);

The generated network has H = h hidden units. In the appendix we present a solution to the learning strategy with a bounded number of hidden units. In practice, the algorithm ends up with much smaller networks than this upper bound, as will be shown in section 5.

2.4 The Perceptron Learning Algorithm. The final number of hidden neurons, which are simple perceptrons, depends on the performance of the learning algorithm used to train them. The best solution should minimize the number of errors. If the training set is LS, it should endow the units with the lowest generalization error. Our incremental algorithm uses Minimerror (Gordon & Berchier, 1993) to train the hidden and output units. Minimerror is based on the minimization of a cost function E that depends on the weights w through the stabilities of the training patterns. If the input vector is ξ^µ and τ^µ the corresponding target, then the stability γ^µ of pattern µ is a continuous and derivable function of the weights, given by
set τh+1 = ζ µ τ µ for µ = 1, . . . , P; P µ count the number of training errors e = µ (1 − τh+1 )/2; • Until (h = Hmax or e ≤ Emax ); The generated network has H = h hidden units. In the appendix we present a solution to the learning strategy with a bounded number of hidden units. In practice, the algorithm ends up with much smaller networks than this upper bound, as will be shown in section 5. 2.4 The Perceptron Learning Algorithm. The final number of hidden neurons, which are simple perceptrons, depends on the performance of the learning algorithm used to train them. The best solution should minimize the number of errors. If the training set is LS, it should endow the units with the lowest generalization error. Our incremental algorithm uses Minimerror (Gordon & Berchier, 1993) to train the hidden and output units. Minimerror is based on the minimization of a cost function E that depends on the E through the stabilities of the training patterns. If the perceptron weights w input vector is ξE µ and τ µ the corresponding target, then the stability γ µ of pattern µ is a continuous and derivable function of the weights, given by: γ µ = τµ
E · ξE µ w , E kwk
(2.3)
where ‖w‖ = √(w · w). The stability is independent of the norm of the weights, ‖w‖. It measures the distance of the pattern to the separating hyperplane, which is normal to w; it is positive if the pattern is well classified, negative
otherwise. The cost function E is

$$E = \frac{1}{2} \sum_{\mu=1}^{P} \left[ 1 - \tanh\frac{\gamma^\mu}{2T} \right]. \qquad (2.4)$$
The contribution to E of patterns with large negative stabilities is ≈ 1; that is, they are counted as errors, whereas the contribution of patterns with large, positive stabilities is vanishingly small. Patterns on both sides of the hyperplane within a window of width ≈ 4T contribute to the cost function even if they have positive stability. The properties of the global minimum of equation 2.4 have been studied theoretically with methods of statistical mechanics (Gordon & Grempel, 1995). It was shown that in the limit T → 0, the minimum of E corresponds to the weights that minimize the number of training errors. If the training set is LS, these weights are not unique (Györgyi & Tishby, 1990). In that case, there is an optimal learning temperature such that the weights minimizing E at that temperature endow the perceptron with a generalization error numerically indistinguishable from the optimal (Bayesian) value. The algorithm Minimerror (Gordon & Berchier, 1993; Raffin & Gordon, 1995) implements a minimization of E restricted to a subspace of normalized weights, through a gradient descent combined with a slow decrease of the temperature T, which is equivalent to a deterministic annealing. It has been shown that the convergence is faster if patterns with negative stabilities are considered at a temperature T₋ larger than those with positive stabilities, T₊, with a constant ratio θ = T₋/T₊. The weights and the temperatures are iteratively updated through

$$\delta\vec w(t) = \epsilon \left[ \sum_{\mu/\gamma^\mu \le 0} \frac{\tau^\mu \vec\xi^{\,\mu}}{2\cosh^2(\gamma^\mu/2T_-)} + \sum_{\mu/\gamma^\mu > 0} \frac{\tau^\mu \vec\xi^{\,\mu}}{2\cosh^2(\gamma^\mu/2T_+)} \right], \qquad (2.5)$$

$$T_+^{-1}(t+1) = T_+^{-1}(t) + \delta T^{-1}; \qquad T_- = \theta\, T_+, \qquad (2.6)$$

$$\vec w(t+1) = \sqrt{N+1}\;\frac{\vec w(t) + \delta\vec w(t)}{\|\vec w(t) + \delta\vec w(t)\|}. \qquad (2.7)$$
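The following is a minimal sketch of the Minimerror iteration of equations 2.5 through 2.7 on toy data. The standard parameter values quoted later in section 2.5 (ε = 0.02, δT⁻¹ = 10⁻³, θ = 6) come from the article; the initial temperature and the data are our own assumptions.

```python
# Minimal sketch of the Minimerror iteration (equations 2.5-2.7) on toy
# data. The standard values epsilon = 0.02, deltaT_inv = 1e-3, theta = 6
# are quoted in the article; the initial temperature is our assumption.
import numpy as np

rng = np.random.default_rng(3)
P, N = 100, 4
xi = np.hstack([np.ones((P, 1)), rng.normal(size=(P, N))])  # xi_0 = 1 (bias)
tau = np.sign(xi[:, 1] + 0.3 * rng.normal(size=P))          # toy targets

w = rng.normal(size=N + 1)
w *= np.sqrt(N + 1) / np.linalg.norm(w)    # normalization of equation 2.7
epsilon, deltaT_inv, theta = 0.02, 1e-3, 6.0
T_plus = 1.0                               # assumed initial temperature

for _ in range(500):
    T_minus = theta * T_plus
    gamma = tau * (xi @ w) / np.linalg.norm(w)   # stabilities, equation 2.3
    T = np.where(gamma <= 0, T_minus, T_plus)    # hotter for misclassified
    coef = tau / (2.0 * np.cosh(gamma / (2.0 * T)) ** 2)
    w_new = w + epsilon * (xi.T @ coef)          # gradient step, equation 2.5
    w = np.sqrt(N + 1) * w_new / np.linalg.norm(w_new)  # equation 2.7
    T_plus = 1.0 / (1.0 / T_plus + deltaT_inv)   # annealing, equation 2.6

print("training errors:", int(np.sum(tau * (xi @ w) < 0)))
```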
Notice from equation 2.5 that only the incorrectly learned patterns at distances shorter than ≈ 2T₋ from the hyperplane, and those correctly learned lying closer than ≈ 2T₊, contribute effectively to learning. The contribution of patterns outside this region is vanishingly small. By decreasing the temperature, the algorithm selects to learn patterns increasingly localized in the neighborhood of the hyperplane, allowing for a highly precise determination of the parameters defining the hyperplane, which are the neuron's weights. Normalization 2.7 restricts the search to the subspace with ‖w‖ = √(N + 1). The only adjustable parameters of the algorithm are the temperature ratio θ = T₋/T₊, the learning rate ε, and the annealing rate δT⁻¹. In principle,
they should be adapted to each specific problem. However, as a result of our normalizing the weights to √(N + 1) and of data standardization (see the next section), all the problems are brought to the same scale, simplifying the choice of the parameters.

2.5 Data Standardization. Instead of determining the best parameters for each new problem, we standardize the input patterns of the training set through a linear transformation, applied to each component:

$$\tilde\xi_i^{\,\mu} = \frac{\xi_i^\mu - \langle \xi_i \rangle}{\Delta_i}; \quad 1 \le i \le N. \qquad (2.8)$$
The mean ⟨ξ_i⟩ and the variance Δ_i², defined as usual,

$$\langle \xi_i \rangle = \frac{1}{P} \sum_{\mu=1}^{P} \xi_i^\mu, \qquad (2.9)$$

$$\Delta_i^2 = \frac{1}{P} \sum_{\mu=1}^{P} \left( \xi_i^\mu - \langle \xi_i \rangle \right)^2 = \frac{1}{P} \sum_{\mu=1}^{P} (\xi_i^\mu)^2 - \langle \xi_i \rangle^2, \qquad (2.10)$$
need only a single pass over the P training patterns to be determined. After learning, the inverse transformation is applied to the weights,

$$\tilde w_0 = \sqrt{N+1}\;\frac{w_0 - \sum_{j=1}^{N} w_j \langle \xi_j \rangle / \Delta_j}{\sqrt{\left[ w_0 - \sum_{j=1}^{N} w_j \langle \xi_j \rangle / \Delta_j \right]^2 + \sum_{j=1}^{N} (w_j/\Delta_j)^2}}, \qquad (2.11)$$

$$\tilde w_i = \sqrt{N+1}\;\frac{w_i/\Delta_i}{\sqrt{\left[ w_0 - \sum_{j=1}^{N} w_j \langle \xi_j \rangle / \Delta_j \right]^2 + \sum_{j=1}^{N} (w_j/\Delta_j)^2}}, \qquad (2.12)$$
so that the normalization (see equation 2.8) is completely transparent to the user: with the transformed weights (see equations 2.11 and 2.12), the neural classifier is applied to the data in the original user's units, which do not need to be renormalized. As a consequence of the weight scaling (see equation 2.7) and the input standardization (see equation 2.8), all the problems are automatically rescaled. This allows us to always use the same values of Minimerror's parameters: the standard values ε = 0.02, δT⁻¹ = 10⁻³, and θ = 6. They were used throughout this article, the reported results being highly insensitive to slight variations of them. However, in some extremely difficult cases, like learning the parity in dimensions N > 10 and finding the separation of the sonar signals (see section 5), larger values of θ were needed.
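A short sketch of the standardization of equation 2.8 and the inverse weight transformation of equations 2.11 and 2.12 follows; the data are synthetic and the "trained" weight vector is faked with random values purely to exercise the transform.

```python
# Sketch of the standardization of equation 2.8 and the inverse weight
# transformation of equations 2.11-2.12. The data are synthetic and the
# "trained" weights are random, purely to exercise the transformation.
import numpy as np

rng = np.random.default_rng(4)
P, N = 200, 3
xi = rng.normal(loc=2.0, scale=5.0, size=(P, N))   # raw patterns, user units

mean = xi.mean(axis=0)                             # <xi_i>, equation 2.9
delta = xi.std(axis=0)                             # Delta_i, equation 2.10
xi_std = (xi - mean) / delta                       # equation 2.8

w = rng.normal(size=N + 1)                         # (w_0, ..., w_N) on xi_std

# Equations 2.11-2.12: map the weights back to the user's original units.
a = w[0] - np.sum(w[1:] * mean / delta)
norm = np.sqrt(a ** 2 + np.sum((w[1:] / delta) ** 2))
w_tilde = np.sqrt(N + 1) / norm * np.concatenate(([a], w[1:] / delta))

# The classifier now acts directly on the unnormalized inputs.
s_std = np.sign(w[0] + xi_std @ w[1:])
s_raw = np.sign(w_tilde[0] + xi @ w_tilde[1:])
assert np.array_equal(s_std, s_raw)
```

Because the transformed weight vector is a positive rescaling of the standardized-space decision function, the classification of every pattern is unchanged, which is exactly the transparency claimed above.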
2.6 Interpretation. It has been shown (Gordon, Peretto, & Berchier, 1993) that the contribution of each pattern to the cost function of Minimerror, [1 − tanh(γ^µ/2T)]/2, may be interpreted as the probability of misclassification at the temperature T at which the minimum of the cost function has been determined. By analogy, the neuron's prediction on a new input ξ may be given a confidence measure by replacing the (unknown) pattern stability by its absolute value |γ| = |w · ξ|/‖w‖, which is its distance to the hyperplane. This interpretation of the sigmoidal function tanh(|γ|/2T) as the confidence in the neuron's output is similar to the one proposed earlier (Goodman, Smyth, Higgins, & Miller, 1992) within an approach based on information theory. The generalization of these ideas to multilayered networks is not straightforward. An estimate of the confidence in the classification by the output neuron should include the magnitude of the weighted sums of the hidden neurons, as they measure the distances of the input pattern to the domain boundaries. However, short distances to the separating hyperplanes are not always correlated with low confidence in the network's output. For an example, we refer again to Figure 1. Consider a pattern lying close to hyperplane 1. A small weighted sum on neuron 1 may cast doubt on the classification if the pattern's IR is (− + +) but not if it is (− + −), because a change of the sign of the weighted sum in the latter case will map the pattern to the IR (+ + −), which, being another IR of the same class, will be given the same output by the network. It is worth noting that the same difficulty is met by the interpretation of the outputs of multilayered perceptrons, trained with backpropagation, as posterior probabilities. We do not explore this problem any further because it is beyond the scope of this article.

3 Comparison with Other Strategies

There are few learning algorithms for neural networks composed of binary units. To our knowledge, all of them are incremental. In this section, we give a short overview of some of them, in order to put forward the main differences with NetLines. We discuss the growth heuristics and then the individual unit training algorithms. The Tiling algorithm (Mézard & Nadal, 1989) introduces hidden layers, one after the other. The first neuron of each layer is trained to learn an IR that helps to decrease the number of training errors; supplementary hidden units are then appended to the layer until the IRs of all the patterns in the training set are faithful. This procedure may generate very large networks. The Upstart algorithm (Frean, 1990) introduces successive couples of daughter hidden units between the input layer and the previously included hidden units, which become their parents. The daughters are trained to correct the parents' classification errors, one daughter for each class. The obtained network has a treelike architecture. There are two different algorithms implementing Tiling-like Learning in the Parity Machine (Biehl & Opper,
1991): Offset (Martinez & Estève, 1992) and MonoPlane (Torres Moreno & Gordon, 1995). In both, each appended unit is trained to correct the errors of the previously included unit in the same hidden layer, a procedure that has been shown to generate a parity machine: the class of the input patterns is the parity of the learned IRs. Unlike Offset, which implements the parity through a second hidden layer that needs to be pruned, MonoPlane goes on adding hidden units (if necessary) in the same hidden layer until the number of training errors at the output vanishes. Convergence proofs for binary input patterns have been produced for all these algorithms. In the case of real-valued input patterns, a solution to the parity machine with a bounded number of hidden units also exists (Gordon, 1996). The rationale behind the construction of the parity machine is that it is not worth training the output unit before all the training errors of the hidden units have been corrected. However, Marchand, Golea, and Ruján (1990) pointed out that it is not necessary to correct all the errors of the successively trained hidden units. It is sufficient that the IRs be faithful and LS. If the output unit is trained immediately after each appended hidden unit, the network may discover that the IRs are already faithful and stop adding units. This may be seen in Figure 1. None of the parity machine implementations would find the solution represented in the figure, because each of the three perceptrons systematically unlearns part of the patterns learned by the preceding one. To our knowledge, Sequential Learning (Marchand et al., 1990) is the only incremental learning algorithm that might find a solution equivalent (although not the same) to the one of Figure 1. In this algorithm, the first unit is trained to separate the training set keeping one "pure" half-space—containing patterns of only one class. Wrongly classified patterns, if any, must all lie in the other half-space. Each appended neuron is trained to separate wrongly classified patterns with this constraint of always keeping one pure, error-free half-space. Thus, neurons must be appended in a precise order, making the algorithm difficult to implement in practice. For example, Sequential Learning applied to the problem of Figure 1 needs to impose that the first unit find the weights w_3, the only solution satisfying the purity restriction. Other proposed incremental learning algorithms strive to solve the problem with different architectures and/or with real-valued units. For example, in the algorithm Cascade Correlation (Fahlman & Lebiere, 1990), each appended unit is selected among a pool of several real-valued neurons, trained to learn the correlation between the targets and the training errors. The unit is then connected to the input units and to all the other hidden neurons already included in the network. Another approach to learning classification tasks is through the construction of decision trees (Breiman, Friedman, Olshen, & Stone, 1984), which hierarchically partition the input space through successive dichotomies. The neural network implementations generate treelike architectures. Each neu-
ron of the tree introduces a dichotomy of the input space, which is treated separately by the children nodes, which eventually produce new splits. Besides the weights, the resulting networks need to store the decision path. The proposed heuristics (Sirat & Nadal, 1990; Farrell & Mammone, 1994; Knerr, Personnaz, & Dreyfus, 1990) differ in the algorithm used to train each node and/or in the stopping criterion. In particular, Neural-Trees (Sirat & Nadal, 1990) may be regarded as a generalization of Classification and Regression Trees (CART) (Breiman et al., 1984) in which the hyperplanes are not constrained to be perpendicular to the coordinate axes. The heuristics of the Modified Neural Tree Network (MNTN) (Farrell & Mammone, 1994), similar to Neural-Trees, includes a criterion of early stopping based on a confidence measure of the partition. As NetLines considers the whole input space to train each hidden unit, it generates domain boundaries that may greatly differ from the splits produced by trees. We are not aware of any systematic study or theoretical comparison of the two approaches. Other algorithms, like Restricted Coulomb Energy (RCE) (Reilly, Cooper, & Elbaum, 1982), Grow and Learn (GAL) (Alpaydin, 1990), Glocal (Depenau, 1995), and Growing Cells (Fritzke, 1994), propose to cover or mask the input space with hyperspheres of adaptive size containing patterns of the same class. These approaches generally end up with a very large number of units. Covering Regions by the LP Method (Mukhopadhyay, Roy, Kim, & Govil, 1993) is a trial-and-error procedure devised to select the most efficient masks among hyperplanes, hyperspheres, and hyperellipsoids. The masks' parameters are determined through linear programming. Many incremental strategies use the Pocket algorithm (Gallant, 1986) to train the appended units. Its main drawback is that it has no natural stopping condition, which is left to the user's patience. The proposed alternative algorithms (Frean, 1992; Bottou & Vapnik, 1992) are not guaranteed to find the best solution to the learning problem. The algorithms used by the MNTN (Farrell & Mammone, 1994) and by ITRULE (Goodman et al., 1992) minimize cost functions similar to equation 2.4, but using different misclassification measures in place of our stability (see equation 2.3). The essential difference with Minimerror is that none of these algorithms is able to control which patterns contribute to learning, as Minimerror does with the temperature.

4 Generalization to Multiclass Problems

The usual way to cope with problems having more than two classes is to generate as many networks as classes. Each network is trained to separate patterns of one class from all the others, and a winner-takes-all (WTA) strategy, based on the value of the output's weighted sum in equation 2.2, is used to decide the class if more than one network recognizes the input pattern. In our case, because we use normalized weights, the output's weighted sum is merely the distance of the IR to the separating hyperplane. All the pat-
terns mapped to the same IR are given the same output weighted sum, independent of the relative position of the pattern in input space. A strong weighted sum on the output neuron is not inconsistent with small weighted sums on the hidden neurons. Therefore, a naive WTA decision may not give good results, as shown in the example in section 5.3.1. We now describe an implementation for the multiclass problem that results in a treelike architecture of networks. It is more involved than the naive WTA and may be applied to any binary classifier. Suppose that we have a problem with C classes. We must choose in which order the classes will be learned, say (c₁, c₂, …, c_C). This order constitutes a particular learning sequence. Given a particular learning sequence, a first network is trained to separate class c₁, which is given output target τ₁ = +1, from the others (which are given targets τ₁ = −1). The opposite convention is equivalent and could equally be used. After training, all the patterns of class c₁ are eliminated from the training set, and we generate a second network trained to separate patterns of class c₂ from the remaining classes. The procedure, reiterated with training sets of decreasing size, generates a hierarchically organized tree of networks (TON) of C − 1 networks: the outputs are ordered sequences ζ = (ζ₁, ζ₂, …, ζ_{C−1}). The predicted class of a pattern is c_i, where i is the first network in the sequence having an output +1 (ζ_i = +1 and ζ_j = −1 for j < i), the outputs of the networks with j > i being irrelevant. The performance of the TON may depend on the chosen learning sequence. Therefore, it is convenient that an odd number of TONs, trained with different learning sequences, compete through a vote. We verified empirically, as shown in section 5.3, that this vote improves the results obtained with each of the individual TONs participating in the vote. Notice that our procedure is different from bagging (Breiman, 1994): all the networks of the TON are trained with the same training set, without the need of any resampling procedure.

5 Applications

Although convergence proofs of learning algorithms are satisfactory on theoretical grounds, they are not a guarantee of good generalization. In fact, they demonstrate only that correct learning is possible; they do not address the problem of generalization. This last issue still remains quite empirical (Vapnik, 1992; Geman et al., 1992; Friedman, 1996), and the generalization performance of learning algorithms is usually tested on well-known benchmarks (Prechelt, 1994). We first tested the algorithm on learning the parity function of N bits for 2 ≤ N ≤ 11. It is well known that the smallest network with the architecture considered here needs H = N hidden neurons. The optimal architecture was found in all the cases. Although this is quite an unusual performance, the parity is not a representative problem: learning is exhaustive, and generalization cannot be tested. Another test, the classification of sonar signals
5 Applications
Although convergence proofs of learning algorithms are satisfactory on theoretical grounds, they are not a guarantee of good generalization. In fact, they demonstrate only that correct learning is possible; they do not address the problem of generalization. This last issue still remains quite empirical (Vapnik, 1992; Geman et al., 1992; Friedman, 1996), and the generalization performance of learning algorithms is usually tested on well-known benchmarks (Prechelt, 1994).
We first tested the algorithm on learning the parity function of N bits for 2 \le N \le 11. It is well known that the smallest network with the architecture considered here needs H = N hidden neurons. The optimal architecture was found in all cases. Although this is quite an unusual performance, parity is not a representative problem: learning is exhaustive, and generalization cannot be tested. Another test, the classification of sonar signals (Gorman & Sejnowski, 1988), revealed the quality of Minimerror, as it solved the problem without hidden units. In fact, we found not only that the training set of this benchmark is linearly separable, a result already reported (Hoehfeld & Fahlman, 1991; Roy, Kim, & Mukhopadhyay, 1993), but that the complete database—the training and the test sets together—is also linearly separable (Torres Moreno & Gordon, 1998).
We next present our results, generalization error \varepsilon_g and number of weights, on several benchmarks corresponding to different kinds of problems: binary classification of binary input patterns, binary classification of real-valued input patterns, and multiclass problems. These benchmarks were chosen because they have already served as a test for many other algorithms, providing us with unbiased results for comparison. The generalization error \varepsilon_g of NetLines was estimated as usual, through the fraction of misclassified patterns on a test set of data. The results are reported as a function of the training set size P whenever this size is not specified by the benchmark. Besides the generalization error \varepsilon_g, averaged over a (specified) number of classifiers trained with randomly selected training sets, we also present the number of weights of the corresponding networks, which is a measure of the classifier's complexity, as it corresponds to the number of its parameters.
Training times are usually cited among the characteristics of training algorithms. Only the numbers of epochs used by backpropagation on two of the studied benchmarks have been published; we restrict the comparison to these cases. As NetLines updates only N weights per epoch, whereas backpropagation updates all the network's weights, we compare the total number of weight updates. They are of the same order of magnitude for both algorithms. However, these comparisons should be taken with caution. NetLines is a deterministic algorithm; it learns the architecture and the weights in a single run, whereas with backpropagation several architectures must first be investigated, and this time is not included in the training time. The following notation is used: D is the total number of available patterns, P the number of training patterns, and G the number of test patterns.
5.1 Binary Inputs. The case of binary input patterns has the property, not shared by real-valued inputs, that every pattern may be separated from all the others by a single hyperplane. This solution, usually called grandmother, needs as many hidden units as patterns in the training set. In fact, the convergence proofs for incremental algorithms in the case of binary input patterns are based on this property.
5.1.1 Monk's Problem. This benchmark, thoroughly studied with many different learning algorithms (Thrun et al., 1991), contains three distinct problems. Each has an underlying logical proposition that depends on six discrete variables, coded with N = 17 binary numbers. The total number of possible input patterns is D = 432, and the targets correspond to the truth table of the corresponding proposition.
Both NetLines and MonoPlane found the underlying logical proposition of the first two problems; they generalized correctly, giving \varepsilon_g = 0. In fact, these are easy problems: all the neural network–based algorithms and some nonneural learning algorithms were reported to generalize them correctly. In the third Monk's problem, 6 patterns among the P_3 = 122 examples are given wrong targets. The generalization error is calculated over the complete set of D = 432 patterns, that is, including the training patterns, but in the test set all the patterns are given the correct targets. Thus, any training method that learns the training set correctly will make at least 1.4% (6/432) of generalization errors. Four algorithms specially adapted to noisy problems were reported to reach \varepsilon_g = 0; however, none of them generalizes correctly the two other (noiseless) Monk's problems. Besides these, the best performance, \varepsilon_g = 0.0277, which corresponds to 12 misclassified patterns, is reached only by neural network methods: backpropagation, backpropagation with weight decay, cascade correlation, and NetLines. The number of weights generated by NetLines (58) is intermediate between backpropagation with weight decay (39) and cascade correlation (75) or backpropagation (77). MonoPlane reached a slightly worse performance (\varepsilon_g = 0.0416, or 18 misclassified patterns) with the same number of weights as NetLines, showing that the parity machine encoding may not be optimal.
5.1.2 Two or More Clumps. In this problem (Denker et al., 1987) the network has to discriminate whether the number of clumps in a ring of N bits is strictly smaller than 2. One clump is a sequence of identical bits bounded by bits of the other kind (a minimal counting routine is sketched below). The patterns are generated through a Monte Carlo method in which the mean number of clumps is controlled by a parameter k (Mézard & Nadal, 1989). We generated training sets of P patterns with k = 3, corresponding to a mean number of clumps of ≈1.5, for rings of N = 10 and N = 25 bits. The generalization errors of several learning algorithms, estimated with independently generated test sets of the same sizes as the training sets, G = P, are displayed in Figure 2 as a function of P. Points with error bars correspond to averages over 25 independent training sets; points without error bars correspond to best results. NetLines, MonoPlane, and Upstart for N = 25 have nearly the same performance when trained to reach error-free learning. We tested the effect of early stopping by imposing on NetLines a maximal number of two hidden units (H = 2). The residual training error \varepsilon_t is plotted in Figure 2 as a function of P. Note that early stopping does not help to decrease \varepsilon_g. Overfitting, which arises when NetLines is applied until error-free training is reached, does not degrade the network's generalization performance. This behavior is very different from that of networks trained with backpropagation. The latter reduces classification learning to a regression problem, in which the generalization error can be decomposed into two competing terms: bias and variance.
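As an illustration of the task definition only (it plays no role in the learning algorithms), counting the clumps of 1s on a ring can be written as below; the function name and the convention of counting clumps of 1s are ours.

    def count_clumps(bits):
        """Number of clumps of 1s in a circular ring of bits: a clump
        starts wherever a 1 follows a 0 when walking around the ring."""
        if all(b == 1 for b in bits):
            return 1  # a single clump covering the whole ring
        return sum(1 for i in range(len(bits))
                   if bits[i] == 1 and bits[i - 1] == 0)

    # the benchmark target: is the number of clumps >= 2?
    assert count_clumps([1, 1, 0, 0, 1, 0]) == 2
    assert count_clumps([0, 1, 1, 1, 0, 0]) == 1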
Figure 2: Two or more clumps for two ring sizes, N = 10 and N = 25. Generalization error \varepsilon_g versus size of the training set P, for different algorithms. N = 10: backpropagation (Solla, 1989), Stepwise (Knerr et al., 1990). N = 25: Tiling (Mézard & Nadal, 1989), Upstart (Frean, 1990). Results with the Growth Algorithm (Nadal, 1989) are indistinguishable from those of Tiling at the scale of the figure. Points without error bars correspond to best results. Results of MonoPlane and NetLines are averages over 25 tests.
With backpropagation, early stopping helps to decrease overfitting because some hidden neurons do not reach weights large enough to work in the nonlinear part of the sigmoidal transfer functions. All the neurons working in the linear part may be replaced by a single linear unit; thus, with early stopping, the network is equivalent to a smaller one with all the units working in the nonlinear regime. Our results are consistent with recent theories (Friedman, 1996) showing that, contrary to regression, the bias and variance components of the generalization error in classification combine in a highly nonlinear way.
The number of weights used by the different algorithms is plotted on a logarithmic scale as a function of P in Figure 3. It turns out that the strategy of NetLines is slightly better than that of MonoPlane with respect to both generalization performance and network size.
5.2 Real-Valued Inputs. We tested NetLines on two problems that have real-valued inputs (we include graded-valued inputs here).
5.2.1 Wisconsin Breast Cancer Database. The input patterns of this benchmark (Wolberg & Mangasarian, 1990) have N = 9 attributes characterizing samples of breast cytology, classified as benign or malignant.
Figure 3: Two or more clumps. Number of weights (logarithmic scale) versus size of the training set P, for N = 10 and N = 25. Results of MonoPlane and NetLines are averages over 25 tests. The references are the same as in Figure 2.
We excluded from the original database 16 patterns that have the attribute \xi_6 ("bare nuclei") missing. Among the remaining D = 683 patterns, the two classes are unevenly represented, 65.5% of the examples being benign. We studied the generalization performance of networks trained with sets of several sizes P; the P patterns for each learning test were selected at random. In Figure 4a, the generalization error at classifying the remaining G \equiv D - P patterns is displayed as a function of the corresponding number of weights on a logarithmic scale. For comparison, we included in the same figure results of a single perceptron trained with P = 75 patterns using Minimerror. The results, averaged over 50 independent tests for each P, show that both NetLines and MonoPlane have lower \varepsilon_g and fewer parameters than other algorithms on this benchmark. The total number of weight updates needed by NetLines, including the weights of the dropped output units, is 7 \cdot 10^4; backpropagation needed \approx 10^4 (Prechelt, 1994).
The trained network may be used to classify the patterns with missing attributes (a schematic sweep is sketched below). The number of misclassified patterns among the 16 cases for which attribute \xi_6 is missing is plotted as a function of the possible values of \xi_6 in Figure 4b. For large values of \xi_6, there are discrepancies between the medical and the network's diagnosis on half the cases. This is an example of the kind of information that may be obtained in practical applications.
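Schematically, the sweep over candidate values of the missing attribute reads as follows; net.predict is a hypothetical interface to a trained network, and the value range 1 to 10 is our assumption about the benchmark's integer-coded attributes.

    def sweep_missing_attribute(net, x_incomplete, attr=5, values=range(1, 11)):
        """Classify a pattern once per candidate value of the missing
        attribute (index `attr`), returning the prediction for each value."""
        predictions = {}
        for v in values:
            x = list(x_incomplete)
            x[attr] = v          # impute the missing "bare nuclei" value
            predictions[v] = net.predict(x)
        return predictions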
Figure 4: Breast cancer classification. (a) Generalization error \varepsilon_g versus number of weights (logarithmic scale), for P = 525. 1–3: Rprop with no shortcuts (Prechelt, 1994); 4–6: Rprop with shortcuts (Prechelt, 1994); 7: Cascade Correlation (Depenau, 1995). For comparison, results with smaller training sets, P = 75 (single perceptron) and P = 160, are displayed. Results of MonoPlane and NetLines are averages over 50 tests. (b) Classification errors versus possible values of the missing attribute bare nuclei for the 16 incomplete patterns, averaged over 50 independently trained networks.
5.2.2 Diabetes Diagnosis. This benchmark (Prechelt, 1994) contains D = 768 patterns described by N = 8 real-valued attributes; \approx 35% of the Pima women represented suffer from diabetes, 65% being healthy. Training sets of P = 576 patterns were selected at random, and generalization was tested on the remaining G = 192 patterns. The comparison with published results obtained with other algorithms tested under the same conditions, presented in Figure 5, shows that NetLines reaches the best performance published so far on this benchmark, needing many fewer parameters. Training times of NetLines are \approx 10^5 updates. The numbers of updates needed by Rprop (Prechelt, 1994) range between 4 \cdot 10^3 and 5 \cdot 10^5, depending on the network's architecture.
5.3 Multiclass Problems. We applied our learning algorithm to two different problems, both with three classes. We compare the results obtained with a WTA classification based on the results of three networks, each independently trained to separate one class from the two others, to the results of the TON architectures described in section 4. Because the number of classes is low, we determined the three TONs, corresponding to the three possible learning sequences.
Figure 5: Diabetes diagnosis: Generalization error \varepsilon_g versus number of weights. Results of NetLines are averages over 50 tests. 1–3: Rprop no shortcuts; 4–6: Rprop with shortcuts (Prechelt, 1994).
The vote of the three TONs improves the performances, as expected.
5.3.1 Breiman's Waveform Recognition Problem. This problem was introduced as a test for the algorithm CART (Breiman et al., 1984). The input patterns are defined by N = 21 real-valued amplitudes x(t) observed at regularly spaced intervals t = 1, 2, \ldots, N. Each pattern is a noisy convex linear combination of two among three elementary waves (triangular waves centered on three different values of t). There are three possible combinations, and the pattern's class identifies the combination from which it is issued (a generator sketch is given below). We trained the networks with the same 11 training sets of P = 300 examples, and generalization was tested on the same independent test set of G = 5000 patterns, as in Gascuel (1995). Our results are displayed in Figure 6, where only results of algorithms reaching \varepsilon_g < 0.25 in Gascuel (1995) are included. Although it is known that, due to the noise, the classification error has a lower bound of \approx 14% (Breiman et al., 1984), the results of NetLines and MonoPlane presented here correspond to error-free training. The networks generated by NetLines have between three and six hidden neurons, depending on the training sets. The results obtained with a single perceptron trained with Minimerror and with the perceptron learning algorithm, which may be considered the extreme case of early stopping, are hardly improved by the more complex networks. Here again, the overfitting produced by error-free learning with NetLines does not cause the generalization performance to deteriorate.
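For concreteness, one common way of writing Breiman's generator is sketched below. The triangular bump shapes and their centers follow the usual description of this benchmark; they are our assumption and should be checked against Breiman et al. (1984) before reuse.

    import numpy as np

    def breiman_waveform(cls, rng):
        """One 21-dimensional pattern of class cls in {0, 1, 2}: a random
        convex mix of two of three triangular bumps, plus gaussian noise."""
        t = np.arange(1, 22)
        h1 = np.maximum(6 - np.abs(t - 11), 0)
        h2 = np.maximum(6 - np.abs(t - 7), 0)
        h3 = np.maximum(6 - np.abs(t - 15), 0)
        a, b = [(h1, h2), (h1, h3), (h2, h3)][cls]
        u = rng.uniform()                     # random convex mixing weight
        return u * a + (1 - u) * b + rng.standard_normal(21)

    rng = np.random.default_rng(0)
    x = breiman_waveform(0, rng)              # noisy mix of the first two bumps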
Figure 6: Breiman waveforms: Generalization error \varepsilon_g averaged over 11 tests versus number of parameters. Error bars on the number of weights generated by NetLines and MonoPlane are not visible at the scale of the figure. 1: linear discrimination; 2: perceptron; 3: backpropagation; 4: genetic algorithm; 5: quadratic discrimination; 6: Parzen's kernel; 7: K-NN; 8: constraint (Gascuel, 1995).
The TONs vote reduces the variance but does not decrease the average \varepsilon_g.
5.3.2 Fisher's Iris Plants Database. In this classic three-class problem, one has to determine the class of iris plants based on the values of N = 4 real-valued attributes. The database of D = 150 patterns contains 50 examples of each class. Networks were trained with P = 149 patterns, and the generalization error is the mean value over all 150 possible leave-one-out tests. Results of \varepsilon_g are displayed as a function of the number of weights in Figure 7. Error bars are available only for our own results. In this difficult problem, the vote of the three possible TONs trained with the three possible class sequences (see section 4) improves the generalization performance.
6 Conclusion
We presented an incremental learning algorithm for classification, which we call NetLines. It generates small feedforward neural networks with a single hidden layer of binary units connected to a binary output neuron. NetLines allows for an automatic adaptation of the neural network to the complexity of the particular task. This is achieved by coupling an error-correcting strategy for the successive addition of hidden neurons with Minimerror, a very efficient perceptron training algorithm.
Figure 7: Iris database: Generalization error \varepsilon_g versus number of parameters. 1: offset; 2: backpropagation (Martinez & Estève, 1992); 4,5: backpropagation (Verma & Mulawka, 1995); 3,6: gradient-descent orthogonalized training (Verma & Mulawka, 1995).
Learning is fast, not only because it reduces the problem to that of training single perceptrons, but mainly because there is no longer a need for the usual preliminary tests required to determine the correct architecture for the particular application. Theorems valid for binary as well as for real-valued inputs guarantee the existence of a solution with a bounded number of hidden neurons obeying the growth strategy.
The networks are composed of binary hidden units whose states constitute a faithful encoding of the input patterns. They implement a mapping from the input space to a discrete H-dimensional hidden space, H being the number of hidden neurons. Thus, each pattern is labeled with a binary word of H bits. This encoding may be seen as a compression of the pattern's information. The hidden neurons define linear boundaries, or portions of boundaries, between classes in input space. The network's output may be given a probabilistic interpretation based on the distance of the patterns to these boundaries.
Tests on several benchmarks showed that the networks generated by our incremental strategy are small, despite the fact that hidden neurons are appended until error-free learning is reached. Even when the networks obtained with NetLines are larger than those used by other algorithms, the generalization error remains among the smallest values reported. In noisy or difficult problems, it may be useful to stop the network's growth before the condition of zero training errors is reached.
This decreases overfitting, as smaller networks (with fewer parameters) are thus generated. However, the prediction quality (measured by the generalization error) of the classifiers generated with NetLines is not improved by early stopping. Our results were obtained without cross-validation or any data manipulation like boosting, bagging, or arcing (Breiman, 1994; Drucker, Schapire, & Simard, 1993). Those costly procedures combine the results of very large numbers of classifiers, with the aim of improving the generalization performance through the reduction of the variance. Because NetLines is a stable classifier, presenting small variance, we do not expect that such techniques would significantly improve our results.
Appendix
In this appendix we exhibit a particular solution to the learning strategy of NetLines. This solution is built in such a way that the cardinality of a convex subset of well-learned patterns, L_h, grows monotonically upon the addition of hidden units. Because this cardinality cannot be larger than the total number of training patterns, the algorithm must stop with a finite number of hidden units.
Suppose that h hidden units have already been included and that the output neuron still makes classification errors on patterns of the training set, called training errors. Among these wrongly learned patterns, let \nu be the one closest to the hyperplane normal to \vec{w}_h, called hyperplane-h hereafter. We define L_h as the subset of (correctly learned) patterns lying closer to hyperplane-h than \vec{\xi}^\nu. Patterns in L_h have 0 < \gamma_h^\mu < |\gamma_h^\nu|. The subset L_h and at least pattern \nu are well learned if the next hidden unit, h + 1, has weights:

\vec{w}_{h+1} = \tau_h^\nu \vec{w}_h - (1 - \varepsilon_h)\,\tau_h^\nu (\vec{w}_h \cdot \vec{\xi}^\nu)\,\hat{e}_0,    (A.1)
where \hat{e}_0 \equiv (1, 0, \ldots, 0). The conditions that both L_h and pattern \nu have positive stabilities (are correctly learned) impose that

0 < \varepsilon_h < \min_{\mu \in L_h} \frac{|\gamma_h^\nu| - \gamma_h^\mu}{|\gamma_h^\nu|}.    (A.2)
The following weights between the hidden units and the output will give the correct output to pattern \nu and to the patterns of L_h:

W_0(h + 1) = W_0(h) + \tau^\nu,    (A.3)
W_i(h + 1) = W_i(h) \quad \text{for } 1 \le i \le h,    (A.4)
W_{h+1}(h + 1) = -\tau^\nu.    (A.5)
Thus, card(L_{h+1}) \ge card(L_h) + 1. As the number of patterns in L_h increases monotonically with h, convergence is guaranteed with fewer than P hidden units.
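A minimal sketch of this constructive step, under the convention that component 0 of the weight and pattern vectors is the threshold input (so that \hat{e}_0 = (1, 0, \ldots, 0)), is as follows. Choosing an \varepsilon_h satisfying equation A.2 is left to the caller, and the function names are ours.

    import numpy as np

    def append_hidden_unit(w_h, xi_nu, tau_nu, eps_h):
        """Eq. A.1: weights of hidden unit h+1 from those of unit h and the
        wrongly learned pattern nu closest to hyperplane-h."""
        w_next = tau_nu * w_h.astype(float)
        w_next[0] -= (1.0 - eps_h) * tau_nu * np.dot(w_h, xi_nu)  # only e_0 changes
        return w_next

    def extend_output_weights(W, tau_nu):
        """Eqs. A.3-A.5: W_0 += tau_nu, W_1..W_h unchanged, W_{h+1} = -tau_nu."""
        W_new = W.astype(float).copy()
        W_new[0] += tau_nu
        return np.append(W_new, -tau_nu)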
Acknowledgments
J.M. thanks Consejo Nacional de Ciencia y Tecnología and Universidad Autónoma Metropolitana, México, for financial support (grant 65659).
References
Alpaydin, E. A. I. (1990). Neural models of supervised and unsupervised learning. Unpublished doctoral dissertation, Ecole Polytechnique Fédérale de Lausanne, Switzerland.
Biehl, M., & Opper, M. (1991). Tilinglike learning in the parity machine. Physical Review A, 44, 6888.
Bottou, L., & Vapnik, V. (1992). Local learning algorithms. Neural Computation, 4(6), 888–900.
Breiman, L. (1994). Bagging predictors (Tech. Rep. No. 421). Berkeley: Department of Statistics, University of California at Berkeley.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks/Cole.
Denker, J., Schwartz, D., Wittner, B., Solla, S., Howard, R., Jackel, L., & Hopfield, J. (1987). Large automatic learning, rule extraction, and generalization. Complex Systems, 1, 877–922.
Depenau, J. (1995). Automated design of neural network architecture for classification. Unpublished doctoral dissertation, Computer Science Department, Aarhus University.
Drucker, H., Schapire, R., & Simard, P. (1993). Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 42–49). San Mateo, CA: Morgan Kaufmann.
Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Farrell, K. R., & Mammone, R. J. (1994). Speaker recognition using neural tree networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 1035–1042). San Mateo, CA: Morgan Kaufmann.
Frean, M. (1990). The Upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2(2), 198–209.
Frean, M. (1992). A "thermal" perceptron learning rule. Neural Computation, 4(6), 946–957.
Friedman, J. H. (1996). On bias, variance, 0/1-loss, and the curse-of-dimensionality (Tech. Rep.). Stanford, CA: Department of Statistics, Stanford University.
Fritzke, B. (1994). Supervised learning with growing cell structures. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 255–262). San Mateo, CA: Morgan Kaufmann.
Gallant, S. I. (1986). Optimal linear discriminants. In Proc. 8th Conf. Pattern Recognition, Oct. 28–31, Paris, vol. 4.
Gascuel, O. (1995). Symenu. Collective paper (O. Gascuel, coordinator) (Tech. Rep.). 5èmes Journées Nationales du PRC-IA Teknea, Nancy.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Goodman, R. M., Smyth, P., Higgins, C. M., & Miller, J. W. (1992). Rule-based neural networks for classification and probability estimation. Neural Computation, 4(6), 781–804.
Gordon, M. B. (1996). A convergence theorem for incremental learning with real-valued inputs. In IEEE International Conference on Neural Networks (pp. 381–386).
Gordon, M. B., & Berchier, D. (1993). Minimerror: A perceptron learning rule that finds the optimal weights. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (pp. 105–110). Brussels: D Facto.
Gordon, M. B., & Grempel, D. (1995). Optimal learning with a temperature dependent algorithm. Europhysics Letters, 29(3), 257–262.
Gordon, M. B., Peretto, P., & Berchier, D. (1993). Learning algorithms for perceptrons from statistical physics. Journal of Physics I (France), 3, 377–387.
Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75–89.
Gyorgyi, G., & Tishby, N. (1990). Statistical theory of learning a rule. In W. K. Theumann & R. Koeberle (Eds.), Neural networks and spin glasses. Singapore: World Scientific.
Hoehfeld, M., & Fahlman, S. (1991). Learning with limited numerical precision using the cascade correlation algorithm (Tech. Rep. No. CMU-CS-91-130). Pittsburgh: Carnegie Mellon University.
Knerr, S., Personnaz, L., & Dreyfus, G. (1990). Single-layer learning revisited: A stepwise procedure for building and training a neural network. In J. Hérault & F. Fogelman (Eds.), Neurocomputing, algorithms, architectures and applications (pp. 41–50). Berlin: Springer-Verlag.
Marchand, M., Golea, M., & Ruján, P. (1990). A convergence theorem for sequential learning in two-layer perceptrons. Europhysics Letters, 11, 487–492.
Martinez, D., & Estève, D. (1992). The offset algorithm: Building and learning method for multilayer neural networks. Europhysics Letters, 18, 95–100.
Mézard, M., & Nadal, J.-P. (1989). Learning in feedforward layered networks: The Tiling algorithm. J. Phys. A: Math. and Gen., 22, 2191–2203.
Mukhopadhyay, S., Roy, A., Kim, L. S., & Govil, S. (1993). A polynomial time algorithm for generating neural networks for pattern classification: Its stability properties and some test results. Neural Computation, 5(2), 317–330.
Nadal, J.-P. (1989). Study of a growth algorithm for a feedforward neural network. Int. J. Neur. Syst., 1, 55–59.
Prechelt, L. (1994). PROBEN1—A set of benchmarks and benchmarking rules for neural network training algorithms (Tech. Rep. No. 21/94). University of Karlsruhe, Faculty of Informatics.
Raffin, B., & Gordon, M. B. (1995). Learning and generalization with Minimerror, a temperature dependent learning algorithm. Neural Computation, 7(6), 1206–1224.
Reilly, D. E., Cooper, L. N., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35–41.
Roy, A., Kim, L., & Mukhopadhyay, S. (1993). A polynomial time algorithm for the construction and training of a class of multilayer perceptron. Neural Networks, 6(1), 535–545.
Sirat, J. A., & Nadal, J.-P. (1990). Neural trees: A new tool for classification. Network, 1, 423–438.
Solla, S. A. (1989). Learning and generalization in layered neural networks: The contiguity problem. In L. Personnaz & G. Dreyfus (Eds.), Neural networks from models to applications. Paris: I.D.S.E.T.
Thrun, S. B., et al. (1991). The MONK's problems: A performance comparison of different learning algorithms (Tech. Rep. No. CMU-CS-91-197). Pittsburgh: Carnegie Mellon University.
Torres Moreno, J.-M., & Gordon, M. B. (1995). An evolutive architecture coupled with optimal perceptron learning for classification. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks. Brussels: D Facto.
Torres Moreno, J.-M., & Gordon, M. B. (1998). Characterization of the sonar signals benchmark. Neural Proc. Letters, 7(1), 1–4.
Vapnik, V. (1992). Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 831–838). San Mateo, CA: Morgan Kaufmann.
Verma, B. K., & Mulawka, J. J. (1995). A new algorithm for feedforward neural networks. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (pp. 359–364). Brussels: D Facto.
Wolberg, W. H., & Mangasarian, O. L. (1990). Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences, USA, 87, 9193–9196.
Received February 13, 1997; accepted September 4, 1997.
LETTER
Communicated by John Platt
A Systematic and Effective Supervised Learning Mechanism Based on Jacobian Rank Deficiency
Guian Zhou
Jennie Si
Department of Electrical Engineering, Arizona State University, Tempe, AZ 85287-5706, U.S.A.
Most neural network applications rely on the fundamental approximation property of feedforward networks. Supervised learning is a means of implementing this approximate mapping. In a realistic problem setting, a mechanism is needed to devise this learning process based on available data, which encompasses choosing an appropriate set of parameters in order to avoid overfitting, using an efficient learning algorithm measured by computation and memory complexities, ensuring the accuracy of the training procedure as measured by the training error, and testing and cross-validation for generalization. We develop a comprehensive supervised learning algorithm to address these issues. The algorithm combines training and pruning into one procedure by utilizing a common observation of Jacobian rank deficiency in feedforward networks. The algorithm not only reduces the training time and overall complexity but also achieves training accuracy and generalization capabilities comparable to more standard approaches. Extensive simulation results are provided to demonstrate the effectiveness of the algorithm.
Neural Computation 10, 1031–1045 (1998) © 1998 Massachusetts Institute of Technology
1 Introduction
Multilayer feedforward networks have been applied successfully to solve such difficult and diverse problems as nonlinear system identification and control (Narendra & Parthasarathy, 1990; Hertz, Krogh, & Palmer, 1991; White & Sofge, 1992; Haykin, 1994). In supervised learning, one starts with a training set and uses certain numerical procedures, which usually solve the nonlinear optimization problem deduced by supervised learning, to compute the network parameters (weights) by loading training data pairs into the network. The hope is that the network will generalize, meaning that the input-output relationship computed by the network meets expectations, as measured by the testing error, on input-output data pairs never used in training. Poor generalization may occur due to overfitting. Generalization is generally affected by three factors: the size and efficiency of the training data set, the complexity of the network as represented by the number of weights, and the complexity of the physical problem.
In most neural network application problems, the size of the training set is given; it may be large or small. The issue addressed in this article is how to determine a set of network parameters for achieving good generalization. This may include choosing the right network size and using an efficient algorithm for determining the weights for the network to achieve desired accuracy for training, validation, and testing.
Several pruning methods have been proposed in recent years in the effort to select a correct network size. The network is first trained by a usual network training procedure, such as backpropagation, and then the network size is adjusted using a pruning algorithm. Usually these algorithms require retraining of the network, which results in extensive computation. Skeletonization (Mozer & Smolensky, 1989) removes the nodes that have the least influence on the output error, according to the derivatives of the output error with respect to the coefficients representing the attentional strengths of the nodes. This requires a complete retraining of the network, since correlations between neuron activities are not taken into account. Optimal Brain Damage (Le Cun, Denker, & Solla, 1990) deletes the least salient weights based on a "diagonal" approximation of the Hessian, which assumes that the increment of the output error caused by deleting several parameters is the sum of the increments of the errors caused by each parameter individually; nondiagonal elements of the Hessian matrix are neglected. The diagonal assumption may result in deleting the wrong weights. Optimal Brain Surgeon (Hassibi & Stork, 1993) requires computing the entire Hessian matrix, which is computationally costly for large networks. Principal Components Pruning (Levin, Leen, & Moody, 1994) is based on principal component analysis of the linear node activations of successive layers of a network. But the influence of pruning the least salient eigennodes and related network parameters in a nonlinear network is uncertain, unlike in linear systems; the effectiveness of the algorithm may therefore require further theoretical and experimental justification.
Some other algorithms combine training and pruning into one comprehensive process. Weight Decay (Hinton, 1986) and Weight Elimination (Weigend, Rumelhart, & Huberman, 1991) add a term to the training objective function that penalizes large network complexities. The proper coefficient for this term is not known a priori; it usually takes many repeated trials with different coefficient values to determine this extra term.
The systematic approach we introduce, taking training and pruning into account in each iteration, is based on Jacobian rank deficiency. Saarinen, Bramley, and Cybenko (1991) observe that the Jacobian matrix derived from feedforward network training is commonly and highly rank deficient because of saturation characteristics of node sigmoid functions, linear dependencies among node outputs, linear dependencies among local gradients, and so forth. Experiments have revealed that the rank of a deficient Jacobian matrix is about 60% to 80% of the size of the Jacobian on average, and it may reach as low as 20%.
Jacobian matrices with deficient ranks on one hand render the Gauss-Newton and certain higher-order algorithms inapplicable, and on the other hand indicate that some weights in the network are redundant. Some modifications—for example, the Levenberg-Marquardt algorithm—are designed to make the cross-product matrices of the Jacobian matrix positive definite. Although satisfactory convergence properties are obtained in network training (Hagen & Menhaj, 1994), the computational complexities are usually high and the weight redundancy issue is ignored.
We propose a new training algorithm that updates the network weights corresponding to a more "efficient" Jacobian matrix (less rank deficient) and prunes the rest of the weights in each iteration. This overcomes the overfitting introduced by a highly complex network and reduces computation and memory complexities in network training. The algorithm presented in this article is derived in the spirit of the Gauss-Newton method and has convergence properties similar to those of the Levenberg-Marquardt algorithm. The details of the derivation of the training algorithm are described in the next two sections, and implementation is given in section 4. Further experimental evaluations of the algorithm are provided in section 5. Some final remarks are made in section 6.
2 The Gauss-Newton Algorithm under Jacobian Rank Deficiency
The objective of training a feedforward network is to associate input-output training pairs {(x^{(1)}, t^{(1)}), (x^{(2)}, t^{(2)}), \ldots, (x^{(m)}, t^{(m)})} by properly adjusting the weights in the network such that the following sum-of-squared-error function is minimized:

E(w) = \frac{1}{2}\sum_{i=1}^{m} e_{(i)}^2 = \frac{1}{2}\sum_{i=1}^{m} \left(\hat{y}_{(i)} - t_{(i)}\right)^2,    (2.1)
where w \in R^n denotes the weight vector, \hat{y}_{(i)} represents the network output when the input is x^{(i)}, and e_{(i)} is the error between \hat{y}_{(i)} and t^{(i)}. We have implied in equation 2.1 that the network has a single output to simplify the notation. The results we obtained are also applicable to multiple-output cases by properly enumerating the summation terms in equation 2.1.
When an iterative Gauss-Newton update rule (Dennis & Schnabel, 1983) is employed, the weight change \Delta w^{(k)} at the kth iteration from w^{(k)} is computed by

(J^T J)\,\Delta w^{(k)} = -J^T e,    (2.2)

where e = [e_{(1)}, e_{(2)}, \ldots, e_{(m)}]^T and J \in R^{m \times n} is the Jacobian matrix at the kth iteration, whose elements are defined as the derivatives of each error with respect to each of the n weights.
Obviously, when J is rank deficient, often the case for neural network problems, J^T J is not invertible and equation 2.2 cannot be applied directly. To solve for \Delta w^{(k)}, we make a diagonal pivoting triangular factorization (Dongarra, Bunch, Moler, & Stewart, 1979, pp. 8.3–8.5) on J^T J, that is, (JP)^T (JP) = L^T D L, and thereby obtain

L^T D L\, \Delta\tilde{w}^{(k)} = -(JP)^T e,    (2.3)
where P \in R^{n \times n} is a pivoting matrix, \Delta\tilde{w}^{(k)} = P^{-1} \Delta w^{(k)}, L \in R^{n \times n} is an upper triangular matrix with unit diagonal elements, and D \in R^{n \times n} is a diagonal matrix of the form D = diag(d_1, d_2, \ldots, d_r, 0, \ldots, 0), with d_i \ge d_j for i < j, where r is the rank of J. Let b = (L^T)^{-1} (JP)^T e; we have

D L\, \Delta\tilde{w}^{(k)} = -b.    (2.4)
Since D is not full rank, we represent D, L, \Delta\tilde{w}^{(k)}, and b in appropriate block matrix forms as

D = \begin{bmatrix} D_1 & 0 \\ 0 & 0 \end{bmatrix}, \quad
L = \begin{bmatrix} L_1 & L_2 \\ 0 & L_3 \end{bmatrix}, \quad
\Delta\tilde{w}^{(k)} = \begin{bmatrix} \Delta\tilde{w}_1^{(k)} \\ \Delta\tilde{w}_2^{(k)} \end{bmatrix}, \quad
b = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix},    (2.5)
where D_1 \in R^{r \times r} is a nonzero diagonal matrix, L_1 \in R^{r \times r} is an upper triangular matrix, and the other submatrices and subvectors have the proper dimensions. Substituting the block matrices of equation 2.5 into equation 2.4, one can verify that b_2 = 0. Equation 2.4 then becomes

D_1 \left( L_1 \Delta\tilde{w}_1^{(k)} + L_2 \Delta\tilde{w}_2^{(k)} \right) = -b_1.    (2.6)
The above derivation, in the spirit of the Gauss-Newton algorithm under Jacobian rank deficiency, divides the network weights into two subsets: \Delta\tilde{w}_1^{(k)} and \Delta\tilde{w}_2^{(k)}. The former is a principal component subset extracted from the entire weight set and therefore has major influence on the network at the kth iteration; the latter is insignificant in the sense that L_2 is usually dependent on L_1. In the following, we provide means of updating \Delta\tilde{w}_1^{(k)} while eliminating \Delta\tilde{w}_2^{(k)} simultaneously.
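The experiments in section 5 use a Fortran 77 implementation; purely as an illustration of the factorization just described, here is a minimal NumPy sketch of the diagonal-pivoting L^T D L factorization, including the effective-rank cutoff d_p <= eps * d_1 used in section 4. The function name and tolerance handling are ours.

    import numpy as np

    def pivoted_ldl(A, eps=1e-8):
        """Diagonal-pivoting factorization (JP)^T (JP) = L^T D L of a symmetric
        positive semidefinite A = J^T J; returns permutation p, unit
        upper-triangular L, diagonal d, and the detected rank r."""
        A = A.astype(float).copy()
        n = A.shape[0]
        p, L, d, r = np.arange(n), np.eye(n), np.zeros(n), n
        for k in range(n):
            j = k + int(np.argmax(np.diag(A)[k:]))  # largest remaining diagonal
            A[[k, j], :] = A[[j, k], :]
            A[:, [k, j]] = A[:, [j, k]]
            p[[k, j]] = p[[j, k]]
            L[:k, [k, j]] = L[:k, [j, k]]   # keep already-computed rows consistent
            d[k] = A[k, k]
            if d[k] <= eps * d[0]:          # effective rank reached
                d[k], r = 0.0, k
                break
            L[k, k + 1:] = A[k, k + 1:] / d[k]
            A[k + 1:, k + 1:] -= d[k] * np.outer(L[k, k + 1:], L[k, k + 1:])
        return p, L, d, r

    # toy check on a deliberately rank-deficient normal matrix (rank 5, n = 8)
    rng = np.random.default_rng(0)
    J = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 8))
    p, L, d, r = pivoted_ldl(J.T @ J)
    assert r == 5 and np.allclose(J[:, p].T @ J[:, p], L.T @ np.diag(d) @ L)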
3 A Systematic Training Algorithm Based on Jacobian Rank Deficiency
Recall equation 2.6. Let \Delta\tilde{w}_2^{(k)} = 0; we obtain

\Delta\tilde{w}_1^{(k)} = -L_1^{-1} D_1^{-1} b_1.    (3.1)

However, this update rule often provides inexact or inaccurate search directions, since it is very sensitive to the elements of D_1 with small values. On the other hand, the inner product of the search direction and the gradient J^T e is often small, which implies the two are nearly orthogonal. Even with adjustable step sizes, we still observe poor convergence properties. To improve upon this situation, we modify equation 3.1 by adding an rth-order symmetric positive definite matrix S as follows,

\Delta\tilde{w}_1^{(k)} = -\left( L_1^T D_1 L_1 + \mu S \right)^{-1} L_1^T b_1,    (3.2)

where \mu is a positive trial-and-error parameter, decreased after each successful step and increased only if a step increases the training error. Let S = L_1^T L_1. Equation 3.2 then becomes

\Delta\tilde{w}_1^{(k)} = -\left( L_1^T D_1 L_1 + \mu L_1^T L_1 \right)^{-1} L_1^T b_1 = -L_1^{-1} \left( D_1 + \mu I_1 \right)^{-1} b_1,    (3.3)

where I_1 is an identity matrix of the proper dimension. It can be seen from equation 3.3 that if a given \mu within an iteration causes an increase in the training error, one has to solve the equation only one more time for \Delta\tilde{w}_1^{(k)} with an increased \mu, by backward substitution, at a computational complexity of only O(r^2/2). More important, equation 3.3 is much more numerically stable and provides satisfactory convergence properties in network training.
When the above modification is extended to the general case, that is, \Delta\tilde{w}_2^{(k)} \neq 0 in equation 2.6, we have

\Delta\tilde{w}_1^{(k)} = -L_1^{-1} \left( D_1 + \mu I_1 \right)^{-1} \left( b_1 + D_1 L_2 \Delta\tilde{w}_2^{(k)} \right).    (3.4)

Here \Delta\tilde{w}_2^{(k)} needs to be predetermined. This provides an opportunity to eliminate \tilde{w}_2^{(k)}, and thereby reduce the network size, by properly designing \Delta\tilde{w}_2^{(k)}. Toward this goal, let

\Delta\tilde{w}_2^{(k)} = -\tilde{\Lambda}(k)\, \tilde{w}_2^{(k)},    (3.5)
or, equivalently,

\Delta\tilde{w}_{2i}^{(k)} = -\tilde{\lambda}_i(k)\, \tilde{w}_{2i}^{(k)}, \quad \text{for } i = 1, \ldots, n - r,    (3.6)

where \tilde{\Lambda}(k) is a diagonal matrix with diagonal elements 0 \le \tilde{\lambda}_i(k) \le 1, i = 1, \ldots, n - r; \Delta\tilde{w}_{2i}^{(k)} and \tilde{w}_{2i}^{(k)} are the ith elements of \Delta\tilde{w}_2^{(k)} and \tilde{w}_2^{(k)}, respectively.
Note that the components and the dimensions of the weight subsets \tilde{w}_1^{(k)} and \tilde{w}_2^{(k)} sometimes vary between iterations, since each element of the weights may move from one subset to the other. Obviously a weight component should not be eliminated if it is in \tilde{w}_1^{(k)}; but once the weight moves from \tilde{w}_1^{(k)} into \tilde{w}_2^{(k)}, it should be reduced, and if the weight has stayed in \tilde{w}_2^{(k)} for several successive iterations, it should be reduced in magnitude even more significantly. A formula incorporating these ideas is proposed to compute \tilde{\lambda}_i(k) (i = 1, \ldots, n - r):

\tilde{\lambda}_i(k) = \begin{cases} 0, & k \le \tilde{\sigma}_i(k) \\ (k - \tilde{\sigma}_i(k))/N, & \tilde{\sigma}_i(k) < k \le \tilde{\sigma}_i(k) + N \\ 1, & k > \tilde{\sigma}_i(k) + N, \end{cases}    (3.7)

where N is a positive integer and \tilde{\sigma}_i(k) is a parameter depending on the subset to which the weight component belonged during the previous iteration,

\tilde{\sigma}_i(k) = \begin{cases} \tilde{\sigma}_i(k-1), & \text{if } \Delta\tilde{w}_{2i}^{(k)} \text{ belonged to } \Delta\tilde{w}_2^{(k-1)} \text{ in the } (k-1)\text{th iteration} \\ k + \tilde{\sigma}_0, & \text{else,} \end{cases}    (3.8)

where \tilde{\sigma}_0 is an integer. Hence, a systematic neural network training algorithm is obtained from equations 3.4 and 3.5. First, equation 3.5 is used to determine \Delta\tilde{w}_2^{(k)} such that \tilde{w}_2^{(k)} is reduced; then equation 3.4 is applied to obtain a successful solution of \Delta\tilde{w}_1^{(k)}. This process is repeated until the training and validation errors are less than their tolerances and most of the weights in \tilde{w}_2^{(k)} are eliminated. In this procedure, L_2 in equation 3.4 has to be computed exactly. A similar but simpler training procedure is the combination of equations 3.3 and 3.5: within one iteration, \Delta\tilde{w}_2^{(k)} is first assumed to be 0; equation 3.3 is then used to obtain a successful solution of \Delta\tilde{w}_1^{(k)}, which makes the training error decrease; \Delta\tilde{w}_2^{(k)} is finally updated by equation 3.5 so that \tilde{w}_2^{(k)} tends to zero in the long run.
Thus, we have in fact obtained two alternative algorithms for neural network training. The weight updating rules within one iteration are highlighted in the following:
Table 1: An Approximate Estimation of Computational Complexities.

                   PTNT Algorithm                                    Levenberg-Marquardt Algorithm
Memory             r^2/2 ~ rn - r^2/2, for L_1, L_2                  n^2/2 for J^T J; n^2/2 for the factor of (J^T J + mu I)
Multiplications    m(r^2/2 ~ rn - r^2/2) for (JP)^T JP;              m n^2/2 for J^T J; n^3/3 for two factorizations
                   r^3/6 ~ n r^2/2 - r^3/3 for L_1, L_2
Additions          Same as above                                    Same as above
Version I: Network Tuning and Parameter Tuning (NTPT)
1. \Delta\tilde{w}_2^{(k)} = -\tilde{\Lambda}(k)\, \tilde{w}_2^{(k)};
2. \Delta\tilde{w}_1^{(k)} = -L_1^{-1} (D_1 + \mu I_1)^{-1} (b_1 + D_1 L_2 \Delta\tilde{w}_2^{(k)}).

Version II: Parameter Tuning and Then Network Tuning (PTNT)
1. \Delta\tilde{w}_1^{(k)} = -L_1^{-1} (D_1 + \mu I_1)^{-1} b_1 (implying \Delta\tilde{w}_2^{(k)} = 0 at this step);
2. \Delta\tilde{w}_2^{(k)} = -\tilde{\Lambda}(k)\, \tilde{w}_2^{(k)}.

It can be seen that L_2 is "swept out" in version II, which may provide a certain reduction in computational complexity. Therefore, the remainder of the article concentrates on the PTNT algorithm; a minimal numerical sketch of one PTNT update follows.
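As just noted, here is a minimal NumPy sketch of one Version II (PTNT) update, combining equations 3.3, 3.5, and 3.7. It is an illustration only: p, L, d, and r are assumed to come from the pivoted factorization of J^T J described in section 2; the schedules for mu and for the decay factors are computed outside; and all names are ours.

    import numpy as np

    def decay_factor(k, sigma_i, N):
        """Eq. 3.7: shrink factor ramping linearly from 0 to 1 over N steps."""
        return float(np.clip((k - sigma_i) / N, 0.0, 1.0))

    def ptnt_step(J, e, w, p, L, d, r, mu, lam):
        """One PTNT weight update. lam is the length-(n-r) vector of decay
        factors for the redundant subset."""
        g = J[:, p].T @ e                    # (JP)^T e
        L1 = L[:r, :r]
        b1 = np.linalg.solve(L1.T, g[:r])    # forward substitution: L1^T b1 = g_1
        dw1 = -np.linalg.solve(L1, b1 / (d[:r] + mu))   # eq. 3.3
        wt = w[p]                            # w~ = P^{-1} w is a permutation
        dw2 = -lam * wt[r:]                  # eq. 3.5: shrink redundant weights
        w_new = w.copy()
        w_new[p] = wt + np.concatenate([dw1, dw2])      # w += P dw~
        return w_new

In the full procedure of section 4, mu is multiplied by gamma and the solve is repeated whenever the training error increases, and weights whose decay factor has reached 1 are removed from the network.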
4 Implementation of the PTNT Algorithm
In the following, we describe the implementation in feedforward network training of the PTNT algorithm developed in the previous section.
1. The diagonal pivoting triangular factorization on J^T J. Ideally, the factorization proceeds row by row until the pth element d_p of D is equal to 0. However, d_p is hardly ever exactly zero in digital simulation. In implementation, once d_p \le \epsilon d_1 (\epsilon small), it is considered that an effective value for the rank of D has been reached, that is, r = p - 1, and the factorization process is terminated. Major computational requirements in the PTNT algorithm are due to the factorization. Not counting the computational requirements for the network inputs, node outputs, and node local gradients, the computational requirements for the PTNT in one iteration can be estimated as in Table 1. As a comparison, the computational requirements of the Levenberg-Marquardt algorithm are also listed in the same table. Note that the Levenberg-Marquardt algorithm usually requires two trials of \mu within each iteration. Thus, it requires n^2 in memory to store both the lower triangular part of J^T J and the decomposed upper triangular factor of (J^T J + \mu I). It also takes two factorizations of (J^T J + \mu I), which demand n^3/3 multiplications and the same number of additions. Consequently, the complexity of the PTNT algorithm is much lower than that of the Levenberg-Marquardt in each iteration.
2. The trial-and-error parameter \mu. The parameter is initialized to a small value \mu_{min} and decreased by a factor \gamma (> 1) after each successful step that makes the training error function decrease. If the present value of \mu within an iteration causes an increase in the training error, \mu is increased by a factor \gamma, and one has to solve equation 3.3 only one more time for \Delta\tilde{w}_1^{(k)} by a backward substitution. But \mu has an upper bound, \mu_{max}. If \mu > \mu_{max} (this situation rarely happens), we could reset \mu = \mu_{min} and change \tilde{w}_2^{(k)} with equation 3.5, or disturb \tilde{w}_2^{(k)} with a small noise, so that the optimization process can escape from the operating point.
3. The decaying parameters N and \sigma_0. A very small N implies significant changes of the weights in \tilde{w}_2^{(k)}, which may make the training error function increase so much that the training process does not converge. A too-large N, however, may hardly reduce the weights to zero. An experiential choice of N = 20–30 usually enables the training process to converge smoothly to the desired accuracy, with many weights in \tilde{w}_2^{(k)} approaching zero. \sigma_0 is also an important parameter, usually chosen as \sigma_0 \ge 0. It works as a threshold that keeps a weight in \tilde{w}_2^{(k)} unvaried for a certain number of iterations, in order to maintain overall accuracy and reduce redundancy.
4. Stopping criteria. The error tolerance \varepsilon and the minimum and maximum iteration numbers K_{min} and K_{max} serve as stopping criteria. Once the error function E(w^{(k)}) is lower than \varepsilon and the network complexity has been reduced to an acceptable level, the training process can be terminated. In implementation, the objective that redundant weights be removed can be realized by requiring the iteration number to exceed a given value K_{min} (\ge \sigma_0 + N). If E(w^{(k)}) cannot be brought below \varepsilon within K_{max} iterations, the training process is considered failed and is terminated. Real-world problems are often noisy, and the above stopping criteria may not be enough for trained networks to generalize well; thus, an additional validation error criterion may also be used: the validation error function E_V(w^{(k)}), defined as in equation 2.1 on a given validation data set, should also be less than a given tolerance \delta.
Summarizing, given \epsilon, \varepsilon, \delta, \gamma, \mu_{min}, \mu_{max}, \sigma_0, N, K_{min}, and K_{max}, a complete training process can be implemented as follows:
1. Initialization: 0 \Rightarrow k, \mu_{min} \Rightarrow \mu, and randomly initialize w^{(0)}.
2. Forward propagation: Propagate function signals from the input layer to the output layer to obtain \hat{y}_{(i)} and e_{(i)} (i = 1, 2, \ldots, m) as well as E(w^{(k)}) in equation 2.1.
3. Stopping criteria: If E(w^{(k)}) \le \varepsilon, E_V(w^{(k)}) \le \delta, and k \ge K_{min}, then we have reached a desired solution w^* = w^{(k)}; stop. If k > K_{max}, the process has failed and should be stopped. Else go to step 4.
4. Back propagation: Backpropagate error signals from the output layer to the input layer to obtain the local gradients of each node.
5. Factorization: Compute the pth element d_p of D, the pth element of (JP)^T e, and the right half of the pth row from the diagonal element of (JP)^T JP, and factorize them into the corresponding elements of D, b, and L. Proceed with the factorization row by row until d_p \le \epsilon d_1.
6. Searching for \Delta\tilde{w}_1^{(k)}: Set \Delta\tilde{w}_2^{(k)} = 0 and compute \Delta\tilde{w}_1^{(k)} = -L_1^{-1} (D_1 + \mu I_1)^{-1} b_1. If E(w^{(k)} + P [\Delta\tilde{w}_1^{(k)}; 0]) < E(w^{(k)}), then \Delta\tilde{w}_1^{(k)} is a successful update; go to step 7. Otherwise, \mu \gamma \Rightarrow \mu and repeat this step; but if \mu > \mu_{max}, set \mu_{min} \Rightarrow \mu and go to step 7.
7. Reducing \tilde{w}_2^{(k)}: Compute \tilde{\Lambda}(k) by equations 3.7 and 3.8, and then reduce \tilde{w}_2^{(k)} by \Delta\tilde{w}_2^{(k)} = -\tilde{\Lambda}(k)\, \tilde{w}_2^{(k)}.
8. Updating weights: w^{(k+1)} = w^{(k)} + P [\Delta\tilde{w}_1^{(k)}; \Delta\tilde{w}_2^{(k)}].
9. Continuation: k + 1 \Rightarrow k and \max(\mu/\gamma, \mu_{min}) \Rightarrow \mu. Then go back to step 2.
5 Applications and Performance Evaluation
In the following, we use three examples to evaluate quantitatively the effectiveness of the PTNT algorithm in terms of training time, training and testing accuracies, and network complexity. The corresponding parameters are chosen as \gamma = \sqrt{10}, \mu_{min} = 10^{-2} (except in example 2), \mu_{max} = 10^6, N = 25, and \sigma_0 = 0. As a basis of comparison, the fast backpropagation (FBP) and the Levenberg-Marquardt (LM) algorithms as provided in the Neural Network Toolbox of Matlab, and a weight elimination algorithm (SWE)¹ similar to Weigend et al. (1991), are used to train the same problems. The parameters of the SWE, such as the momentum constant \alpha and the adaptive learning rate \beta, are chosen in principle the same way as those of the FBP; the regularization parameter \eta is set by trial and error.
¹ SWE update rule: \Delta w^{(k)} = \alpha\, \Delta w^{(k-1)} - \beta\, \frac{\partial}{\partial w^{(k)}} \left( E(w^{(k)}) + \frac{\eta}{2} \sum_{i=1}^{n} \frac{(w_i^{(k)})^2}{1 + (w_i^{(k)})^2} \right).
Figure 1: A curve-fitting problem.
The parameter \mu in the LM is updated in the same way as in the PTNT. Since the training time of the PTNT depends not only on the error tolerance but also on K_{min}, which reflects the desired network reduction, the PTNT algorithm without network tuning (the PT), that is, with \Delta\tilde{w}_2^{(k)} = 0 throughout the training process, was also used for training networks, to give a lower bound on the training times of the PTNT. All of the algorithms are coded in Fortran 77 and implemented on a Digital ALPHA-600 computer.
Neural networks considered in this article are of the two-layer feedforward type with sigmoid hidden nodes and linear output nodes. Network weights are initialized with uniformly distributed random numbers in [-0.5, 0.5]. The training process is terminated if E(w^{(k)}) \le \varepsilon and E_V(w^{(k)}) \le \delta, or when the iterations reach K_{max} = 10,000 for the FBP and the SWE or K_{max} = 100 for the LM, the PTNT, and the PT. K_{min} = 25 is employed in the PTNT for all examples.
5.1 Example 1: A Curve-Fitting Problem. This example, also considered in the Neural Network Toolbox of Matlab, is to fit the 21 points shown with + signs in Figure 1. A relatively large network N[1×20×1] of 61 weights was used to demonstrate the applicability of the PTNT. Fifty experimental runs were carried out for different sets of arbitrary initial weights. The average initial error E(w^{(0)}) over the 50 runs was 9.3, and the training objective for the error function was 0.01 (a training result with training error slightly less than 0.01 is shown as a continuous curve in Figure 1). Only training accuracy is considered for this example. Table 2 summarizes the average results over the corresponding runs.
The PTNT provided training and pruning results better than or comparable to the others, except in one case when it did not converge to the desired training accuracy because its rank was too low.
Table 2: Simulations for a Curve-Fitting Problem.
            Runs  Iterations  Rank  Removed Weights  Training Error  CPU (s)
PTNT (c)      49          32   9.5               50          0.0077    0.033
PTNT (n)       1          51   7.0               54          0.7217    0.047
PT (c)        50          24   9.5             N.A.          0.0085    0.031
LM (c)        50          12  61.0             N.A.          0.0078    0.053
FBP (c)       43       4,472  N.A.             N.A.          0.0100    1.457
FBP (n)        7      10,000  N.A.             N.A.          0.0118    3.215
SWE (n)       50      10,000  N.A.               27          0.0249    3.261

Note: (c): converged to the required accuracies in training and validation; (n): not (c).
It removed 50 weights from the network on average, with a maximum of 51. The maximum Jacobian rank in the PTNT was 14. The PTNT took (14 × 61 - 14^2/2) × 4 more bytes to store L_1 and L_2 than the FBP or the SWE, but much less CPU time than both. Forty-three runs converged to the desired training accuracy using the FBP. \eta in the SWE was chosen relatively large so that more weights could be eliminated; here \eta = 0.05 was chosen out of 10 different experiments on \eta. The SWE removed 27 weights from the network on average, with a maximum of 48, but the corresponding training error became larger than the required accuracy. The LM took the fewest iterations but had the highest memory complexity, O(61^2). Because the LM operates on a larger-rank matrix than the PTNT, its computation time was still greater than that of the PTNT for this example.
5.2 Example 2: Sunspot Time Series. The sunspot series has served as a benchmark in the statistics and neural networks literature. As in Weigend, Huberman, and Rumelhart (1992), we use the data from 1700 through 1920, from 1921 to 1955, and from 1956 to 1979 for training, validation, and testing, respectively. A network N[12×8×1] with 113 weights was trained to predict the current strength of the sunspots by feeding in 12 past records. The ith training pattern is formulated as the input x^{(i)} = [y_{i-1}, y_{i-2}, \ldots, y_{i-12}]^T with target output t^{(i)} = y_i, where y_i is the average sunspot strength in the ith year (a data-preparation sketch is given below); the year corresponding to i = 1 is 1712. Fifty experimental runs were carried out with the various algorithms for different sets of arbitrary initial weights. The average initial error E(w^{(0)}) over the 50 runs was 56.23. The training and validation tolerances are \varepsilon = 0.85 and \delta = 0.15, respectively.
For this set of real-life data, it was observed that the validation criterion was usually met first; as training proceeded, the training error reduced gradually while the validation error often increased. This phenomenon was obvious for the PTNT, PT, and LM, due to their relatively fast convergence characteristics in reducing training error.
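As referred to above, the data preparation is a simple sliding window; the following sketch assumes that `sunspots` holds the annual record as a 1-D array indexed from the year 1700, and the function name is ours.

    import numpy as np

    def make_patterns(series, order=12):
        """x_i = [y_{i-1}, ..., y_{i-order}] with target t_i = y_i."""
        X = np.stack([np.asarray(series[i - order:i])[::-1]
                      for i in range(order, len(series))])
        t = np.asarray(series[order:])
        return X, t

    # splits used in the paper: 1700-1920 training, 1921-1955 validation,
    # 1956-1979 testing; the first usable target year is then 1712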
Table 3: Simulation Results for the Sunspot Data.

             η     Runs  Iterations  Rank  Removed Weights  Training Error  Validation Error  Testing Error  CPU (s)
PTNT (c)   N.A.      30          30    46               62           0.780             0.138          0.406     0.96
PTNT (n)   N.A.      20         100    40               73           0.699             0.204          0.650     2.24
SWE (c)    0.075     42       3,173  N.A.                0           0.849             0.148          0.405     8.55
SWE (n)    0.075      8      10,000  N.A.                3           0.840             0.174          0.506    26.72
SWE (c)    0.10       6       3,950  N.A.                0           0.850             0.148          0.417    10.63
SWE (n)    0.10      44      10,000  N.A.                7           0.861             0.176          0.506    26.70
SWE (n)    0.15      50      10,000  N.A.               56           0.897             0.181          0.529    26.51
SWE (n)    0.20      50      10,000  N.A.               78           0.914             0.180          0.533    26.38

Note: For abbreviations, see Table 2.
This may imply that, due to overfitting, the PTNT, PT, and LM had been fitting through noise in the training data. One way to circumvent this was to increase \mu_{min} to 0.5 in this example, decreasing their learning speed so that training and validation errors and network size could reach a satisfactory level after a certain number of iterations.
Simulation results with the PTNT are listed in Table 3. The maximum Jacobian rank in the PTNT was 63 in this example. The PTNT took (63 × 113 - 63^2/2) × 4 more bytes to store L_1 and L_2 than the SWE, but much less CPU time. Thirty runs converged with the PTNT. As a singled-out case, we obtained 0.846, 0.109, and 0.251 as training, validation, and testing errors, respectively, in one of the 30 converged runs; in the same case, the number of removed weights reached 79. More than 20 values of the regularization parameter \eta were tried for the SWE in order to choose a better objective function. Simulation results for several values of \eta are listed in Table 3. The performance of the SWE is obviously quite sensitive to \eta.
We now ignore the pruning process and compare the PT, the FBP, and the LM. Table 4 summarizes the average training, validation, and testing results over the 50 runs. The LM is faster but costs (113^2 × 4) more bytes than the FBP, for example. The PT costs less CPU time than the FBP or the LM. The PT requires more memory than the FBP but much less than the LM. In principle, the PT has the same memory complexity as the PTNT when they have the same rank; but for this example, the memory complexity of the PT is higher than that of the PTNT, since the maximum Jacobian rank of the PT is 71 rather than the 63 of the PTNT.
5.3 Example 3: A Distillation Process Identification. A distillation column that separates methanol-isopropanol binary mixtures is considered in this example.
Table 4: Simulation Results for the Sunspot Data.

           Runs  Iterations  Rank  Training Error  Validation Error  Testing Error  CPU (s)
PT (c)       29          22    44           0.791             0.142          0.411     0.64
PT (n)       21         100    54           0.575             0.207          0.729     3.63
LM (c)       37          16   113           0.785             0.146          0.421     1.12
LM (n)       13         100   113           0.539             0.194          0.552     6.90
FBP (c)      42       1,841  N.A.           0.838             0.146          0.418     4.94
FBP (n)       8      10,000  N.A.           0.676             0.181          0.542    26.72

Note: For abbreviations, see Table 2.
Table 5: Simulation Results for a Distillation Process.

           Runs  Iterations  Rank  Removed Weights  Training Error  Validation Error  Testing Error  CPU (s)
PTNT (c)     30          39   150              252          0.0323            0.0254         0.0291     50.7
PT (c)       30          23   173             N.A.          0.0388            0.0243         0.0268     34.5
LM (c)       30          16   441             N.A.          0.0368            0.0202         0.0254     72.7
FBP (n)      30      10,000  N.A.             N.A.          0.3965            0.1568         0.2069    456.9

Note: For abbreviations, see Table 2.
5.3 Example 3: A Distillation Process Identification. A distillation column that separates methanol-isopropanol binary mixtures is considered in this example. A mathematical model of the 26-tray column is used to generate data for training and testing neural networks. The basic equations describing the processes are similar to those in Luyben (1990, Chap. 3). The inputs are generated from gaussian distributions with modulated magnitude. A network N[20×20×1] is used as the identification model. The ith training pattern is given as x(i) = [yi−1, . . . , yi−10, ui, ui−1, . . . , ui−9]T, and the target output t(i) = yi, where yi and ui represent the ith distillate composition and reflux flow rate, respectively. Nine hundred, 300, and 300 patterns were used for training, validation, and testing, respectively. Thirty experimental runs were carried out by the FBP, the PTNT, the PT, and the LM, respectively, for different sets of arbitrary initial weights. The SWE was excluded from the comparison because its convergence performance was poorer than that of the FBP. The average initial error E(w(0)) over the 30 runs was 1860. The goal was to bring the training and validation error functions below ε = 0.04 and δ = 0.04, respectively. The averaged results over the 30 runs are tabulated in Table 5. Observations similar to those of Examples 1 and 2 still hold in this example. The PTNT costs less time than the LM and removed 252 weights in the network on average, with a maximum of 302. The PT still costs much less time than the LM. The memory complexity of the PT or the PTNT is also much less than half of the (441² × 4) bytes used by the LM.
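The identification patterns can be assembled the same way as in the sunspot sketch above, here with both output lags and exogenous input lags (again our illustration; names are hypothetical):

```python
import numpy as np

def make_narx_patterns(y, u, y_lags=10, u_lags=10):
    """Build x(i) = [y_{i-1},...,y_{i-10}, u_i, u_{i-1},...,u_{i-9}]^T, target y_i."""
    start = max(y_lags, u_lags - 1)
    X = np.array([
        np.concatenate([y[i - y_lags:i][::-1],           # y_{i-1} ... y_{i-10}
                        u[i - u_lags + 1:i + 1][::-1]])  # u_i ... u_{i-9}
        for i in range(start, len(y))
    ])
    t = np.asarray(y[start:])
    return X, t  # 20 inputs per pattern, matching the N[20x20x1] network
```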
6 Conclusions

We developed a systematic procedure for supervised learning with multiple objectives: achieving desired training, validation, and testing accuracies, keeping memory and computation costs low, and selecting networks of the right size to overcome overfitting. Unlike many other results in the literature, our approach addresses these issues in an integrated and systematic manner. In the implementation of the algorithm, each of the desired aspects of this integration is taken into consideration within each iteration. Our simulation results show that in addition to being able to address all the issues in one procedure, the new algorithm (the PTNT) outperforms the backpropagation algorithm and its variants in overall training time and accuracy. In terms of convergence properties, although the LM sometimes has an even higher chance of convergence to the desired accuracy in training, validation, and testing, the memory complexity of the PTNT is much less than half that of the LM. The overall training time for the PTNT is less than or at least comparable to that of the LM. Another variant of the PTNT, with parameter updating only (without network updating), outperforms the LM in both training time and memory complexity while achieving reasonable training, validation, and testing accuracy. Although the training algorithm was developed for neural network applications, it can also be extended to general nonlinear least-squares problems.

Acknowledgments

This work was supported in part by NSF under grant ECS-9553202, by EPRI under grant RP8015-03, and by Motorola.

References

Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice-Hall.
Dongarra, J. J., Bunch, J. R., Moler, C. B., & Stewart, G. W. (1979). LINPACK: Users' guide. Philadelphia: SIAM.
Hagan, M. T., & Menhaj, M. B. (1994). Training feedforward networks with the Levenberg-Marquardt algorithm. IEEE Trans. on Neural Networks, 5, 989–993.
Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In S. J. Hanson (Ed.), Advances in neural information processing systems 5 (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hinton, G. E. (1986). Learning distributed representations of concepts. In Proceedings of the 8th Annual Conference of the Cognitive Science Society (pp. 1–12). Hillsdale, NJ: Erlbaum.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal Brain Damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Levin, A. U., Leen, T. K., & Moody, J. E. (1994). Fast pruning using principal components. In G. Tesauro (Ed.), Advances in neural information processing systems 6 (pp. 35–42). San Mateo, CA: Morgan Kaufmann.
Luyben, W. L. (1990). Process modeling, simulation and control for chemical engineers. New York: McGraw-Hill.
Mozer, M. C., & Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 107–115). San Mateo, CA: Morgan Kaufmann.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Trans. on Neural Networks, 1, 4–27.
Saarinen, S., Bramley, R. B., & Cybenko, G. (1991). The numerical solution of neural network training problems (CRSD Rep. No. 1089). Urbana: Center for Supercomputing Research and Development, University of Illinois.
Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1992). Predicting sunspots and exchange rates with connectionist networks. In M. Casdagli & S. Eubank (Eds.), Nonlinear modeling and forecasting (pp. 395–432). Redwood City, CA: Addison-Wesley.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems 3 (pp. 857–882). San Mateo, CA: Morgan Kaufmann.
White, D. A., & Sofge, D. A. (Eds.) (1992). Handbook on intelligent control: Neural, fuzzy, and adaptive approaches. New York: Van Nostrand-Reinhold.

Received October 15, 1996; accepted October 2, 1997.
ARTICLE
Communicated by John Rinzel
Dynamics of Membrane Excitability Determine Interspike Interval Variability: A Link Between Spike Generation Mechanisms and Cortical Spike Train Statistics Boris S. Gutkin G. Bard Ermentrout Program in Neurobiology and Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
We propose a biophysical mechanism for the high interspike interval variability observed in cortical spike trains. The key lies in the nonlinear dynamics of cortical spike generation, which are consistent with type I membranes where saddle-node dynamics underlie excitability (Rinzel & Ermentrout, 1989). We present a canonical model for type I membranes, the θ-neuron. The θ-neuron is a phase model whose dynamics reflect salient features of type I membranes. This model generates spike trains with coefficient of variation (CV) above 0.6 when brought to firing by noisy inputs. This happens because the timing of spikes for a type I excitable cell is exquisitely sensitive to the amplitude of the suprathreshold stimulus pulses. A noisy input current, giving random amplitude “kicks” to the cell, evokes highly irregular firing across a wide range of firing rates; an intrinsically oscillating cell gives regular spike trains. We corroborate the results with simulations of the Morris-Lecar (M-L) neural model with random synaptic inputs: type I M-L yields high CVs. When this model is modified to have type II dynamics (periodicity arises via a Hopf bifurcation), however, it gives regular spike trains (CV below 0.3). Our results suggest that the high CV values such as those observed in cortical spike trains are an intrinsic characteristic of type I membranes driven to firing by “random” inputs. In contrast, neural oscillators or neurons exhibiting type II excitability should produce regular spike trains.
1 Introduction

The statistical nature of single neuron response has been a widely recognized feature of neural information processing. Historically, a number of preparations yielded spike trains with a large degree of variability (Burns & Webb, 1976; Kroner & Kaplan, 1993). Spike trains with high coefficients of variation (CV) have been reported for a wide range of stimulus-evoked activity of nonbursting pyramidal neurons in visual cortical areas of monkeys (Softky & Koch, 1993; Dean, 1981; McCormick, Connors, Lighthall, & Prince, 1985).

© 1998 Massachusetts Institute of Technology. Neural Computation 10, 1047–1065 (1998)
Such high in vivo interspike interval variability is contrasted with the highly reproducible in vitro responses of neurons to depolarizing current steps (Holt, Softky, Koch, & Douglas, 1996) and aperiodic stimuli (Mainen & Sejnowski, 1995), and with the robust spike timing for high-contrast visual stimuli in vivo (Reich, Victor, Knight, Ozaki, & Kaplan, 1997). Softky and Koch (1993) presented an analysis of cortical spike trains, showing that neither changes in the mean firing rate nor spike frequency adaptation could account for the high CVs. To date, several hypotheses have been proposed to explain these seemingly paradoxical findings. Classical stochastic models presented neurons as temporal integrators, with spike generation as a random walk with an absorbing boundary (Stein, 1965; Ricciardi, 1994). Numerous variants of these random walk or stochastic integrate-and-fire (IF) models strove to account for the nonstationary excitability of the neuron, nonlinear summation of the synaptic inputs, and multimodal output distributions (Smith, 1992; Wilbur & Rinzel, 1983). Softky and Koch (1993) suggested that the high CVs are inconsistent with temporal integration of randomly arriving excitatory postsynaptic potentials (EPSPs). Based on studies of compartmental models with Hodgkin-Huxley spike-generating currents, they proposed that several mechanisms, most notably active dendritic processes, amplify weak temporal correlations in the input and produce highly variable input currents at the soma. The cell acts as a coincidence detector and produces noisy output. An alternative hypothesis states that sufficiently variable input currents can be generated under a balance of excitatory and inhibitory inputs. The neuron then remains close to the threshold, and firing reflects the fine temporal fluctuations in the input current. Shadlen and Newsome (1994) found high CVs for the balanced Stein model, and Bell, Mainen, Tsodyks, and Sejnowski (1995) found similar results for a Hodgkin-Huxley neuron under specific parameter choices. Networks with connectivity that ensures the excitatory-inhibitory balance also produce highly variable firing in noisy IF neurons (Usher, Stemmler, Koch, & Olami, 1994) and chaotic threshold elements (van Vreeswijk & Sompolinsky, 1996). Recently Troyer and Miller (1997) showed that the input-output properties of the neuron can strongly influence the integration of noisy inputs. They modified the leaky IF neuron to include a partial postspike reset voltage. Using the reset as a free parameter, they fitted IF neurons to real pyramidal in vitro response frequency to input current (FI) curves. They found that such IF neurons (termed "high gain")¹ give high CVs, and do so without a balance of inputs. Neurons whose FI gain is much lower than seen in the data did not produce high CVs. The explanation was that the "high-gain" IF neuron hovers near a steady state (set by the reset voltage) and remains sensitive to temporal fluctuations in the random inputs. The "low-gain" neuron spends much more time depolarizing toward the threshold and thus damps out the input variability.
¹ Gain is defined as the slope of the FI curve.
The emphasis of previous studies (with the notable exception of Troyer & Miller, 1997) has been mainly on mechanisms that generate sufficiently variable input currents at the soma. By using either the simplest point neurons or a standard Hodgkin-Huxley soma, most of the authors referred to omitted from their analysis the nonlinear spike-generating dynamics and their role in spiking statistics. In this article, we assume that the input at the soma is variable. Focusing on the nonlinear dynamics of the spike-generating mechanism, we show how properties of neural membranes dominated by saddle-node dynamics (type I) yield the firing statistics observed in in vivo recordings and the in vitro input-output characteristics of cortical neurons. We argue that the key is the sensitive dependence of spike latency on the amplitude of the suprathreshold inputs evident in type I membranes. We review the salient characteristics of type I membranes and contrast these with type II membranes. In section 3 we present the canonical model for type I membranes, the θ-neuron. We show that high-CV spike trains arise for the θ-neuron in the excitable regime. The oscillating θ-neuron produces low-CV firing patterns. We follow with an example of a more detailed spiking neural model (Morris-Lecar, M-L) in section 5. Simulations of type I Morris-Lecar corroborate our θ-neuron findings. Type II M-L yields low-CV spike trains, suggesting that type II membranes do not produce highly variable spike trains.

2 Type I vs. Type II Neural Membrane Dynamics

Our major assumption is type I membrane excitability for the spike-generating soma. The general idea is to classify the cells by the dynamical structure that underlies the onset of autonomous periodic firing. A more complete discussion of this classification can be found in Rinzel and Ermentrout (1989). The classification, based on observations of squid axons, was proposed by Hodgkin (1948), who found arbitrarily low response frequencies and spike latencies for some axons (Type I) and a narrow range of responses with no spike delay for others (Type II). We use the M-L model to illustrate type I and type II characteristics. Observationally, a type I membrane is recognized by a continuous FI curve that shows oscillations arising with arbitrarily low frequencies (see Figure 1a). It shows that the type I cell is capable of a wide range of firing frequencies and that near the threshold, the input-output gain is infinite. This suggests that in the excitable regime, a number of dynamical behaviors are possible depending on the strength of the time-dependent stimulus input. In Figure 1b we see that the spike latency for a type I neural model (here M-L) is strongly sensitive to the magnitude of the suprathreshold stimulus.
[Figure 1 appears here: three panels, (a) and (c) plotting firing frequency (Hz) and (b) plotting spike delay (ms), each against input current I (µA/cm²); see the caption below.]
Figure 1: (a) Firing frequency to input current plot for type I membrane shows oscillations appearing with arbitrarily low frequencies. (b) The delay to spike in the type I model depends on the amplitude of the suprathreshold stimulus. (c) Firing frequency to input current plot for the type II shows oscillations arising with nonzero frequency.
Consider that temporal integration by the neural membrane acts to translate randomly timed synaptic inputs into a background DC bias (perhaps due to distant synapses) plus current with randomly varying amplitude (perhaps due to more proximal ones). Then a type I membrane, with its high spike latency sensitivity, converts the variability in the input current to variability in output spike timing. The phase plane for type I membrane helps us understand why this happens (see Figure 2a). The voltage nullcline intersects the activation nullcline, forming an attracting node and a saddle that acts as a threshold for the spike generation. A subthreshold stimulus would not evoke much response. Any stimulus pushing the voltage past
the saddle node will result in a spike of constant shape, but with a varying delay. This delay results from the fact that the voltage trajectory near the threshold hugs the stable manifold of the saddle, moving rather slowly away from the threshold. Also, near the threshold, the membrane is most sensitive to small inputs. As the voltage increases, the velocity of motion increases, and the membrane becomes insensitive to inputs because the active conductances dominate the dynamics. A fast spike is produced, followed by a refractory period and repolarization to the rest state. The important notion is the nonuniformity of motion around the phase plane. This is reminiscent of cortical neurons that, given a suprathreshold pulse stimulus, will slowly depolarize and then produce a fast spike.² We also note that a type I model spends most of its time near the steady state, just like the "high-gain" neuron of Troyer and Miller (1997).

² This is thought to be due to the A-current.

We can change the excitability of the cell by increasing the bias current, which lifts the voltage nullcline. This lowers the threshold and reduces the region in the phase space where the membrane is most sensitive to input perturbations. With still more positive bias, the rest state disappears, and a limit cycle is left behind. This limit cycle is of constant amplitude but with a period dependent on the bias.

In contrast, type II membranes are characterized by discontinuous FI curves, with the oscillations arising with a nonzero frequency. These oscillations are due to a subcritical Hopf bifurcation. The response frequency range is narrow and largely independent of the bias (see Figure 1c). There is also no true threshold for the appearance of spikes, which are not an all-or-nothing phenomenon but have an amplitude that can depend on the size of the pulse stimulus (see Figure 2c). The delay to spike is not sensitive to the size of the suprathreshold stimulus, and the long prespike depolarization is absent. Several widely used cortical models are of type I—for example, Traub's model (Traub & Miles, 1991) and the Bower model (Wilson & Bower, 1989). Examples of type II membranes include the standard Hodgkin-Huxley model (Hodgkin & Huxley, 1952) and the FitzHugh-Nagumo reduced model (FitzHugh, 1961). The M-L model can be put into either the type I or the type II regime.

3 The θ-Neuron: A Canonical Model for Type I Membranes

We present a reduced neural phase model (the θ-neuron) capable of reproducing spike-train statistics at a wide range of mean firing rates. The θ-neuron is a canonical model for type I membranes resulting from a formal mathematical reduction of multidimensional neural models exhibiting type I dynamics. That is, every neural model with type I dynamics can be reduced to the θ-neuron.
Figure 2: Phase plane for type I and type II neural membranes. Here we use the Morris-Lecar model as an example, with w being the activation variable and parameters set as in Rinzel and Ermentrout (1989). (a) Phase plane for a type I membrane in the excitable regime. Note that the stimulus-induced processions around the phase plane are of constant size and profile. Such processions or periodic solutions are said to live on an invariant circle. However, the rise time of spikes depends on stimulus amplitude. Here R is the attracting rest state, T is the saddle, and U is an unstable steady state. (b) Phase plane for the type I membrane in the oscillatory regime. Once again the spikes are of constant amplitude and live on the invariant circle. The voltage nullcline has been lifted by the added constant bias current. (c) Phase plane for the type II membrane in the excitable regime. Note that the spikes are of variable amplitude.
The parameters of the θ-neuron can be quantitatively related to physiologically observable quantities, and the dynamics reflect the nonlinear properties of the neuronal membrane. We describe the neuron by a phase variable θ. This phase represents the location of the voltage and activation state vector along the spike trajectory (see Figure 3a). The dynamics of the phase under noisy input are governed
Figure 3: (a) Phase evolution on a circle and its analog in the state of the membrane voltage. Note that the spike occupies a small region near π, and the model spends most of its time near the threshold. A suprathreshold stimulus pushes θ past the threshold and into the excited region. Here the regenerative dynamics that summarize the active conductances carry the phase through the spike. (b) A representative spike train for the phase model excited by a random stimulus. Here we plot not θ but ν = 1 − cos(θ), which makes the spikes apparent.
by the following stochastic (Langevin) differential equation:

dθ/dτ = (1 − cos θ) + (1 + cos θ)(β + σWτ), θ ∈ [0, 2π], (3.1)
for white noise input Wτ with intensity σ.³ Here β is the bias parameter, which controls the excitability of the cell. The critical value is at β = 0. In the excitable regime, where β is negative, the model has an attracting rest state and a repelling threshold.⁴ In case of a subthreshold stimulus, the phase returns passively to rest, while a suprathreshold pulse causes the phase to rotate around the circle and produce a spike. In fact, if we plot the time evolution of ν = 1 − cos(θ), we can readily see the fast spike (see Figure 3b).

3.1 Reduction to the θ-Neuron. We present an outline of the mathematical reduction process; a more detailed description has been published in Ermentrout (1996b) and a complete mathematical treatment in Hoppensteadt and Izhikevich (1997). The reduction relies on perturbation methods for the saddle-node bifurcation inherent in type I membranes.

³ Wτ is constructed by generating gaussian random deviates with variance proportional to dτ; the integral of Wτ gives a Wiener process.
⁴ These are given by θrest = −arccos[(1 + β)/(1 − β)] and θthreshold = arccos[(1 + β)/(1 − β)], respectively.
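The expressions in footnote 4 follow in one step from setting the drift of equation 3.1 to zero with the noise off (our elaboration, not part of the original text):

```latex
\[
(1-\cos\theta) + \beta(1+\cos\theta) = 0
\;\Longrightarrow\;
\cos\theta = \frac{1+\beta}{1-\beta},
\qquad
\theta_{\mathrm{rest}} = -\arccos\frac{1+\beta}{1-\beta},
\quad
\theta_{\mathrm{threshold}} = +\arccos\frac{1+\beta}{1-\beta},
\]
```

valid for −1 < β < 0; the two roots coalesce at θ = 0 when β = 0, which is exactly the saddle-node bifurcation separating the excitable and oscillatory regimes.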
Heuristically we can describe the behavior by a phase variable because the oscillations in the type I neuron are of invariant amplitude (i.e., the cell produces spikes of constant shape). Let us consider a generic conductance model:

dV/dt = F0(V) + ε²N(V). (3.2)
Here V is the vector of dynamical variables of the model (e.g., membrane voltage, activation variables), F0(V) is the nonlinear function that includes the membrane properties of the conductance model, and N(V) is the input; ε is small. We assume that when ε = 0, there exists an invariant circle around a single fixed point, which persists on both sides of the bifurcation. Then let the saddle node appear at the critical value V*. We linearize F0(V) around that value and note that the Jacobian of F0(V) at V* has a zero eigenvalue. Letting V = V* + εz e, where e is the eigenvector corresponding to the zero eigenvalue, the dynamics of equation 3.2 near the bifurcation are governed by

dz/dt = ε(η + qz²) + h.o.t., (3.3)
which is the normal form for saddle-node dynamical systems. We now make the change of coordinates τ = εt and z = tan(θ/2), and setting q to unity without loss of generality, we arrive at

dθ/dτ = (1 − cos θ) + η(1 + cos θ), θ ∈ [0, 2π], θ(0) = θ(2π), (3.4)
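The substitution can be checked directly (our elaboration): with z = tan(θ/2),

```latex
\[
\frac{d\theta}{d\tau}
= \frac{dz/d\tau}{\tfrac{1}{2}\sec^{2}(\theta/2)}
= 2\cos^{2}(\theta/2)\,\bigl(\eta + \tan^{2}(\theta/2)\bigr)
= \eta\,(1+\cos\theta) + (1-\cos\theta),
\]
```

using 2cos²(θ/2) = 1 + cos θ and 2sin²(θ/2) = 1 − cos θ, so the normal form dz/dτ = η + z² maps exactly onto equation 3.4.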
where η is proportional to the inputs in the original model.⁵ The reduction method determines how we can include a noise term to model the influence of a large number of positive and negative inputs of random strength arriving at random times. The additive input N(V) in the full model (in equation 3.2) is reduced to η in the θ-neuron. Letting η = (β + σWτ), where β is the bias and Wτ models white noise, we arrive at the appropriate model for the random inputs. The o.d.e. for the phase in equation 3.4 then becomes the Langevin d.e.:

dθ/dτ = (1 − cos θ) + (1 + cos θ)(β + σWτ), θ ∈ [0, 2π], θ(0) = θ(2π). (3.5)

Note that this noise model is based solely on the mathematics of the reduction and reflects several important characteristics of how a neuron responds to inputs. The neuron is most sensitive to its inputs when at rest and most insensitive during the spike—when voltage is dominated by spike-generating currents—and the refractory periods that follow. The inputs in the θ-neuron are scaled by the factor (1 + cos θ) and have the most effect on the phase θ when the cell is close to the resting potential, and little or no effect when the cell is traversing through the spike.

⁵ In general, η and q can depend on time and phase and can be calculated directly from the original neural model; see Ermentrout (1996b).
Figure 4: Saddle-node dynamics on an invariant circle. The upper trace shows the location of critical points on the invariant circle, the middle trace shows the behavior of the phase variable, and the lower trace is the trace of (1 − cos(θ)) showing the spikes. The horizontal axis in the lower two traces gives time in ms. (a) Excitable regime with β = −0.3; the spike is evoked by a suprathreshold pulse stimulus marked by the triangle. (b) Bifurcation at the saddle-node point, β = 0. The homoclinic trajectory (one that joins a critical point to itself) has an infinite period. (c) Oscillatory regime with β = 0.3. Autonomous periodic processions in the phase variable and spikes in (1 − cos(θ)) are present.
In this work, we present results for white noise inputs, although similar arguments can lead to an appropriate model for Poisson-timed excitatory and inhibitory inputs.⁶ The deterministic behavior of the model has been described in detail in Ermentrout (1996b). The main point is that the θ-neuron reflects all the salient characteristics of the dynamics of the original full model. Since the θ-neuron is a canonical model for type I neural models, its dynamics reflect the saddle-node-based spike-generating behavior of any type I neural model, including the spike latency sensitivity to stimulus amplitude. The θ-neuron exhibits both excitable and tonically oscillating regimes, depending on the bias β. Figure 4 summarizes the behavior of the model for different bias values.
⁶ η = (β + ge dNex + gi dNin), where dNex, dNin are unit events with arrival times given by Poisson processes with intensities λex and λin, respectively. The amplitudes of EPSPs and IPSPs are given by ge and gi. We should note that the bias + white noise input model would not work in the limit of low EPSP amplitudes, long EPSP durations, and high arrival rates, where the net effect would be a mean DC current. However, for this work, we start by assuming that the inputs to the soma carry a significant degree of variability.
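For β > 0, the deterministic part of equation 3.5 can be integrated in closed form, which makes the square-root frequency law of type I membranes explicit (our elaboration, using the same substitution z = tan(θ/2) as in section 3.1):

```latex
\[
T = \int_{0}^{2\pi}\frac{d\theta}{(1-\cos\theta)+\beta(1+\cos\theta)}
  = \int_{-\infty}^{\infty}\frac{dz}{\beta + z^{2}}
  = \frac{\pi}{\sqrt{\beta}},
\qquad
f = \frac{1}{T} = \frac{\sqrt{\beta}}{\pi}.
\]
```

The firing frequency therefore rises from zero like √β, consistent with the arbitrarily low frequencies of Figure 1a and with the 3.14 ms period quoted for β = 1 in Figure 5.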
4 Results of Numerical Experiments for the θ-Neuron

To study the stochastic dynamics of model 3.1, we carried out numerical simulations using the XPPAUT differential equation exploration package (Ermentrout, 1996a). The equations were integrated on a circle with period 2π using a stochastic version of the Euler method with a time step of 0.05 ms. The noise was generated by XPP using a standard algorithm for the construction of a Wiener process (see footnote 3 and Kloeden & Platen, 1994). The voltage time series was converted into spike-train data, for which we computed ISI CVs and histograms. The interspike interval (ISI) data were examined to ensure stationarity.

4.1 The Excitable Regime: Noise-Induced High-CV Firing. In the excitable regime (β below 0), firing is induced purely by the noise inputs. ISI histograms of the noise-driven excitable θ-neuron show a characteristic peak. That is, for a given β, the noise induces a characteristic mean ISI. The mean ISI is controlled by both the constant bias and the amplitude of the noise process. As β becomes more positive or σ increases, the peak in the ISI histogram moves to the left (see Figures 5a and 5b), increasing the mean firing rate. The noise and bias have similar effects on the firing rate. However, they have differential effects on the ISI variability. Increasing the amplitude of the noise inputs while holding the bias constant has comparatively little effect on the scale of the ISI histogram. For β = −0.3, the high-noise histogram has qualitatively the same mass in the tail as the low-noise one (see Figure 5a). Consequently, when the firing rate is controlled by the noise intensity, the CV remains high and does so across a wide range of mean ISIs, with a slight downward trend toward the shorter ISIs (see Figure 6c). We propose the following explanation. With a constant negative bias, the distance to the threshold and the region where the motion is slow are held constant. The firing rate then depends on the mean frequency of random crossings of the threshold. Because of the spike latency characteristics of type I membranes, the variance of the ISIs depends not only on the threshold crossings but also on the amplitudes of the suprathreshold inputs. As the variance of the noise input goes up, the variability in the amplitude of the suprathreshold shocks increases, thereby driving the variability of the spike latencies up. Thus, the CV remains largely invariant for a wide range of firing rates. As the firing rate becomes very large, the refractory period exerts a regularizing influence, and the CV begins to decrease. On the other hand, when we hold the noise amplitude fixed, increasing β gives spike trains with lower CVs and ISI histograms with progressively shorter tails (see Figure 5b). We suggest that this happens because as the rest and threshold approach each other, the active spike currents (here the intrinsic regenerative behavior of the phase) are much easier to activate. These currents then drown out the variability in the inputs. To put it another way, the more excitable cell is much less dominated by the slow dynamics near the rest state, and the range of inputs that would cause highly variable spikes is decreased.
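A minimal sketch of this kind of simulation, using a stochastic Euler (Euler-Maruyama) step for equation 3.5 and computing the ISI CV; this is our illustration, not the authors' XPPAUT setup, and the spike criterion (θ crossing π from below) is our choice:

```python
import numpy as np

def theta_neuron_cv(beta=-0.3, sigma=1.0, dt=0.05, t_max=5.0e4, seed=0):
    """Euler-Maruyama integration of equation 3.5 and the resulting ISI CV.

    A spike is counted whenever theta crosses pi from below; beta < 0 is the
    excitable regime, beta > 0 the intrinsic oscillator.
    """
    rng = np.random.default_rng(seed)
    theta = 0.0
    spike_times = []
    for k in range(int(t_max / dt)):
        dW = rng.normal(0.0, np.sqrt(dt))  # Wiener increment, variance dt
        dtheta = ((1.0 - np.cos(theta)) * dt
                  + (1.0 + np.cos(theta)) * (beta * dt + sigma * dW))
        new_theta = theta + dtheta
        if theta < np.pi <= new_theta:     # phase passes the spike region near pi
            spike_times.append(k * dt)
        theta = new_theta % (2.0 * np.pi)
    isi = np.diff(spike_times)
    return isi.std() / isi.mean() if isi.size > 1 else float("nan")

# theta_neuron_cv(-0.3, 1.0) gives a CV well above 0.6 (noise-driven firing),
# while theta_neuron_cv(0.3, 0.1) gives a much more regular spike train.
```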
Figure 5: Normalized histograms for the θ-neuron, with ISIs plotted on a log scale. (a) β = −0.3 and noise amplitude varied. Note that the left tail of the histogram appears to shorten as σ increases, partially due to the log scale and also because, with higher input variance, the mode moves closer to the smallest possible ISI. (b) Noise amplitude = 1 and β varied. The mean ISI of the deterministic oscillations for the β = 1 case is 3.14 ms.
4.2 The Oscillatory Regime: Noise-Modulated Periodic Firing. At supercritical bias values (β above 0), the model fires periodically even without the noisy inputs. In this regime, the noise modulates the mean frequency of firing, while the firing is comparatively regular, with low CVs and short-tailed ISI histograms. Once again our explanation holds: the dynamics of the oscillating cell are dominated by the active currents, and the noise inputs exert a comparatively weak influence on the firing behavior.

5 Numerical Simulation of Stochastic Morris-Lecar

To corroborate our findings for the θ-neuron, we studied the M-L model. The M-L has the advantage of being a conductance-based model that can be put into both type I and type II regimes depending on the chosen parameters.
Figure 6: CV results for the θ-model. (a) CV remains high across a wide range of noise parameters. (b) CV decreases linearly with β. (c) CV remains high across a wide range of firing rates when these are controlled by noise. Here β = −1 and σ is varied.
We examined the hypothesis that for equivalent excitability, the type I model yields high CVs under a variety of random input conditions, while the type II model gives a regular firing pattern. For our simulations, both the type I and type II models were parameterized by the deviation from the critical input bias current, (Ibias − Icritical), to fix the excitability conditions.⁷ Equations and exact parameter values for both models are given in the appendix. For both models, a negative deviation indicates the excitable regime, while a positive one corresponds to an intrinsic oscillator. The inputs consist of instantaneous excitatory and inhibitory postsynaptic currents (EPSCs and IPSCs, respectively) with Poisson-distributed arrival times.

⁷ For type I Morris-Lecar, Icritical = 40 mV; for type II, Icritical = 100 mV.
The amplitudes of EPSCs were set to ensure integration of many inputs to generate a spike. Both inputs were modeled as simple exponentials with time constants of 1 ms. The IPSC arrival rate was kept constant for all simulations, while the rate of EPSCs was varied.⁸ The random arrival times were generated by using two-state Markov chains with transition probabilities set to give Poisson arrival processes for inhibition and excitation with the desired means (this feature is built into XPP).

⁸ The ratio of inhibition to excitation as defined in Troyer and Miller (1997), R = (gi|V − Vi| · ratei)/(ge|V − Ve| · ratee), varied from 0.1 to 2.

5.1 Morris-Lecar in the Type I Regime. Type I Morris-Lecar shows high-CV behavior. The type I model yields a wide range of firing rates for different excitability and input parameters. There is a strong dependence of the CV on the excitability of the cell (see Figure 7, upper trace). In the excitable regime, CV values close to unity are observed, with the CV clearly decreasing as the model passes into the oscillatory regime; yet the CV is always above 0.3. In the oscillatory regime, the inputs are dominated by the intrinsic dynamics of the oscillating membrane, and the mean ISI reflects the intrinsic frequency of oscillation. On the other hand, for a model in the excitable regime, the CV value is comparatively insensitive to the rate of arrival of EPSPs. Consequently the CV remains consistently high for a wide range of firing rates when these are controlled by the variability of inputs. The CVs for type I spike trains in the M-L model are also largely independent of the inhibition-to-excitation ratio.

5.2 Morris-Lecar in the Type II Regime. We now modify the M-L model to reflect type II excitability. Compared to type I, this model in the excitable regime exhibits a narrow range of ISIs, which are in fact close to the period of oscillations at criticality. The CV is not sensitive to the bias current (see Figure 7, lower trace), and at no value of the EPSP rate does the model exhibit CVs above 0.5. In fact, the only way to achieve a high CV value for this model is to resize the amplitude of the EPSP so that the threshold is near one event, with the model doing no integration.

6 Summary

In this work we asked whether dynamics of spike generation consistent with cortical neuron data can account for the statistics of cortical spike trains. Using a canonical model, we show that type I cells driven to firing by noisy inputs give highly variable spike trains. On the other hand, intrinsic oscillators have a much more regular spiking behavior. Furthermore, the type I M-L results clearly show high CV levels, with excitation dominating the random inputs, while type II models give regular firing.
Figure 7: CV results for the Morris-Lecar models. The upper trace shows type I behavior, and the lower trace shows type II. The IPSC rate was 0.3 for all simulations; the EPSC rate was varied.
Our results are in contrast to the suggestions by Softky and Koch (1993), which imply a rather specific synaptic organization of inputs and a restrictive parameter regime for the dendritic spikes, or to the Bell et al. (1995) model, which requires not only a balance of inhibitory and excitatory inputs but also a very narrow subset of the parameter space. We are also able to reproduce neural variability without reliance on network dynamics. We provide a formal mathematical method to derive the θ-neuron from conductance-based models and thus can give specific physiological meaning to the parameters. High-CV spike trains are observed for the excitable θ-neuron and type I Morris-Lecar. The saddle-node bifurcation characteristic of such membranes underlies the observed results. In particular, we suspect that the CVs are high because the noise randomly "samples" long-period orbits, leading to spike latencies that are strongly dependent on the size of the time-dependent stimulus. Thus a stimulus that provides suprathreshold shocks of random amplitudes will evoke strongly variable ISIs in type I neurons. In contrast are membranes with type II dynamics, where the oscillations arise with a nonzero frequency through a Hopf bifurcation. In such a system there are no long-period orbits for the noise to sample, and the spike latency for suprathreshold stimuli is bounded above. We observe generically low
CV values for the type II M-L even with inhibition and excitation balanced in the random inputs. We expect that type II neurons in general do not exhibit high CVs except under some very special conditions that ensure bistability (Bell et al., 1995). We also observe that type I neurons that are driven to firing by random inputs exhibit high CV values generically, without a balance of excitation and inhibition. We suggest that the high "physiological gain" condition proposed by Troyer and Miller (1997) is a natural consequence of type I membrane dynamics. In fact, for type I models, the gain is infinite near the threshold for the onset of oscillations. The key, we suggest, is not the gain as such, but the range of input amplitudes that leads to significantly variant spike latencies. Furthermore, just like the "high-gain" IF neuron, type I models spend most of their time near a steady state and not depolarizing toward the threshold. Essentially, the dynamics of saddle-node-induced firing imply a highly nonuniform motion for the voltage-activation trajectory, with very slow motion near the rest state.⁹ The cell then spends most of its time near rest and is pushed to firing by the fast swings in the input current. In this way, the spike-generating mechanism acts as a de facto high-pass filter. It is particularly interesting to note that Hodgkin, in his 1948 paper, observed that type I spikers (axons in his case) produced a much more variable firing pattern than the type II spikers. Arbitrary spike delay latencies were also reported in the same work. At the same time, the θ-neuron and the M-L are not coincidence detectors, in the sense that both code the mean input rate with a mean output rate. This means that high CVs cannot be used as a litmus test to solve the rate-versus-coincidence coding dilemma. However, the delay-to-spike characteristics of type I neurons suggest that cortical neurons can act as amplitude-to-spike-latency converters and perhaps pass information about the temporal structure of the stimulus not only in the firing rate but also in the relative timing of individual spikes. In this way, the spike train as a whole would look very noisy, yet the information about the stimulus would be encoded quite precisely by the timing of spikes. In order for this coding mechanism to work, the cells must respond robustly to aperiodic inputs, and in fact data from Mainen and Sejnowski (1995) show highly reproducible responses to repeated noiselike stimuli in vitro. Robust spike timing despite noisy firing was also recently reported by Reich et al. (1997) in visually stimulated in vivo cat retinal ganglion cells and the lateral geniculate nucleus for high-contrast stimuli, while low-contrast stimuli seemed to lead to less robust responses.
⁹ Such a "slowdown" near rest can result, in a real neuron or a biophysical model, from a sodium current that is slow to activate at the beginning of spike generation, with a steeply increasing voltage-dependent time constant. Alternatively, the same effect can be generated by the interaction of a slow potassium current that is partially activated at rest (e.g., Im) and a fast sodium spike-generating current, or by an inactivating potassium current (e.g., IA).
We found behavior similar to that reported in both studies for the θ-neuron in our preliminary simulations. These results will be published elsewhere. The θ-neuron model, like the integrate-and-fire model, is a one-dimensional caricature of a "real" neuron. Both models have arbitrarily low firing rates as the input current is lowered to the threshold. However, the way in which the frequency goes to zero is like 1/|log(I − I*)| in the integrate-and-fire model rather than the square-root law that type I models and the θ-model obey. Similarly, the slope at criticality is infinite in both cases. The main difference lies in the latency to firing due to a suprathreshold stimulus. There is no notion of latency to firing in an integrate-and-fire model; either a stimulus is above threshold, in which case firing is instantaneous, or it is below threshold and no firing occurs. In the θ-model and in type I membranes, the latency is due to the saddle point. This enables cells to respond at arbitrarily long latencies after receiving a suprathreshold stimulus. This in fact was noted by Hodgkin (1948). Our work suggests that cells with high-CV spike trains are in the excitable regime, as opposed to being intrinsic oscillators. This may mean that cortical neurons are not intrinsic oscillators but are driven to firing by their inputs. Then coherent oscillations such as those observed in cortical networks depend critically on the presence and characteristics of the afferent and efferent inputs. One interesting finding in this study is the dependence of the CV on the excitability of the cells. The excitability in the simple models we present is set by the DC bias. In in vitro experiments, the bias current is the step current applied by the experimenter, but in in vivo neurons, any slow depolarizing or hyperpolarizing currents or effects of inputs impinging on distal dendrites and filtered by the dendritic tree can be viewed as bias currents. Some suggestions for changing bias currents include NMDA activation and muscarinic modulation of the M-current; our work then implies an interesting effect of slow modulation—that the slow modulatory synaptic currents can upregulate the spike rate by making the cell easier to fire and reducing accommodation, and also change the overall variability of the spike train. The effects of changing excitability on spike-train variability can be studied experimentally in pyramidal neurons by manipulating the slow modulatory currents (e.g., by applying NMDA agonists or muscarinic antagonists). Finally, since our models become noisier as the input mean moves away from the repetitive firing threshold, the mechanism for generating highly noisy firing is not a "balancing" of inhibition and excitation. As we have pointed out before, the key is the high-pass filtering property of the spike-generating mechanism when the spike is caused by a fast procession in the random input current. On the other hand, in an intrinsically oscillating cell, the inputs are dominated by the intrinsic currents that generate the periodic firing, with the random inputs having less influence on the statistics of the spike train. Thus, the further the cell is from being an intrinsic oscillator, the
more it is driven by the synaptic currents (relative to the intrinsic currents) and the noisier is its output. In summary, by focusing on the dynamics of spike generation by a soma receiving random inputs, we propose that the dynamical mechanism of spike generation has a strong effect on the stochastic properties of neuronal activity. The dynamical mechanism in our work implies that spike generation in cortical pyramidal cells is consistent with saddle-node dynamics. This suggestion can be tested in in vitro experiments that examine spike latency curves and experimentally construct phase response curves for cortical pyramidal neurons. In fact, we already know that arbitrarily low firing rates are observed experimentally. The saddle-node spike-generating dynamics can be caused by a number of biophysical mechanisms, and further experiments should be designed to pinpoint the precise combination of conductances that forms the substrate in a particular class of neurons. Furthermore, we expect that cells that tonically oscillate (due to slow depolarization) and are given a noisy current input would not show high CVs. At the same time, experiments where a noisy current is injected into cells exhibiting type II dynamics should corroborate the idea that such spike-generating dynamics cannot produce high, irregular firing except in a rather specific parameter regime (e.g., where the cell is bistable).

Appendix: Morris-Lecar Equations

The Morris-Lecar equations that we used are based on the model that appeared in Rinzel and Ermentrout (1989). They have the form:

C dV/dt = −gCa m∞(V)(V − VCa) − gK w(V − VK) − gL(V − VL) + I
dw/dt = φ(w∞(V) − w)/τw(V)
m∞(V) = 0.5(1 + tanh((V − V1)/V2))
w∞(V) = 0.5(1 + tanh((V − V3)/V4))
τw(V) = 1/cosh((V − V3)/V4)

Standard values for type I membranes are VK = −80 mV, VL = −60 mV, VCa = 120 mV, C = 20 µF/cm², gL = 2 µS/cm², gK = 8 µS/cm², V1 = −1.2 mV, V2 = 18 mV, V3 = 12 mV, V4 = 17.4 mV, φ = 0.067, and gCa = 4.0 µS/cm². For the type II membrane simulations, parameters are the same except V3 = 2 mV, V4 = 30 mV, φ = 0.04, and gCa = 4.4 µS/cm².
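For concreteness, the appendix model transcribed into code, with the type I parameter set above (our illustration; pair it with any ODE stepper):

```python
import numpy as np

# Type I Morris-Lecar parameters from the appendix.
P = dict(C=20.0, gCa=4.0, gK=8.0, gL=2.0, VCa=120.0, VK=-80.0, VL=-60.0,
         V1=-1.2, V2=18.0, V3=12.0, V4=17.4, phi=0.067)

def morris_lecar_rhs(V, w, I, p=P):
    """Right-hand side (dV/dt, dw/dt) of the Morris-Lecar equations."""
    m_inf = 0.5 * (1 + np.tanh((V - p['V1']) / p['V2']))
    w_inf = 0.5 * (1 + np.tanh((V - p['V3']) / p['V4']))
    tau_w = 1.0 / np.cosh((V - p['V3']) / p['V4'])
    dV = (-p['gCa'] * m_inf * (V - p['VCa'])
          - p['gK'] * w * (V - p['VK'])
          - p['gL'] * (V - p['VL']) + I) / p['C']
    dw = p['phi'] * (w_inf - w) / tau_w
    return dV, dw
```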
Acknowledgments

We thank Dr. Satish Iyengar and Dr. John Horn for fruitful discussions and Dr. Todd Troyer for valuable comments on earlier versions of this article. Work on this project was supported by NSF and NIMH.
References

Bell, A., Mainen, Z., Tsodyks, M., & Sejnowski, T. (1995). "Balancing" of conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego: Institute for Neural Computation, University of California at San Diego.
Burns, B. D., & Webb, A. C. (1976). The spontaneous activity of neurons in the cat's visual cortex. Proc. R. Soc. London (Biol.), 194, 211–223.
Dean, A. (1981). The variability of discharge of simple cells in the cat striate cortex. Exp. Brain Res., 44, 437–440.
Ermentrout, G. B. (1996a). XPPAUT1.8—The differential equations tool. Available at www.pitt.edu/bardware.
Ermentrout, G. B. (1996b). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8(5), 979–1001.
FitzHugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Hodgkin, A. L. (1948). The local changes associated with repetitive action in a non-medullated axon. J. Physiol. (London), 107, 165–181.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). A comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75(5), 1806–1814.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Kloeden, P. E., & Platen, E. (1994). Numerical solutions to stochastic differential equations. New York: Springer-Verlag.
Kroner, L., & Kaplan, E. (1993). Response variability in retinal ganglion cells in primates. Proc. Natl. Acad. Sci. USA, 86, 8128–8130.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of the neocortex. J. Neurophysiol., 54, 782–805.
Reich, D. S., Victor, J. D., Knight, B. W., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. J. Neurophysiol., 77, 2836–2841.
Ricciardi, L. M. (1994). Diffusion models of single neurons. In F. Ventriglia (Ed.), Neural modeling and neural networks (pp. 129–162). Oxford: Pergamon Press.
Rinzel, J., & Ermentrout, G. B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Op. Neurobiol., 4, 569–579.
Smith, C. E. (1992). A heuristic approach to stochastic models of single neurons.
In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 561–588). San Diego: Academic Press.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–355.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. New York: Cambridge University Press.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9(5), 971–985.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Computation, 6(5), 795–836.
van Vreeswijk, C., & Sompolinsky, H. (1996). Irregular spiking in cortex through inhibition/excitation balance. Poster presented at the Computational Neural Systems Conference, Cambridge, MA.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distribution. J. Theo. Biol., 105, 345–368.
Wilson, M. A., & Bower, J. M. (1989). The simulation of large-scale neural networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.

Received May 9, 1997; accepted November 14, 1997.
NOTE
Communicated by Terrence Sejnowski
Correction to Proof That Recurrent Neural Networks Can Robustly Recognize Only Regular Languages Mike Casey Department of Psychology, Rutgers University, Newark, NJ 07102 U.S.A.
Our earlier article, “The Dynamics of Discrete-Time Computation, with Application to Recurrent Neural Networks and Finite State Machine Extraction” (Casey, 1996), contains a corollary that shows that finite dimensional recurrent neural networks with noise in their state variables that perform algorithmic computations can perform only finite state machine computations. The proof of the corollary is technically incorrect. The problem resulted from the fact that the proof of the theorem on which the corollary is based was more general than the statement of the theorem, and it was the contents of the proof rather than the statement that were used to prove the corollary. In this note, we state the theorem in the necessary generality and then give the corrected proof of the corollary.
We begin by stating the corollary and its proof as they appeared in Casey (1996), where the corollary was derived from theorem 3.1:

Corollary 3.1. A finite dimensional RNN can robustly perform only finite state machine computations.

Proof. The proof of the corollary is simply to notice that by the compactness of the phase space it can contain only a finite number of disjoint sets with nonempty interior.

To correct this proof, the property of having nonempty interior should be replaced with "containing a ball of size ε," where ε > 0 is the same fixed constant (reflecting the amount of noise in the system) for all of the sets. This will be corrected in the more general statement of the theorem, and wherever nonempty interior appeared in the original proof of theorem 3.1, it should be replaced with "contains a ball of size ε." The original theorem was stated for algorithmic computations that can be performed by finite state machines, but it makes no sense to define the theorem so that it applies only to finite state computations and then prove that it implies that only finite state computations are possible.

© 1998 Massachusetts Institute of Technology. Neural Computation 10, 1067–1069 (1998)
In our more general theorem, the minimal state machine, MC, will mean the automaton that has a single state for each equivalence class of strings defined by the following equivalence relation: two strings, R and S, are equivalent if and only if for all strings T, RT is in LC if and only if ST is in LC. This is what "minimal" means for the finite case as well, and we exploited this fact in our original proof. The original proof of the theorem requires no modification to accommodate this new definition. One last point is that theorem 3.1 was stated for the special case of RNNs with the unit N-cube as their range, and uses this as the state-space of the underlying noisy dynamical system. But this, taken with the definition of noise given in the article, leaves ambiguous how noise perturbs states initially mapped to within ε of the boundary. One natural solution to this problem is to define the state-space to be the unit N-cube plus all points within ε of the unit N-cube (this is the mathematical equivalent of noting that it is possible for a system to be perturbed out of its natural range by noise). When we say state-space in the more general theorem below, we will mean this. We will state theorem 3.1 in its more general form. The proof is not provided, since the modifications to the original are only syntactic.

Computational-Dynamical Isomorphism Theorem (Theorem 3.1 from Casey, 1996). The state-space of a recurrent neural network (RNN) robustly performing an algorithmic computation must have sets Qj corresponding to the states qj of the minimal state machine, MC, that performs that algorithmic computation. The Qj are mutually disjoint; each contains a ball of size ε and can be taken to be all of the points x in the state-space of the RNN such that if the RNN is initialized with x, then for any given input string, the RNN will produce the same output as the minimal state machine initialized to state qj.

Using the computational-dynamical isomorphism theorem, we have the following proof that finite-dimensional RNNs can robustly recognize only regular languages.

Proof. The proof of the corollary is simply to notice that by the compactness of the state-space, it can contain only a finite number of disjoint balls of size ε.

Recently a cleaner treatment of this result has been given as a special case of a more general theorem proved by Maass and Orponen (1997). Their proof is essentially the same as ours, which is essentially the same as the Myhill-Nerode theorem, in that they all argue that there must be system states corresponding to the equivalence classes of strings as defined. Our argument was changed from the original to allow for a more general class
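Reading "ball of size ε" as a ball of radius ε, the finiteness claim can be made quantitative with a standard volume bound (our addition, not part of the original note): if the state-space is contained in the enlarged cube [−ε, 1 + ε]^N, then

```latex
\[
\#\{Q_j\} \;\le\;
\frac{\operatorname{vol}\bigl([-\epsilon,\,1+\epsilon]^{N}\bigr)}{v_{N}\,\epsilon^{N}}
= \frac{(1+2\epsilon)^{N}}{v_{N}\,\epsilon^{N}} \;<\; \infty,
\]
```

where v_N is the volume of the unit N-ball; finitely many Myhill-Nerode classes means the recognized language is regular.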
of systems, while Maass and Orponen's argument was modified further to allow for a more general notion of computation. We apologize for the lack of precision in our definitions and hope that this did not cause any genuine confusion.

Acknowledgments

I thank Wolfgang Maass and an anonymous reviewer for pointing out these errors. I would like to acknowledge that some time ago, Cris Moore circulated a preprint that gave a very nice treatment of this result by using a proof structure closer to the one used in Maass and Orponen (1997).

References

Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Maass, W., & Orponen, P. (1997). On the effect of analog noise in discrete-time analog computations. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.

Received June 16, 1997; accepted December 2, 1997.
LETTER
Communicated by Eduardo Sontag
On the Effect of Analog Noise in Discrete-Time Analog Computations Wolfgang Maass Institute for Theoretical Computer Science, Technische Universität Graz, Graz, Austria
Pekka Orponen Department of Mathematics, University of Jyväskylä, Jyväskylä, Finland
We introduce a model for analog computation with discrete time in the presence of analog noise that is flexible enough to cover the most important concrete cases, such as noisy analog neural nets and networks of spiking neurons. This model subsumes the classical model for digital computation in the presence of noise. We show that the presence of arbitrarily small amounts of analog noise reduces the power of analog computational models to that of finite automata, and we also prove a new type of upper bound for the VC-dimension of computational models with analog noise.

1 Introduction

Analog noise is a serious issue in practical analog computation. However, there exists no formal model for reliable computations by noisy analog systems that allows this issue to be addressed in an adequate manner. We propose and investigate such a model in this article. The investigation of noise-tolerant digital computations in the presence of stochastic failures of gates or wires was initiated by von Neumann (1956). We refer to Cowan (1966), Pippenger (1989), and Gál (1991) for a small sample of the numerous results that have been achieved in this direction. In all these articles, one considers computations that produce a correct output not with perfect reliability but with probability ≥ 1/2 + ρ (for some parameter ρ ∈ (0, 1/2]). The same framework (with stochastic failures of gates or wires) has been applied to analog neural nets in Siegelmann (1994). The approaches noted are insufficient for the investigation of noise in analog computations, because one has to be concerned not only with occasional total failures of gates or wires, but also with imprecision—omnipresent smaller (and occasionally larger) perturbations of analog outputs of internal computational units. These perturbations may, for example, be given by gaussian distributions. Therefore, we introduce and investigate in this article a notion of noise-robust computation by noisy analog systems where we assume that intermediate analog values are moved according
to some quite arbitrary probability distribution. We consider, as in the traditional framework for noisy digital computations, arbitrary computations whose output is correct with some given probability ≥ 1/2 + ρ (for ρ ∈ (0, 1/2]). We restrict our attention to analog computations with digital output. Since we impose no restriction (such as continuity) on the type of operations that can be performed by computational units in an analog computational system, an output unit of such a system can convert an analog value into a binary output via thresholding. We show in theorem 1 that any language recognized by such a noisy analog computational system is regular. Our model and the theorems are somewhat related to the analysis of probabilistic finite automata in Rabin (1963), although in Rabin's case the finiteness of the state-space simplifies the setup considerably. Continuous-space noise models similar to ours have been used in general studies of the stability of dynamical systems affected by random perturbations (e.g., Kifer, 1988), but our work is to our knowledge the first to consider the computational aspects of systems of this type. More specific hardware-oriented models for analog noise in analog neural nets have been discussed in Phatak and Koren (1995). Another related work is Casey (1996), which addresses the special case of analog computations on recurrent neural nets, where the analog noise can move an internal state at most over some bounded distance η, and the digital output is required to be perfectly reliable (ρ = 1/2 in the present notation). Casey's corollary 3.1 states a special case of our theorem 1 for the model considered in that article. Casey's proof of corollary 3.1 is incorrect.¹ A correct proof is contained as a special case in the proof of theorem 1 in section 3 of this article.² Apart from corollary 3.1 there is no further overlap between Casey (1996) and this article.

There are relatively few examples of nontrivial computations on common digital or analog computational models that can achieve perfect reliability of the output in spite of noisy internal components. Most constructions of noise-robust computational models rely on the replication of noisy computational units (see von Neumann, 1956; Cowan, 1966). The idea of this method is that the average of the outputs of k identical noisy networks (with stochastically independent noise processes) is more reliable than the output
¹ Corollary 3.1 is derived as a corollary of theorem 1 in Casey (1996), whose proof relies on the assumption that the recognized language is regular. The proof given for corollary 3.1 is the following: "The proof of a corollary is simply to notice that by the compactness of the phase space it can contain only a finite number of disjoint sets with nonempty interior." The following counterexample shows that this argument is wrong: the intervals [1/(2i + 1), 1/2i] for i = 1, 2, . . . are infinitely many disjoint sets with nonempty interior, which are all contained in the compact set [0, 1].
² Actually, there is no need to analyze probability distributions for this special case. One can prove corollary 3.1 of Casey (1996) more directly by considering the equivalence relation defined at the beginning of section 3, and by deriving a lower bound for the volume of the set of states that correspond to an equivalence class.
of a single network. However, there exists in general a small but nonzero probability that this average deviates strongly from its expected value. In addition, if one assumes that the computational unit that produces the output of the computation is also noisy, one cannot expect the reliability of the output of the computation to be larger than the reliability of this last computational unit. Consequently, there exist many methods for reducing the error probability of the output to a small value, but these methods cannot achieve error probability 0 at the output. In addition, if one wants to investigate computations with common noise distributions such as gaussian noise, which may in principle move a state to any other state, it is necessary to move to a computational model with less than perfect reliability of the output bit, since otherwise the model would not be able to carry out any nontrivial computations. Therefore, we focus our attention in this article on the general case where the reliability of the network output is just required to be ≥ 1/2 + ρ for some ρ ∈ (0, 1/2].

Unfortunately, an investigation of computations with less than perfect reliability requires a more complex mathematical analysis. In a computational model with perfect reliability of the output, it cannot happen that an intermediate state q occurs at some step t both in a computation for an input x that leads to output 0 and at step t in a computation for the same input x that leads to output 1. Hence an analysis of perfectly reliable computations can focus on partitions of intermediate states q according to the computations and the computation steps where they may occur. In contrast, in a computational model with less than perfect reliability of the output bit, the same internal state q may occur at an intermediate step in computation paths that yield different output bits. Hence for such a model, one has to analyze probability distributions over intermediate states q.

Consider, for example, the special case of a sigmoidal neural net (with thresholding at the output), where for each input the output of an internal noisy sigmoidal gate is distributed according to some gaussian distribution (perhaps restricted to the range of all possible output values this sigmoidal gate can actually produce). In this case, an intermediate state q of the computational system is a vector of values produced by these gaussian distributions for different sigmoidal gates. Obviously each such intermediate state q can occur at any fixed step t in any computation (in particular in computations with different network output for the same network input). Hence perfect reliability of the network output is unattainable in this case. For an investigation of the actual computational power of a sigmoidal neural net with gaussian noise, one has to drop the requirement of perfect reliability of the output and instead analyze how probable it is that a particular network output is given and that a certain intermediate state is assumed. Hence, one has to analyze for each network input and each step t the different probability distributions over intermediate states q that are induced by computations of the noisy analog computational system. In fact, one may view the set of these probability distributions over intermediate states q
as a generalized set of states of a noisy analog computational system. In general the mathematical structure of this generalized set of states is substantially more complex than that of the original set of intermediate states q. In section 2, we define a rigorous mathematical model for this type of noisy analog computation and introduce some basic methods for analyzing this generalized set of states. The preceding remarks may illustrate that if one drops the assumption of perfect reliability, then a more complex variety of computations becomes possible, and the computational power of a system may potentially increase. In fact, in theoretical computer science, a substantial number of constructions rely on the premise that the computational power of a digital computational system does in fact increase if it gets access to random bits and less than perfect reliability of the output bit is tolerated. This is relevant for the discussions of this article, since internal noise of a noisy computational system may also be viewed as something positive: as a free source of random numbers, which may actually be helpful for certain computations. In section 3 we prove an upper bound for the computational power of noisy analog computational systems that limits the potential impact of such effects in analog computation. We show that under mild constraints on the noise characteristics, noisy analog systems with bounded finite-dimensional state-spaces have at most the computational power of finite automata. This upper bound is quite general, and it also covers practically relevant special cases such as systems with dependencies among different sources of stochasticity, as well as noisy computations in hybrid analog-digital computational models, such as a neural net combined with a binary register, or a network of noisy spiking neurons where a neuron may temporarily assume the discrete state not firing. One goal of our investigation of the effects of analog noise is to find out which features of the noise process have the most detrimental effect on the computational power of an analog computational system. This turns out to be a nontrivial question. For example, one might think that analog noise that is likely to move an internal state over a large distance is more harmful than another type of analog noise that keeps an internal state within its neighborhood. However, this intuition is deceptive. Consider the extreme case of analog noise in a sigmoidal neural net that moves a gate output x ∈ [−1, 1] to a value in some ε-neighborhood of −x, and compare it with noise that moves x to an arbitrary value in the 10ε-neighborhood of x. The first type of noise moves some values x over large distances but is likely to be less harmful for noise-robust computing than the second type, as the large jump from x to −x represents just a recoding of the output value. As a first step toward characterizing those aspects and parameters of analog noise that have a strong impact on the computational power of a noisy analog system, the proof of theorem 1 provides an explicit bound on the number of states of any finite automaton that can be implemented by an
analog computational system with a given type of analog noise. It is quite surprising to see on which specific parameters of the analog noise the bound depends (cf. the remark at the end of section 3).

In section 4 we prove a partial converse to the upper bound result in section 3 by showing that if one only considers bounded noise processes (where the analog noise can move an internal state at most over a distance η, for a sufficiently small value of η), then any finite automaton can be simulated with perfect (ρ = 1/2) reliability by a recurrent analog neural net of the type discussed in Anderson, Silverstein, Ritz, and Jones (1988) and Siegelmann and Sontag (1991). Other embeddings of finite automata in recurrent sigmoidal networks include Frasconi, Gori, Maggini, and Soda (1996) and Omlin and Giles (1996), which discuss, respectively, implementations of automata in noise-free radial basis function networks and in second-order networks with synaptic noise.

In section 5 we establish a new type of upper bound for the VC-dimension of computational models with analog noise. We show that in the presence of arbitrarily small amounts of analog noise, there exists an upper bound for the VC-dimension of, for example, neural nets that is independent of the total number of units in the case of a feedforward architecture, and independent of the computation time in the case of a recurrent neural net. This contrasts with the anomaly that in the noise-free setting, the classes of finite recurrent analog neural nets (Siegelmann & Sontag, 1991) and finite recurrent networks of spiking neurons (Maass, 1996) have infinite VC-dimension, and are thus strongly unlearnable from the point of view of learning theory. Again, the proofs of theorem 3 and its corollaries 3 and 4 provide explicit (although very large) upper bounds for the VC-dimension of noisy analog neural nets with batch input, which depend on specific parameters of the analog noise.

2 Preliminaries: Computational Systems and Noise Processes

We shall define our computational model first in the noise-free setting and then consider the effect of noise on computations separately. An analog discrete-time computational system (briefly: computational system) M is defined in a general way as a five-tuple ⟨Ω, p₀, F, Σ, s⟩, where Ω, the set of states, is a bounded subset of R^d, p₀ ∈ Ω is a distinguished initial state, F ⊆ Ω is the set of accepting states, Σ is the input domain, and s : Ω × Σ → Ω is the transition function. To avoid unnecessary pathologies, we impose the conditions that Ω and F are Borel subsets of R^d, and that for each a ∈ Σ, s(p, a) is a measurable function of p. We also assume that Σ contains a distinguished null value t, which may be used to pad the actual input to arbitrary length. The nonnull input domain is denoted by Σ₀ = Σ − {t}.

The intended noise-free dynamics of such a system M is as follows. The system starts its computation in state p₀, and on each single computation
step on an input element a ∈ Σ₀, it moves from its current state p to its next state s(p, a). After the actual input sequence has been exhausted, M may still continue to make pure computation steps, which lead it from a state p to the state s(p, t). The system accepts its input if it enters a state in the class F at some point after the input has finished. (We give a more precise definition of the dynamics, including the effects of noise, later.)

For instance, the recurrent analog neural net model of Siegelmann and Sontag (1991) (also known as the "brain state in a box" model of Anderson et al., 1988) is obtained from this general framework as follows. For a network N with d neurons and activation values between −1 and 1, the state-space is Ω = [−1, 1]^d. The input domain may be chosen as either Σ = R or Σ = {−1, 0, 1} (for online input) or Σ = Rⁿ (for batch input). In each case the value zero (or the zero vector) serves conveniently as the null value t. For simplicity, we treat here formally only the cases where Σ ⊆ R; the extensions to the case Σ = Rⁿ are straightforward. The transition function s : Ω × Σ → Ω is in this model given in terms of a d × d weight matrix W = (w_ij), a d-component bias vector h = (h_i), a d-component input weight vector c = (c_i), and a neuron activation function σ : R → [−1, 1]. For any p ∈ Ω and a ∈ Σ, we define s(p, a) = p⁺, where for each i = 1, . . . , d,

p⁺_i = σ( ∑_{j=1}^{d} w_ij p_j + h_i + c_i a ).
Both Anderson et al. (1988) and Siegelmann and Sontag (1991) use the saturated-linear sigmoid activation function

σ(u) = −1 if u < −1,  u if −1 ≤ u ≤ 1,  1 if u > 1,

but one may obviously also define the model with respect to other activation functions, notably the standard sigmoid σ(u) = tanh u, or the discontinuous signum function

sgn(u) = −1 if u < 0,  1 if u ≥ 0,

the latter choice yielding the model of recurrent threshold logic networks. The initial state in each of these models may be chosen as p₀ = (−1, . . . , −1), and the set of accepting states is determined by the activity of some specific output unit, say unit 1, so that F = {p ∈ Ω | p₁ > θ}, for some threshold value θ > 0. In the sequel, we shall use σ to denote the componentwise extension of the chosen activation function to state vectors, so that for any p ∈ Ω, σ(p) := (σ(p₁), . . . , σ(p_d)). This convention lets us write the transition function s defined above compactly as s(p, a) = σ(Wp + h + ac).
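To fix ideas, the noise-free update of this model is a single saturated affine map. The following sketch (Python; the dimensions and weight values are hypothetical, not from the article) implements s(p, a) = σ(Wp + h + ac) with the saturated-linear activation:

    import numpy as np

    def sat_linear(u):
        # Saturated-linear sigmoid: identity on [-1, 1], clipped outside.
        return np.clip(u, -1.0, 1.0)

    def step(p, a, W, h, c, sigma=sat_linear):
        # One noise-free transition s(p, a) = sigma(W p + h + a c).
        return sigma(W @ p + h + a * c)

    # Hypothetical 3-unit net started in p0 = (-1, -1, -1), reading input a = 1.
    d = 3
    rng = np.random.default_rng(0)
    W, h, c = rng.normal(size=(d, d)), rng.normal(size=d), rng.normal(size=d)
    p_next = step(np.full(d, -1.0), 1.0, W, h, c)

Replacing sat_linear by np.tanh gives the standard sigmoid variant of the model.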
Feedforward analog neural nets may also be modeled in the same manner, except that in this case, one may wish to select as the state set Ω := ([−1, 1] ∪ {dormant})^d, where dormant is a distinguished value not in [−1, 1]. This special value is used to indicate the state of a unit whose inputs have not all yet been available at the beginning of a given computation step (e.g., for units on the lth layer of a net at computation steps t < l).

The completely different model of a network of m stochastic spiking neurons (see, e.g., Gerstner & van Hemmen, 1994, or Maass, 1997) is also a special case of our general framework. In this case one wants to set Ω_sp := (⋃_{j=1}^{l} [0, T)^j ∪ {not-firing})^m, where T > 0 is a sufficiently large constant so that it suffices to consider only the firing history of the network during a preceding time interval of length T in order to determine whether a neuron fires (e.g., T = 30 ms for a biological neural system). If one partitions the time axis into discrete time windows [0, T), [T, 2T), . . . , then in the noise-free case, the firing events during each time window are completely determined by those in the preceding one. A component p_i ∈ [0, T)^j of a state in this set Ω_sp indicates that the corresponding neuron i has fired exactly j times during the considered time interval, and it also specifies the j firing times of this neuron during this interval. Due to refractory effects, one can choose l < ∞ for biological neural systems, for example, l = 15 for T = 30 ms. With some straightforward formal operations, one can also write this state set Ω_sp as a bounded subset of R^d for d := l · m.

Let us then consider the effect of noise on computations. Let Z(p, B) be a function that for each state p ∈ Ω and Borel set B ⊆ Ω indicates the probability of noise corrupting state p into some state in B. The function Z is called the noise process affecting M, and it should satisfy the mild conditions of being a stochastic kernel (Feller, 1971, p. 205), that is, for each p ∈ Ω, Z(p, ·) should be a probability distribution, and for each Borel set B, Z(·, B) should be a measurable function. We assume that there is some measure µ over Ω so that Z(p, ·) is absolutely continuous with respect to µ for each p ∈ Ω; that is, µ(B) = 0 implies Z(p, B) = 0 for every measurable B ⊆ Ω. By the Radon–Nikodym theorem (Feller, 1971, p. 140), Z then possesses a density kernel with respect to µ; that is, there exists a function z(·, ·) such that for any state p ∈ Ω and Borel set B ⊆ Ω,

Z(p, B) = ∫_{q∈B} z(p, q) dµ.
We assume that this function z(·, ·) has values in [0, ∞) and is measurable. (Actually, in view of our other conditions, this can be assumed without loss of generality.) The dynamics of a computational system M affected by a noise process Z is now defined as follows. If the system starts in a state p, the distribution of states q obtained after a single computation step on input a ∈ Σ is given
by the density kernel π_a(p, q) = z(s(p, a), q). (Note that as a composition of two measurable functions, π_a is again a measurable function.) The long-term dynamics of the system is given by a Markov process, where the distribution π_{xa}(p, q) of states after |xa| computation steps with input xa ∈ Σ* starting in state p is defined recursively by

π_{xa}(p, q) = ∫_{r∈Ω} π_x(p, r) · π_a(r, q) dµ.

One can easily verify by induction on |u| that

π_{xu}(p, q) = ∫_{r∈Ω} π_x(p, r) · π_u(r, q) dµ

for all x, u ∈ Σ* of length ≥ 1. Let us denote by π_x(q) the distribution π_x(p₀, q)—the distribution of states of M after it has processed string x, starting from the initial state p₀.

Let ρ > 0 be the required reliability level. In the most basic version, the system M accepts (rejects) some input x ∈ Σ₀* if ∫_F π_x(q) dµ ≥ 1/2 + ρ (respectively, ≤ 1/2 − ρ). In less trivial cases, the system may also perform pure computation steps after it has read all of the input. Thus, we define more generally that the system M recognizes a set L ⊆ Σ₀* with reliability ρ if for any x ∈ Σ₀*:

x ∈ L ⇔ ∫_F π_{xu}(q) dµ ≥ 1/2 + ρ for some u ∈ {t}*,
x ∉ L ⇔ ∫_F π_{xu}(q) dµ ≤ 1/2 − ρ for all u ∈ {t}*.
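The recognition criterion can be approximated empirically. Here is a minimal Monte Carlo sketch (Python; illustrative only, with the transition, noise kernel, and accepting set supplied as functions) that estimates ∫_F π_x(q) dµ by running independent noisy trajectories and counting how often the final state lands in F:

    import numpy as np

    def estimate_acceptance(x, step, noise, in_F, p0, trials=10_000):
        # Estimate the probability that the noisy system, started in p0 and
        # fed the input string x, ends in the accepting set F.
        #   step(p, a): the noise-free transition s(p, a)
        #   noise(p):   one sample from the noise kernel Z(p, .)
        #   in_F(p):    True iff p lies in F
        hits = 0
        for _ in range(trials):
            p = np.asarray(p0, dtype=float)
            for a in x:                 # one noisy computation step per symbol
                p = noise(step(p, a))
            hits += bool(in_F(p))
        return hits / trials            # approximates the integral of pi_x over F

Recognition with reliability ρ then means that for strings in L this estimate (possibly after additional pure computation steps on the null symbol t) converges to a value ≥ 1/2 + ρ, while for strings outside L it stays ≤ 1/2 − ρ.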
This definition also covers the case of batch input, where |x| = 1 and Σ₀ is typically quite large (e.g., Σ₀ = Rⁿ).

One gets a reasonably realistic model for noise in an analog neural net with state-space Ω = [−1, 1]^d by defining the noise process Z so that it reflects a clipped gaussian distribution. Without more specific knowledge about the noise source, this appears to be the most appropriate model for analog noise in an analog neural net. One assumes in this model that for any computation step, the intended output p_i ∈ [−1, 1] of the ith unit of the net is replaced by a clipped gaussian distribution of values q_i ∈ [−1, 1], where values < −1 (> 1) are rounded to −1 (respectively, 1). If one assumes that this rounding occurs independently for each of the d units i in the network and, for simplicity, that all the underlying gaussians have the same variance, then one arrives in our general framework for a noisy computational system M at a noise process Z where Z(p, ·) is defined for each p ∈ Ω = [−1, 1]^d by a symmetric gaussian distribution with density z(p, q) = ν(‖q − p‖) around p, but with all values q_i < −1 (q_i > 1) of the occurring states ⟨q₁, . . . , q_d⟩ rounded to −1 (respectively, 1). (Here ‖v‖ denotes the Euclidean norm of a vector v, and ν is the density function of some symmetric d-variate gaussian distribution.) Since such a rounding process will assign probability > 0 to the lower-dimensional bounding hyperrectangles of Ω, we cannot simply define µ as the Lebesgue measure over Ω in order to subsume this type of analog noise under our general noise model. Rather, one has to decompose Ω into components Ω₁, . . . , Ω_k (representing the interior Ω₁ and the lower-dimensional bounding hyperrectangles Ω₂, . . . , Ω_k of Ω = [−1, 1]^d), and define µ as a sum of measures µ₁ + · · · + µ_k, where µ₁ is the Lebesgue measure over Ω₁ and µ₂, . . . , µ_k are Lebesgue measures for the lower-dimensional spaces Ω₂, . . . , Ω_k.
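A minimal sketch of this clipped gaussian noise process (Python; the standard deviation is a hypothetical choice, not prescribed by the article) reads as follows; composed with the transition step of the net, noise(step(p, a)) is exactly the noisy update analyzed above:

    import numpy as np

    _rng = np.random.default_rng()

    def clipped_gaussian_noise(p, sigma=0.05):
        # One sample from Z(p, .): perturb every component of p by i.i.d.
        # gaussian noise, then round values outside [-1, 1] to -1 or 1.
        # The rounding assigns probability > 0 to the boundary faces of the
        # cube, which is why mu must mix Lebesgue measures of different
        # dimensions, as explained above.
        q = p + _rng.normal(0.0, sigma, size=np.shape(p))
        return np.clip(q, -1.0, 1.0)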
In the case of a network of spiking neurons, the noise model has to take into account not only that the firing time of a neuron is subject to some jitter (which can be modeled by a gaussian distribution), but also that neurons may randomly fail to fire, or may fire "spontaneously" (even when they would not fire in the corresponding deterministic model). All these effects can be modeled by a suitable noise process Z defined on the state-space Ω_sp discussed earlier, with a measure µ over Ω defined by a decomposition of Ω similar to that in the case of analog neural nets.

3 An Upper Bound for the Computational Power of Systems with Analog Noise

It has been shown for various concrete models of analog computation without noise, such as generalized shift maps (Moore, 1990), recurrent neural nets (Siegelmann & Sontag, 1991), and networks of spiking neurons (Maass, 1996), that they can simulate a universal Turing machine and hence have immense computational power. It has long been conjectured that their computational power collapses to that of a finite automaton as soon as one assumes that they are subject to even small amounts of analog noise. We provide in this section a proof of this conjecture. Furthermore, we make explicit on which parameters of the analog noise the required number of states of a simulating finite automaton depends.

Our proof requires a mild continuity assumption for the density functions z(r, ·), which is satisfied in all concrete cases that we have considered. We do not require any global continuity property over Ω for the density functions z(r, ·), because of the previously discussed concrete cases where the state-space Ω is a disjoint union of subspaces Ω₁, . . . , Ω_k with different measures on each subspace. We only assume that for some arbitrary partition of Ω into Borel sets Ω₁, . . . , Ω_k, the density functions z(r, ·) are uniformly continuous over each Ω_j, with moduli of continuity that can be bounded independent of r. In other words, we require that z(·, ·) satisfies the following condition: a function π(·, ·) from Ω² into R is called piecewise equicontinuous if for every ε > 0 there is a δ > 0 such that for every r ∈ Ω and for all p, q ∈ Ω_j, j = 1, . . . , k:

‖p − q‖ ≤ δ implies |π(r, p) − π(r, q)| ≤ ε.  (3.1)
Note that because the state-space Ω is bounded, any restriction π(r, ·) of a piecewise equicontinuous function π(·, ·) to a fixed r ∈ Ω has bounded range. If z(·, ·) satisfies condition 3.1, we also call the resulting noise process Z piecewise equicontinuous. Our preceding discussions suggest that all practically relevant noise processes Z have this property.

To formulate our result, we need a notion of regular sets of sequences over arbitrary domains Σ₀, which we define as follows. Let L ⊆ Σ₀* be a set of sequences over an input domain Σ₀. Sequences x, y ∈ Σ₀* are equivalent with respect to L if one has xw ∈ L ⇔ yw ∈ L for all w ∈ Σ₀*. The set L is regular if this equivalence relation has only finitely many equivalence classes. By the Myhill–Nerode theorem (Hopcroft & Ullman, 1979, pp. 65–67), for finite alphabets Σ₀ this definition coincides with the usual definition of regular sets via finite automata. From the point of view of computational complexity theory, machine models that accept only regular sets belong to the most "primitive" class of models. In contrast to Turing machines and other universal computational models, the number of internal states of such machine models is fixed, independent of the length of the input string.

Theorem 1. Let L ⊆ Σ₀* be a set of sequences over an arbitrary input domain Σ₀. Assume that some computational system M, affected by a piecewise equicontinuous noise process Z, recognizes L with reliability ρ, for some arbitrary ρ > 0. Then L is regular.

Proof. Let M = ⟨Ω, p₀, F, Σ, s⟩, where Σ = Σ₀ ∪ {t}, be the system in question recognizing L. We shall show that there are only finitely many equivalence classes of sequences with respect to L.

We begin by observing that if for two sequences x, y ∈ Σ₀* the distributions π_x(·) and π_y(·) are sufficiently close, then x and y are equivalent. To see this, assume that ∫_{r∈Ω} |π_x(r) − π_y(r)| dµ ≤ ρ, and suppose for a contradiction that x and y are not equivalent. Then there exists some w ∈ Σ₀* with xw ∈ L ⇔ yw ∉ L. Without loss of generality, assume that xw ∈ L. Thus, there exists some u ∈ {t}* with ∫_F π_{xwu}(q) dµ ≥ 1/2 + ρ and ∫_F π_{ywu}(q) dµ ≤ 1/2 − ρ. This yields the contradiction

2ρ ≤ | ∫_{q∈F} π_{xwu}(q) dµ − ∫_{q∈F} π_{ywu}(q) dµ |
   = | ∫_{q∈F} ∫_{r∈Ω} π_x(r) · π_{wu}(r, q) dµ dµ − ∫_{q∈F} ∫_{r∈Ω} π_y(r) · π_{wu}(r, q) dµ dµ |
   ≤ ∫_{q∈F} ∫_{r∈Ω} |π_x(r) − π_y(r)| · π_{wu}(r, q) dµ dµ
   = ∫_{r∈Ω} |π_x(r) − π_y(r)| · ( ∫_{q∈F} π_{wu}(r, q) dµ ) dµ
   ≤ ρ.
Thus we have shown that ∫_{r∈Ω} |π_x(r) − π_y(r)| dµ ≤ ρ implies that x, y ∈ Σ₀* are equivalent.

Next we observe that all the density functions π_x(·) for x ∈ Σ* are piecewise uniformly continuous, with the same bounds on their moduli of continuity as the noise density functions z(r, ·) have. This is verified by induction on |x|. Given ε > 0, let δ > 0 be such that the density function z(·, ·) satisfies condition 3.1 for all r ∈ Ω and j = 1, . . . , k. We then have for any x ∈ Σ⁺, a ∈ Σ, and all p, q ∈ Ω_j such that ‖p − q‖ ≤ δ:

|π_{xa}(p) − π_{xa}(q)| ≤ ∫_{r∈Ω} π_x(r) · |π_a(r, p) − π_a(r, q)| dµ
                      = ∫_{r∈Ω} π_x(r) · |z(s(r, a), p) − z(s(r, a), q)| dµ
                      ≤ ε · ∫_{r∈Ω} π_x(r) dµ = ε.
The preceding observation now implies that the space of all functions π_x(·) for x ∈ Σ₀* can be partitioned into finitely many classes C so that any two functions π_x(·), π_y(·) in the same class C satisfy ∫_{r∈Ω} |π_x(r) − π_y(r)| dµ ≤ ρ, and hence correspond to sequences that are equivalent with respect to L. Such a partition can, for example, be achieved in the following way. Using the piecewise uniform continuity of the π_x(·), choose from within each component Ω_j of Ω a finite set (or "grid") G_j that is so dense that for each r ∈ Ω_j, if t_r ∈ G_j is the grid point closest to r, then |π_x(r) − π_x(t_r)| ≤ ρ/4µ(Ω). (To see that such a finite G_j always exists, note that given the value δ > 0 corresponding to ε = ρ/4µ(Ω) in condition 3.1, one can by the Bolzano–Weierstrass theorem choose only a finite number of points t from within the bounded set Ω_j so that any two distinct chosen points t, t′ are more than a distance δ apart.) Take G = ⋃_{j=1}^{k} G_j. Now partition the (bounded!) range of all functions π_x(·) into finitely many intervals I of length ρ/2µ(Ω), and place two functions π_x(·), π_y(·) in the same class C if for every grid point t ∈ G the values of π_x(t) and π_y(t) fall into the same interval I. Then for any two functions π_x(·), π_y(·) in the same class C, it is the case that for any r ∈ Ω_j ⊆ Ω, j = 1, . . . , k,

|π_x(r) − π_y(r)| ≤ |π_x(r) − π_x(t_r)| + |π_x(t_r) − π_y(t_r)| + |π_y(t_r) − π_y(r)| ≤ ρ/µ(Ω),

and thus ∫_{r∈Ω} |π_x(r) − π_y(r)| dµ ≤ (ρ/µ(Ω)) · ∫_{r∈Ω} dµ = ρ.
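The finite partition constructed in this proof can be made concrete: each distribution π_x receives a finite "signature" obtained by quantizing its values at the grid points into intervals of length ρ/2µ(Ω). Distributions with equal signatures lie within ρ of each other in L1 distance, so the finite number of possible signatures bounds the number of states of a simulating automaton. A minimal sketch (Python; illustrative only, with grid, ρ, and µ(Ω) assumed to be given):

    def signature(pi, grid, rho, mu_omega):
        # Quantize the density pi at every grid point into intervals of
        # length rho / (2 * mu_omega). By the triangle inequality used in
        # the proof above, equal signatures imply L1 distance at most rho.
        width = rho / (2.0 * mu_omega)
        return tuple(int(pi(t) // width) for t in grid)

Since the range of the densities is bounded, only finitely many values can occur at each of the |G| grid points, so only finitely many signatures, and hence finitely many equivalence classes of strings, can arise.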
Remark. In stark contrast to the results of Siegelmann and Sontag (1991) and Maass (1996) for the noise-free case, the preceding theorem implies that
both recurrent analog neural nets and recurrent networks of spiking neurons with online input from Σ₀* can only recognize regular languages in the presence of any reasonable type of analog noise, even if their computation time is unlimited and if they employ arbitrary real-valued parameters.

Remark. The proof of theorem 1 relies on an analysis of the space of probability density functions over the state set Ω. An upper bound on the number of states of a deterministic finite automaton that simulates M can be given in terms of the number k of components Ω_j of the state set Ω, the dimension and diameter of Ω, a bound on the values of the noise density function z, and the value of δ corresponding to ε = ρ/4µ(Ω) in condition 3.1.

4 Noisy Analog Neural Nets Recognize Regular Languages

Let us say that a noise process Z defined on a set Ω ⊆ R^d is bounded by η if it can move a state p only to other states q that have a distance ≤ η from p in the L∞-norm over R^d, that is, if its density kernel z has the property that for any p = ⟨p₁, . . . , p_d⟩ and q = ⟨q₁, . . . , q_d⟩ ∈ Ω, z(p, q) > 0 implies that |q_i − p_i| ≤ η for all i = 1, . . . , d. As a partial converse to the upper-bound result of the previous section, we now prove that regular languages over the alphabet {−1, 1} can be recognized with perfect reliability (ρ = 1/2) by recurrent analog neural nets, as long as the noise process affecting the computation is bounded by a certain constant η > 0.

The basic idea of our proof is first to construct a threshold logic network T recognizing the regular language under consideration, and then to simulate T with a noise-tolerant analog neural net. However, in order to obtain the tolerance versus delay trade-off results in a uniform manner, we derive them as corollaries from a general result on simulating threshold logic networks by noisy recurrent analog neural nets.

Consider a d-unit threshold logic network T (cf. section 2) with transition function s(p, a) = sgn(Wp + h + ac), where W ∈ R^{d×d} is the weight matrix of T, h ∈ R^d is the bias vector, and c ∈ R^d is the input weight vector. Let us say that T has separation θ if, at each unit, the argument to the signum function is always at least θ away from zero; that is, if |w_i^T p + h_i + c_i a| ≥ θ always holds, for every i = 1, . . . , d, p ∈ {−1, 1}^d, and a ∈ {−1, 0, 1}. Any threshold logic network operating on the input alphabet {−1, 0, 1} may be modified to have some nonzero separation value by adjusting the bias values appropriately. An important special case is that of networks with integer weights, which may be adjusted to have separation 1. (On input values a ∈ {−1, 1} this is straightforward; dealing with the value a = 0 may in some cases require modifying the network structure.)

Theorem 2. Let a language L ⊆ {−1, 1}* be recognized by some d-unit threshold logic network T with separation θ > 0, and let w_max be the maximum total input
weight to any unit of T (w_max = max_i ∑_j |w_ij|). Let η be a constant satisfying η < θ/w_max. Then L can also be recognized by a d-unit recurrent analog neural net N that has perfect reliability (ρ = 1/2) when affected by any noise process Z bounded by η. The activation function of N may be any function σ satisfying σ(u) → −1 for u → −∞ and σ(u) → 1 for u → ∞.

Proof. The idea of the proof is simply to simulate the threshold logic network T with an analog neural network N by forcing the analog units always to operate close to saturation (in states u such that σ(u) is within δ of ±1, for some small constant δ), so that they in effect function as threshold logic units. This is achieved by multiplying the weights in N by a sufficiently large constant m.

Thus, let a language L be recognized by a d-unit threshold logic network T with transition function p⁺ = s(p, a) = sgn(Wp + h + ac) and separation θ. Let δ and u_δ be constants such that the noise bound is η < θ/w_max − δ, and for all u ≥ u_δ, |1 − σ(u)| ≤ δ, and for all u ≤ −u_δ, |(−1) − σ(u)| ≤ δ. Now consider the analog network N obtained from T by multiplying all the weights and thresholds by a constant

m ≥ u_δ / (θ − w_max(η + δ)),

and replacing the signum nonlinearities by the sigmoids. We claim that N reproduces the behavior of T exactly, in the sense that the state of N at each time step, before noise is applied, is within δ of the corresponding state of T.

Assume that the claim is true at some given time, when the state of T is some p ∈ {−1, 1}^d, and that of N correspondingly p̃ = p + r, for some r ∈ [−δ, δ]^d. Consider then the update of N, first with a noise vector e = q̃ − p̃, where q̃ is generated according to some componentwise η-bounded noise density z(p̃, q̃), and then with the network transition function

p̃⁺ = σ(mW q̃ + mh + mac) = σ(mW(p + r + e) + mh + mac) = σ(m(Wp + h + ac) + mW(r + e)).

Considering the argument vector to the sigmoid componentwise, we obtain for each i = 1, . . . , d the bound

|m(w_i^T p + h_i + c_i a) + m w_i^T (r + e)| ≥ mθ − m w_max(δ + η) ≥ u_δ.

By our choice of the value u_δ, we are thus again ensured that the components of the new state vector p̃⁺ of N are within δ of the corresponding components of the state vector p⁺ of T. The claim follows by induction.
One technicality concerning the choice of final states in the network N still needs to be pointed out. Even though in the network T the final states may be defined as, say, F_T = {p ∈ {−1, 1}^d | p₁ = 1}, noise in the network N also affects the state of the output unit, and so the final states there should be defined as F_N = {p ∈ [−1, 1]^d | p₁ ≥ 1 − η}, if the noise is bounded by η.

Corollary 1. For every regular language L ⊆ {−1, 1}* there is a constant η > 0 such that L can be recognized with perfect reliability (ρ = 1/2) by a recurrent analog neural net in spite of any noise process Z bounded by η.

Proof. Let L be recognized by some finite automaton with m states. As presented in Minsky (1972, pp. 55–57), one can easily construct from this automaton a threshold logic network T with 2m + 1 units that recognizes L. In Minsky's construction, there is one threshold logic unit for each (state, input symbol) pair of the simulated automaton, plus one unit that tests for the acceptance condition. (Actually, our model mandates testing also for input termination, which requires adding a few extra units.) A unit is activated (goes to state 1) when it receives an excitatory signal from some preceding (state, symbol) unit and its input line. All the nonzero weights in T have absolute value 1, and the units have fan-in at most 2m + 1. Since this network satisfies the conditions of theorem 2 with θ = 1 and w_max = 2m + 1, we may choose any value of η < 1/(2m + 1).

The next corollary shows that we can increase the noise tolerance of a network by slowing the computation. Given an integer constant τ ≥ 1, let us say that a network N recognizes a language L with delay τ if, for every string x = a₁, . . . , a_k ∈ {−1, 1}*, x ∈ L if and only if N accepts the string a₁^τ, . . . , a_k^τ (each input symbol a_i is repeated τ times before the next one is presented).

Corollary 2. For every regular language L ⊆ {−1, 1}* there is a constant delay value τ such that for any η < 1/2, L can be recognized with delay τ with perfect reliability (ρ = 1/2) by a recurrent analog neural net that may be subject to any noise process Z bounded by η.

Proof. Let again L be recognized by some finite automaton with m states. The threshold logic units used in the simulation of corollary 1 simply test for simultaneous activity on any one of the lines coming from the preceding (state, symbol) units and the appropriate input line. Thus, each such unit can be replaced by a tree of fan-in-2 OR gates and a concluding AND gate. Considering that the maximum fan-in of the original units is 2m + 1, the AND-OR trees may be constructed to have height τ = ⌈log₂ m⌉ + 2. The resulting network then has integer weights, with w_max = 2, and recognizes the language L with delay τ.
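To see what theorem 2 and corollary 1 give quantitatively, here is a small sketch (Python; the numbers are hypothetical, and σ = tanh is one admissible activation, for which u_δ = artanh(1 − δ) guarantees |σ(u)| ≥ 1 − δ whenever |u| ≥ u_δ):

    import numpy as np

    def scaling_constant(theta, w_max, eta, delta):
        # Admissible weight multiplier m >= u_delta / (theta - w_max * (eta + delta))
        # for simulating a threshold net of separation theta by a tanh net
        # under componentwise noise bounded by eta (needs eta < theta/w_max - delta).
        assert eta < theta / w_max - delta, "noise bound too large"
        u_delta = np.arctanh(1.0 - delta)
        return u_delta / (theta - w_max * (eta + delta))

    # Corollary 1 setting: theta = 1 and w_max = 2m + 1 for an m-state automaton,
    # so any eta < 1/(2m + 1) is tolerated; here m = 5, so w_max = 11.
    print(scaling_constant(theta=1.0, w_max=11.0, eta=0.05, delta=0.02))

The multiplier grows without bound as η approaches θ/w_max, which is why corollary 2 trades computation delay for a larger tolerable η.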
Remark. One can obtain different noise tolerance versus delay trade-offs using the recent, more advanced simulations of finite automata by threshold logic networks (Alon, Dewdney, & Ott, 1991; Horne & Hush, 1996; Indyk, 1995). For instance, Horne and Hush (1996) present a simulation of m-state finite automata by threshold logic networks with O(√(m log m)) units, connection weights ±1, and delay 4. Thus, one can in corollary 2 achieve a noise-tolerance bound of η = O(1/√(m log m)) with delay τ = 4.

Remark. The precise values of the η bounds obtained above are proportional to the width of the interval used to encode unit states in the analog neural net model. The results are here formulated using the interval [−1, 1], and changes in this interval would have proportional effects on the η values. For instance, if the interval [0, 1] were used (as in Siegelmann & Sontag, 1991), the η bound in corollary 2 would decrease from 1/2 to 1/4.

5 A Novel Upper Bound for the VC-Dimension of Various Types of Neural Nets with Analog Noise

In this section we provide an example of the effect of analog noise on discrete-time analog computations with batch input. We focus our attention on the most common types of analog neural nets and show that in the presence of arbitrarily small amounts of analog noise, there exists an upper bound for the VC-dimension of such neural nets that is independent of the total number of gates in the case of a feedforward architecture, and independent of the computation time in the case of a recurrent neural net. It depends only on the structure of the first layer of the neural net (or, alternatively, of any other fixed layer). This novel type of upper bound depends, apart from the analog noise, only on those parameters of the net that are relevant for its first computation step, and it holds for arbitrary real-valued batch inputs and arbitrary real-valued "programmable parameters" (weights, etc.).

The resulting upper bounds for the required sample size of a noisy multilayer sigmoidal neural net extend a preceding result by Haussler (1992). He had shown in his corollary 3 that even in the noise-free case, an upper bound for the VC-dimension can be given that depends only on the maximal absolute value of weights for gates on layers ≥ 2 and on their maximal fan-in. In the present result, all dependence on parameters that concern gates on layers ≥ 2 is removed. It will become obvious from the proof of theorem 3 that our upper bound is actually of a quite general nature, and it can also be applied to various other models for discrete-time analog computation with analog noise that are not related to neural nets.

The VC-dimension (abbreviated VC-dim(F)) of an arbitrary class F of functions f : Rⁿ → {0, 1} is defined as follows. One says that F shatters a finite set S ⊆ Rⁿ if for every subset A ⊆ S there exists a function f ∈ F with
f(x) = 1 for x ∈ A and f(x) = 0 for x ∈ S − A. The VC-dimension of F is defined as VC-dim(F) := sup {|S| : S ⊆ Rⁿ is shattered by F}. The VC-dimension of F may be viewed as a measure of the expressibility (or degrees of freedom) of F. In particular, it provides for arbitrary finite sets D ⊆ Rⁿ an upper bound of the form |D|^{O(VC-dim(F))} for the number of functions D → {0, 1} that can be written as a restriction of a function in F to this finite domain D. As a consequence, the VC-dimension of F is the key parameter for estimating the number of randomly chosen examples that are needed to "learn" arbitrary target functions g : Rⁿ → {0, 1} from randomly chosen examples ⟨x, g(x)⟩ for g by a learning algorithm that uses functions from F as hypotheses (see Haussler, 1992; Vapnik & Chervonenkis, 1971; Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989; Maass, 1995). It should be noted that this holds not only for the "classical" probably approximately correct (PAC) learning model, where the target function g is required to belong to the class F, but, according to Haussler (1992), also in the general case of agnostic PAC learning, where g : Rⁿ → {0, 1} can be any function. Of course, the latter case is much more relevant for the theory of learning with neural nets, where the class F of possible "hypotheses" is fixed by the architecture of the neural net on which we run a learning algorithm, whereas the examples ⟨x, g(x)⟩ may arise from some arbitrary real-world classification problem for which we train the neural net.

It is obvious from the results of Siegelmann and Sontag (1991) and Maass (1996) that there exist finite recurrent analog neural nets and finite recurrent networks of spiking neurons with batch input and parameters from Q that have infinite VC-dimension (consider networks that can simulate a universal Turing machine, with each input bit-string encoded into a rational number). From the point of view of learning theory, an infinite VC-dimension is commonly interpreted as information-theoretic evidence that there exists no "learning algorithm" for such networks (not even one with unlimited computation time). We will show in this section that this "anomaly" disappears as soon as one takes into account that the neural net is subject to analog noise, even if the amount of such noise is arbitrarily small.
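Shattering can be tested directly for small instances. A brute-force sketch (Python; illustrative only) that checks whether a finite sample of functions realizes every dichotomy of a finite set S:

    def shatters(functions, S):
        # The class shatters S iff its functions realize all 2^|S| dichotomies.
        #   functions: a finite iterable of callables mapping a point to 0 or 1
        #   S:         a finite list of points
        realized = {tuple(f(x) for x in S) for f in functions}
        return len(realized) == 2 ** len(S)

The VC-dimension is then the largest |S| for which some set S passes this test; the results below bound that quantity for noisy analog nets.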
For technical reasons, we also discuss the pseudo-dimension P-dim(G) of a class G of real-valued functions g : Rⁿ → R. One can define P-dim(G) as the VC-dimension of the associated class

F := { f : R^{n+1} → {0, 1} : ∃ g ∈ G ( f(x, y) = 1 if g(x) ≥ y and f(x, y) = 0 if g(x) < y) }

of boolean-valued functions.

Consider now the computation of a system M = ⟨Ω, p₀, F, Rⁿ, s⟩ on a batch input vector x ∈ Rⁿ, affected by some piecewise equicontinuous noise process Z whose density function z has values in some range [0, B]. The distribution of states of M after k ≥ 1 computation steps is given by the density function π_{xu}(p), where |x| = 1 and u = t^{k−1}. For k > 1, this density
can be decomposed as

π_{xu}(p) = ∫_{q∈Ω} π_x(p₀, q) · π_u(q, p) dµ,

and for k = 1 we have simply π_{xu}(p) = π_x(p₀, p). This decomposition of the density function for the state-distribution of M will be essential for our subsequent results. We show in theorem 3 that there exists a finite upper bound for the VC-dimension of the class F of functions computable by a class M of such systems M (which receive arbitrary real-valued batch input) that does not depend on the complexity of the class H of functions π_u(·, ·) that describe the second part of the computations of these systems M after their first computation step.

Let M be a class of such systems, affected by the same piecewise equicontinuous noise process Z. For example, M can be the class of systems M that result from different weight assignments to some feedforward or recurrent analog neural net with some fixed architecture. Denote by G the class of all density kernels of the form π(x, q) := π_x(p₀, q) for systems M ∈ M, and by H the class of density kernels of the form ω(q, p) := π_u(q, p), for systems M ∈ M and sequences u ∈ {t}*. (As a special case, we also include the constant function 1 in H.) Then all the boolean functions computed with reliability ρ by the systems M ∈ M are included in the class F of functions f : Rⁿ → {0, 1} that are composed of a function π ∈ G and a function ω ∈ H so that for any x ∈ Rⁿ the integral

∫_{p∈F} ∫_{q∈Ω} π(x, q) · ω(q, p) dµ dµ has a value ≥ 1/2 + ρ if f(x) = 1, and else a value ≤ 1/2 − ρ.  (5.1)
Actually, the class F contains somewhat more than just the functions computed by systems from M, because the two component functions π and ω in equation 5.1 may come from two different systems in M (for example, from two different weight assignments to a recurrent analog neural net). In theorem 3 we consider an even more general setup where one has two bounded state sets Ω ⊆ R^d and Ω′ ⊆ R^{d′}, measures µ over Ω and µ′ over Ω′, as well as a Borel set F ⊆ Ω′ of accepting final states. (In applications Ω is typically the set of possible intermediate states after a fixed number l (e.g., l = 1) of computation steps, and Ω′ is the set of possible output states of a computation. One has d ≠ d′ if, for example, the number d of units on the first hidden layer of a feedforward sigmoidal neural net differs from the number d′ of output nodes of the net; see corollary 3.)

We assume in theorem 3 that G is an arbitrary class of piecewise equicontinuous density kernels π : Rⁿ × Ω → [0, B] with uniformly bounded moduli of continuity (as in condition 3.1), that H is an arbitrary class of density kernels ω : Ω × Ω′ → R⁺, that ρ > 0 is an arbitrary given parameter, and that F is the class of functions f : Rⁿ → {0, 1} for which there exist functions π ∈ G and ω ∈ H so that for any x ∈ Rⁿ the integral ∫_{p∈F} ∫_{q∈Ω} π(x, q) · ω(q, p) dµ dµ′ has a value ≥ 1/2 + ρ if f(x) = 1, and otherwise a value ≤ 1/2 − ρ. Because of our assumption about the function class G, one can (as in the
proof of theorem 1) superimpose on the space Ω a finite grid G such that for any π, π̃ ∈ G and x, x̃ ∈ Rⁿ: |π(x, q) − π̃(x̃, q)| ≤ ρ/5µ(Ω) for all q ∈ G implies that ∫_{q∈Ω} |π(x, q) − π̃(x̃, q)| dµ < ρ/2. The size |G| of the grid (that is, the number of grid points) depends in general on the reliability parameter ρ, the common moduli of continuity of the functions in G, and the volume and shape of the state-space Ω.

Theorem 3. Let G, H, and F be function classes as specified above, and assume in addition that the class G has finite pseudo-dimension Δ. Then one can give a finite upper bound for the VC-dimension of F in terms of ρ, B, |G|, Δ, and µ(Ω). Obviously this bound does not depend on the complexity of the function class H (except via parameters related to the state set Ω).

Proof. Let S ⊆ Rⁿ be some arbitrary finite set shattered by F. For any subset A ⊆ S we fix functions π_A ∈ G and ω_A ∈ H so that for any x ∈ S the integral ∫_{p∈F} ∫_{q∈Ω} π_A(x, q) · ω_A(q, p) dµ dµ′ has a value ≥ 1/2 + ρ if x ∈ A, and else a value ≤ 1/2 − ρ. We write G_S^* for the class of all functions π_A ∈ G for A ⊆ S, and G_S for the class of restrictions of these functions to the finite domain S × G. We also consider, for γ := ρ/10µ(Ω) and any class A of functions with range R⁺, the class A^γ of all "γ-discretizations" g^γ of functions g ∈ A, where

g^γ(z) := ⌊g(z)/γ⌋ for any z in the domain of g.
In particular, for the class G_S the functions π_A^γ ∈ G_S^γ map S × G into {0, . . . , b − 1} for b := ⌊B/γ⌋ + 1. Note that by our assumptions on G, for any π, π̃ ∈ G and any x, x̃ ∈ S, the condition ∀q ∈ G (|π^γ(x, q) − π̃^γ(x̃, q)| ≤ 1) implies that ∫_Ω |π(x, q) − π̃(x̃, q)| dµ < ρ/2.

One can get an upper bound for the complexity of G_S^* by applying to G_S^γ a generalization of Sauer's lemma due to Alon, Cesa-Bianchi, Ben-David, and Haussler (1993). Given integers m, b, and Δ, define β(m, b, Δ) := log₂ ∑_{i=1}^{Δ} (m choose i) b^i. Lemma 15 of Alon et al. (1993) states that if A^γ is any class of functions obtained as the discretizations of the functions in a class A of pseudo-dimension Δ, such that the functions in A^γ have a domain D of size m and range {0, . . . , b − 1}, then A^γ must contain an "L∞ 2-cover" B^γ ⊆ A^γ of size at most |B^γ| ≤ 2 · (mb²)^{β(m,b,Δ)}. That is, for every f ∈ A^γ there is some f̃ ∈ B^γ such that |f(z) − f̃(z)| < 2 (and hence ≤ 1) for every z ∈ D. (The result holds for general values of γ, Δ, m, and b.)

Applied to our context (with A := G_S), this result implies that there exists a set G* ⊆ G_S^* whose cardinality can be bounded in terms of the pseudo-dimension Δ of G as

|G*| ≤ 2 · (|S| · |G| · b²)^{β(|S|·|G|, b, Δ)},  (5.2)
such that for every π ∈ G_S^* there exists some π̃ ∈ G* with |π^γ(x, q) − π̃^γ(x, q)| ≤ 1 for all x ∈ S and all q ∈ G.

With the help of the 2-cover of G_S^γ induced by G*, we can now show that the cardinality |S| of the shattered set S can be bounded through the inequality

2^{|S|} ≤ 2 · (|S| · |G| · b²)^{β(|S|·|G|, b, Δ)} · 2^{b^{|G|}}.  (5.3)
It is obvious that this inequality yields an upper bound for |S| that does not depend on the complexity of the function class H (except for parameters related to Ω).

Let us consider for each ω ∈ H the discrete map ω̂ : {0, . . . , b − 1}^G → {0, 1} induced by ω through the following definition: ω̂(π̂) has value 0 for π̂ ∈ {0, . . . , b − 1}^G if there exist some π ∈ G and x ∈ S with |π̂(q) − π^γ(x, q)| ≤ 1 for all q ∈ G and ∫_{p∈F} ∫_{q∈Ω} π(x, q) · ω(q, p) dµ dµ′ ≤ 1/2 − ρ. Else we set ω̂(π̂) = 1. Since we have the upper bound (see equation 5.2) on the size of the cover G*, and there exist at most 2^{b^{|G|}} different functions ω̂, it suffices for proving equation 5.3 to show that the following claim holds.

Claim. Let A₁, A₂ ⊆ S. If some function π̃ ∈ G* covers both π_{A₁} and π_{A₂}, in the sense that |π̃^γ(x, q) − π_{A₁}^γ(x, q)| ≤ 1 and |π̃^γ(x, q) − π_{A₂}^γ(x, q)| ≤ 1 for all x ∈ S and all q ∈ G, and moreover ω̂_{A₁} = ω̂_{A₂}, then A₁ = A₂.

In order to prove this claim, let us assume that A₁ ≠ A₂, but both π_{A₁} and π_{A₂} are covered by the same function π̃ ∈ G*. We shall show that ω̂_{A₁} ≠ ω̂_{A₂}. Fix some x₀ ∈ S so that either x₀ ∈ A₁ − A₂ or x₀ ∈ A₂ − A₁. Without loss of generality, we may assume that x₀ ∈ A₁ − A₂. Let π̂ : G → {0, . . . , b − 1} be defined by π̂(q) = π̃^γ(x₀, q). Then we have ω̂_{A₂}(π̂) = 0, since by assumption |π̃^γ(x₀, q) − π_{A₂}^γ(x₀, q)| ≤ 1 for all q ∈ G and

∫_{p∈F} ∫_{q∈Ω} π_{A₂}(x₀, q) · ω_{A₂}(q, p) dµ dµ′ ≤ 1/2 − ρ.
Assume for a contradiction that also ω̂_{A₁}(π̂) = 0 for this function π̂. This implies that there exist some π ∈ G and some x₁ ∈ S with

∫_{p∈F} ∫_{q∈Ω} π(x₁, q) · ω_{A₁}(q, p) dµ dµ′ ≤ 1/2 − ρ  (5.4)

and |π̂(q) − π^γ(x₁, q)| ≤ 1 for all q ∈ G. The latter implies, by our choice of G and γ and the definition of π̂, that

∫_{q∈Ω} |π̃(x₀, q) − π(x₁, q)| dµ < ρ/2.  (5.5)
On the other hand, the assumptions on π̃ ∈ G* imply that |π̃^γ(x₀, q) − π_{A₁}^γ(x₀, q)| ≤ 1 for all q ∈ G, hence

∫_{q∈Ω} |π̃(x₀, q) − π_{A₁}(x₀, q)| dµ < ρ/2.  (5.6)

Furthermore, since x₀ ∈ A₁, we have by choice of ω_{A₁} that

∫_{p∈F} ∫_{q∈Ω} π_{A₁}(x₀, q) · ω_{A₁}(q, p) dµ dµ′ ≥ 1/2 + ρ.  (5.7)

The inequalities 5.5 and 5.6 imply that

∫_{q∈Ω} |π(x₁, q) − π_{A₁}(x₀, q)| dµ < ρ.
This inequality yields, in combination with equations 5.4 and 5.7, the contradiction

ρ ≤ | ∫_{p∈F} ∫_{q∈Ω} π(x₁, q) · ω_{A₁}(q, p) dµ dµ′ − ∫_{p∈F} ∫_{q∈Ω} π_{A₁}(x₀, q) · ω_{A₁}(q, p) dµ dµ′ |
  ≤ ∫_{p∈F} ∫_{q∈Ω} |π(x₁, q) − π_{A₁}(x₀, q)| · ω_{A₁}(q, p) dµ dµ′
  = ∫_{q∈Ω} |π(x₁, q) − π_{A₁}(x₀, q)| · ( ∫_{p∈F} ω_{A₁}(q, p) dµ′ ) dµ
  ≤ ∫_{q∈Ω} |π(x₁, q) − π_{A₁}(x₀, q)| dµ
  < ρ.
This contradiction implies that ω̂_{A₁}(π̂) = 1, hence ω̂_{A₁} ≠ ω̂_{A₂}. Thus we have verified the preceding claim, and the proof of theorem 3 is now complete.

Remark. It follows from Alon et al. (1993) that instead of a finite upper bound for the pseudo-dimension of G, it suffices for theorem 3 to assume a finite upper bound for the γ-dimension P_γ-dim(G) of G for γ = ρ/20µ(Ω).

Corollary 3. There exists a finite upper bound for the VC-dimension of layered feedforward sigmoidal neural nets and feedforward networks of spiking neurons with piecewise equicontinuous analog noise (for arbitrary real-valued inputs, boolean output computed with some arbitrary reliability ρ > 0, and arbitrary real-valued "programmable parameters") that does not depend on the size or structure of the network beyond its first hidden layer.
Proof. We first consider, for some arbitrary given parameters n, d, d′ ∈ N, the class N of all layered feedforward sigmoidal neural nets with n input nodes, d units on their first hidden layer, and d′ output nodes. Thus, the nets in N may have arbitrary numbers of layers and gates and arbitrary real-valued weight assignments. We assume that the d gates on the first layer are affected by some piecewise equicontinuous noise process with density kernel z : Ω² → R⁺, where Ω := [−1, 1]^d. Let F : Rⁿ × R^m → Ω be the function whose value F(x, w) is the vector of outputs of the d first hidden-layer units (without noise), for arbitrary network inputs x ∈ Rⁿ and arbitrary assignments w ∈ R^m to the weights and biases of these units. We take as the class G of functions π considered in the proof of theorem 3 all functions of the form π(x, q) = z(F(x, w), q) for arbitrary parameters w ∈ R^m.

The results presented in Karpinski and Macintyre (1997) imply that the pseudo-dimension of this class G of functions is bounded by a polynomial in m, for all common choices of activation functions of the sigmoidal units and all practically relevant density kernels z for the noise process (even those involving the exponential function). In the case where the activation functions and density kernels are piecewise polynomial, one can apply the results of Goldberg and Jerrum (1995) to get a slightly better finite upper bound for the pseudo-dimension of G.

We define, for Ω = [−1, 1]^d and Ω′ = [−1, 1]^{d′}, the class H as the class of all density kernels ω : Ω × Ω′ → R⁺ that describe the computations of the remaining layers of networks in N with arbitrary noise processes (and arbitrary real-valued weights). It follows from theorem 3 that the finite VC-dimension bound obtained for the class F of functions computed with reliability ρ > 0 by networks in the class N does not depend on the complexity of the function class H, and hence not on the number of layers, the number of units beyond the first layer, or the noise process on later layers of these networks.

In the case of a network N of noisy spiking neurons, the programmable parameters consist of the "weights" of synapses, time delays for postsynaptic potentials, and parameters that determine other aspects of the functional form of response functions (i.e., postsynaptic potentials) and threshold functions. The pseudo-dimension of the class G that arises when one applies (as described in section 2) the framework considered here to the first layer of a network N of noisy spiking neurons can be bounded with the help of the same tools as for the case of sigmoidal neural nets.

Corollary 4. There exists a finite upper bound for the VC-dimension of recurrent sigmoidal neural nets and networks of spiking neurons with analog noise (for arbitrary real-valued inputs, boolean output computed with some arbitrary reliability ρ > 0, and arbitrary real-valued "programmable parameters") that does not depend on the computation time of the network, even if the computation time is allowed to vary for different inputs.
Proof. One proceeds in the same manner as for the proof of corollary 3, except that G now consists of the class of all state distributions that arise from the first computation step of the total network, and H consists of all possible state transformations that can arise from the rest of the computations of the same network.
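To make the counting behind theorem 3 tangible, here is a small sketch (Python; illustrative only) of the γ-discretization g^γ(z) = ⌊g(z)/γ⌋ and of the exponent β(m, b, Δ) from the cover bound of Alon et al. (1993):

    from math import comb, floor, log2

    def gamma_discretize(g, gamma):
        # The gamma-discretization of g: z -> floor(g(z) / gamma).
        return lambda z: floor(g(z) / gamma)

    def beta(m, b, delta):
        # beta(m, b, Delta) = log2( sum_{i=1}^{Delta} C(m, i) * b**i ),
        # the exponent in the cover-size bound |B| <= 2 * (m * b**2) ** beta.
        return log2(sum(comb(m, i) * b ** i for i in range(1, delta + 1)))

Plugging m = |S| · |G| into beta reproduces the exponent on the right-hand side of inequality 5.3, from which the finite bound on |S| follows.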
6 Conclusions

We have introduced a new framework for the analysis of analog noise in discrete-time analog computations that is better suited for real-world applications and more flexible than previous models. In contrast to preceding models, it also covers important concrete cases such as analog neural nets with a gaussian distribution of noise on analog gate outputs, noisy computations with less than perfect reliability, and computations in networks of noisy spiking neurons. Furthermore, we have introduced adequate mathematical tools for analyzing the effect of analog noise in this new framework. These tools differ quite strongly from those that have been used previously for the investigation of noisy computations. We have shown that they provide new bounds for the computational power and VC-dimension of analog neural nets and networks of spiking neurons in the presence of analog noise.

Finally, our model for noisy analog computations can also be applied to completely different types of models for discrete-time analog computation than neural nets, such as arithmetical circuits (Turán and Vatan, 1994), the random access machine with analog inputs, the parallel random access machine with analog inputs, various computational discrete-time dynamical systems (Moore, 1990; Koiran, Cosnard, & Garzon, 1994; Asarin & Maler, 1994; Orponen & Matamala, 1996), and (with some minor adjustments) also the BSS model (Blum, Shub, & Smale, 1989; Koiran, 1993). Our framework provides for each of these models an adequate definition of noise-robust computation in the presence of analog noise, and our results provide upper bounds for their computational power and VC-dimension in terms of the characteristics of their analog noise.

Acknowledgments

We thank Peter Auer for helpful conversations. Part of this work was done while P. O. was at the University of Helsinki and during visits to the Technische Universität Graz and the University of Chile in Santiago. This work was supported by the "Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austrian Science Fund," project number P12153.
References

Alon, N., Dewdney, A. K., & Ott, T. J. (1991). Efficient simulation of finite automata by neural nets. J. Assoc. Comput. Mach., 38, 495–514.
Alon, N., Cesa-Bianchi, N., Ben-David, S., & Haussler, D. (1993). Scale-sensitive dimensions, uniform convergence, and learnability. In Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science (pp. 292–301). New York: IEEE Computer Science Press.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1988). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. In J. A. Anderson & E. Rosenfeld (Eds.), Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
Asarin, E., & Maler, O. (1994). On some relations between dynamical systems and transition systems. In Proceedings of the 21st International Colloquium on Automata, Languages, and Programming (pp. 59–72). Lecture Notes in Computer Science 820. Berlin: Springer-Verlag.
Blum, L., Shub, M., & Smale, S. (1989). On a theory of computation over the real numbers: NP-completeness, recursive functions and universal machines. Bulletin of the Amer. Math. Soc., 21, 1–46.
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36, 929–965.
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8, 1135–1178.
Cowan, J. D. (1966). Synthesis of reliable automata from unreliable components. In E. R. Caianiello (Ed.), Automata theory (pp. 131–145). New York: Academic Press.
Feller, W. (1971). An introduction to probability theory and its applications (2nd ed.). New York: Wiley.
Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1996). Representation of finite state automata in recurrent radial basis function networks. Machine Learning, 23, 5–32.
Gál, A. (1991). Lower bounds for the complexity of reliable boolean circuits with noisy gates. In Proceedings of the 32nd Annual IEEE Symposium on Foundations of Computer Science (pp. 594–601). New York: IEEE Computer Science Press.
Gerstner, W., & van Hemmen, J. L. (1994). How to describe neuronal activity: Spikes, rates or assemblies? In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 463–470). San Mateo, CA: Morgan Kaufmann.
Goldberg, P. W., & Jerrum, M. R. (1995). Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18, 131–148.
Haussler, D. (1992). Decision theoretic generalizations of the PAC-model for neural nets and other learning applications. Information and Computation, 100, 78–150.
Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.
Horne, B. G., & Hush, D. R. (1996). Bounds on the complexity of recurrent neural network implementations of finite state machines. Neural Networks, 9, 243–252.
Indyk, P. (1995). Optimal simulation of automata by neural nets. In Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science (pp. 337–347). Lecture Notes in Computer Science 900. Berlin: Springer-Verlag.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC-dimension of sigmoidal and general Pfaffian neural networks. J. Computer and System Sciences, 54, 169–179.
Kifer, Y. (1988). Random perturbations of dynamical systems. Boston: Birkhäuser.
Koiran, P. (1993). A weak version of the Blum, Shub and Smale model. In Proceedings of the 34th Annual IEEE Symposium on Foundations of Computer Science (pp. 486–495). New York: IEEE Computer Science Press.
Koiran, P., Cosnard, M., & Garzon, M. (1994). Computability with low-dimensional dynamical systems. Theoret. Comput. Sci., 132, 113–128.
Maass, W. (1995). Vapnik-Chervonenkis dimension of neural nets. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 1000–1003). Cambridge, MA: MIT Press.
Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural Computation, 8, 1–40.
Maass, W. (1997). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304.
Minsky, M. L. (1972). Computation: Finite and infinite machines. Englewood Cliffs, NJ: Prentice Hall.
Moore, C. (1990). Unpredictability and undecidability in physical systems. Phys. Review Letters, 64, 2354–2357.
Omlin, C. W., & Giles, C. L. (1996). Constructing deterministic finite-state automata in recurrent neural networks. J. Assoc. Comput. Mach., 43, 937–972.
Orponen, P., & Matamala, M. (1996). Universal computation by finite two-dimensional coupled map lattices. In Proceedings of the Workshop on Physics and Computation, PhysComp'96 (pp. 243–247). Boston: New England Complex Systems Institute.
Phatak, D. S., & Koren, I. (1995). Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6, 446–456.
Pippenger, N. (1989). Invariance of complexity measures for networks with unreliable gates. J. Assoc. Comput. Mach., 36, 531–539.
Rabin, M. (1963). Probabilistic automata. Information and Control, 6, 230–245.
Siegelmann, H. T. (1994). On the computational power of probabilistic and faulty networks. In Proceedings of the 21st International Colloquium on Automata, Languages, and Programming (pp. 23–34). Lecture Notes in Computer Science 820. Berlin: Springer-Verlag.
Siegelmann, H. T., & Sontag, E. D. (1991). Turing computability with neural nets. Appl. Math. Letters, 4(6), 77–80.
Turán, G., & Vatan, F. (1994). On the computation of boolean functions by analog circuits of bounded fan-in. In Proceedings of the 35th Annual IEEE Symposium on Foundations of Computer Science (pp. 553–564). New York: IEEE Computer Science Press.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
von Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In C. E. Shannon & J. McCarthy (Eds.), Automata studies (pp. 329–378). Annals of Mathematics Studies 34. Princeton, NJ: Princeton University Press.

Received October 3, 1996; accepted June 3, 1997.
LETTER
Communicated by Sue Becker
Category Learning Through Multimodality Sensing Virginia R. de Sa Sloan Center for Theoretical Neurobiology, University of California at San Francisco, San Francisco, CA 94143-0444, U.S.A.
Dana H. Ballard Department of Computer Science, University of Rochester, Rochester, NY 14627-0226, U.S.A.
Humans and other animals learn to form complex categories without receiving a target output, or teaching signal, with each input pattern. In contrast, most computer algorithms that emulate such performance assume the brain is provided with the correct output at the neuronal level or require grossly unphysiological methods of information propagation. Natural environments do not contain explicit labeling signals, but they do contain important information in the form of temporal correlations between sensations to different sensory modalities, and humans are affected by this correlational structure (Howells, 1944; McGurk & MacDonald, 1976; MacDonald & McGurk, 1978; Zellner & Kautz, 1990; Durgin & Proffitt, 1996). In this article we describe a simple, unsupervised neural network algorithm that also uses this natural structure. Using only the co-occurring patterns of lip motion and sound signals from a human speaker, the network learns separate visual and auditory speech classifiers that perform comparably to supervised networks.

Neural Computation 10, 1097–1117 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

The ability of humans to form complex categories without explicit supervision has challenged modelers. On the one hand, classification is simpler if more dimensions are available to separate the classes. For example, categorizing cows and horses is simpler if one can also make use of auditory features in addition to visual features. On the other hand, simple clustering of multimodality patterns would prevent adequate performance in the individual modalities, and appropriate density modeling techniques rapidly become infeasible in high dimensions. Also, it is well known that the cerebral cortex competently classifies unimodal stimuli while keeping the different modalities largely separate. Inspired by this, we describe an algorithm that avoids the intractable task of modeling cross-modal associations but uses this useful structure to derive its own internal target signals for classifiers in the individual modalities. The algorithm uses natural and neurophysiologically
plausible one-way connections for information transmission, which distinguishes this approach from backpropagation (Rumelhart, Hinton, & Williams, 1986) and also from the unsupervised model of Becker and Hinton (Becker & Hinton, 1992; Becker, 1996). More biologically plausible implementations of the information-theoretic approach are given in Phillips, Kay, and Smyth (1995) and Kay, Floreano, and Phillips (1998) but have not been demonstrated on real problems with overlapping classes. The idea behind the algorithm is to minimize the disagreement between the output decisions of two or more classifiers receiving different forms of input from the same source (see Figure 1). The key insight is that this can be done without directly connecting all the hidden units to each other and without requiring implausible communication of error signals backward along forward connections.

1.1 Classification. A general way of representing sensory inputs is in terms of n-dimensional points, or vectors, groups of which can be represented by prototypes, or codebook vectors. A simple classification border between two such codebook vectors representing different classes is the (n − 1)-dimensional hyperplane midway between them. With more codebook vectors representing several classes, nonlinear boundaries may be devised by taking the border from the Voronoi tessellation of these prototype points. Each codebook vector is assigned a class label, and patterns are classified as belonging to the class of the closest codebook vector. Class boundaries are then the edges of the Voronoi tessellation that separate codebook vectors of different classes. In learning algorithms, classification borders are moved indirectly by moving the codebook vectors. Competitive learning (Grossberg, 1976a, b; Kohonen, 1982; Rumelhart & Zipser, 1986) is an unsupervised, biologically plausible (Coultrip, Granger, & Lynch, 1992; Miikkulainen, 1991) way of achieving this for easily separable data clusters, but it performs poorly on complicated clusters that are either not well separated or not well approximated by circularly symmetric distributions.

More difficult categorization problems can be handled if the correct class of each pattern is known during training. The supervised LVQ2.1 algorithm (Kohonen, 1990) monitors and reduces the number of currently misclassified patterns (but see de Sa & Ballard, 1993a; Diamantini & Spalvieri, 1995). It can be described informally as: If the pattern is near a current border, move the codebook vector of the same class toward the pattern and that of the other class away. The resulting border movement increases the chances of an incorrectly classified pattern being correctly classified on a subsequent trial. When the labels of the sample patterns are given, the supervised goal (assuming equal costs) is to minimize the probability of misclassified patterns
Figure 1: Making use of natural correlational structure. The network learns from the structure inherent in the coherence between the visual and auditory signals. The feature vectors for classification were taken from 10 video frames and 100 msec of auditory signal. In the network, an arrow between units denotes full connectivity between these banks of units. The variables w, u, and v represent matrices that store the connection weights. The variables Xi, Yi, Z store the activity vectors of the various layers. The modalities minimize their disagreement by teaching or "driving" each other. The numbered arrows in the network show the order of propagation of activation when the auditory modality is driving the visual one. In the driving hidden layer, the hidden unit with the closest weight vector (codebook vector) gets activity 1 (and all others in the layer 0). This 1-of-n activity pattern then serves as the input pattern for the output units, and the output unit with the closest weight vector to this activation vector receives activity 1 (and all other output units 0). The two closest weight vectors (to their input pattern) in the driven hidden layer get a forward activity component of 1. The backward weights u are then used to supplement the activity of those driven units receiving forward activation. Due to the binary form of the backward weights u, these driven units will have activities of 0, 1, or 2. The inset shows the update direction for the driven codebook vector weights w as a function of their unit's activity level. Positive ordinates represent movement of the weights toward their inputs, and negative values represent movement away. The actual update equations are given in equation 3.1.
for each modality. The goal for each modality is to minimize the number of patterns from each class that fall into Voronoi regions of codebook vectors with other labels. For example, where P(Ci) is the a priori probability for class i and p(xj|Ci) is the conditional density of the data in modality j for class i, the goal for modality 1 is to minimize (Diamantini & Spalvieri, 1995)
E(\{w_{1i}\}) = \sum_k \sum_i \bigl(1 - \delta(L(w_{1i}), C_k)\bigr) P(C_k) \int_{V_{1i}} p(x_1 \mid C_k)\, dx_1. \qquad (1.1)
Here i is an index over codebook vectors, L(w_{1i}) gives the label of codebook vector w_{1i}, V_{1i} represents the Voronoi region around w_{1i} (the volume closer to w_{1i} than to any other codebook vector), and δ(a, b) is defined to be 1 when a = b and 0 otherwise. The goal for modality 2 is analogous.

1.2 Cross-Modal Structure. The formulation expressed by equation 1.1 is explicitly supervised in that the estimation of the conditional probabilities depends explicitly on class information. That is, in order to estimate the term p(x|C_k), it is necessary to know which patterns are from class k. An unsupervised error function must depend only on the whole pattern distribution p(x_1, x_2) = \sum_k P(C_k) p(x_1, x_2 \mid C_k). A glance at Figure 2 shows that the structure in the joint feature space is often more informative than that available to either of the individual modalities. One solution this suggests is to perform unsupervised clustering or density estimation in the joint space. However, simple k-means clustering, or competitive learning, in the full joint space would require that future patterns for classification contain all feature dimensions; they are not able to marginalize over the missing dimensions. The problem is that while we would like to learn from a joint cross-modal space, we would like, after learning, to be able to act on sensory information from a single modality. Density modeling methods do this and can handle missing features on classification, but they require fitting many parameters, and this is infeasible in high-dimensional spaces.

An architecture that circumvents these problems is shown in Figure 1. The key organizational feature is that each modality has its own processing stream (or classification network) but access to the other's output at a high level. One way to make use of the cross-modality structure in a network like this is to cluster the codebook vectors (in their individual spaces) but use the joint structure to learn the labels of these codebook vectors. This is a two-stage clustering algorithm. First, the input patterns in each modality are clustered using a competitive learning network. After this, the pattern of activation across the output units of the competitive learning networks (hidden units in Figure 1) can be considered new input patterns for another
Figure 2: A low-dimensional, two-class example. The graphs on the right side represent the probability distributions of patterns to the first modality; those on the left give the same for the second modality. (A) In the supervised case, the individual density functions associated with each class can be estimated and the appropriate border (at the crossing point of the distributions) found. The darker, left-most curve within each graph represents the distribution of the patterns from class CA. b1 and b2 represent example (but not optimal) classification borders in their respective modalities. (B) In the absence of class information, the computed density function is the sum of the individual class densities. The appropriate border may not be obvious (see, for example, modality 1's density on the right). (C) The higher-dimensional joint distribution p(x_1, x_2) = P(C_A)p(x_1, x_2|C_A) + P(C_B)p(x_1, x_2|C_B) has greater structure and is used to guide the placement of the borders in the individual modalities. The example shows a case where the two variables are conditionally independent, but this is not required.
level of clustering.¹ By assigning labels to the output units of this second clustering stage, each codebook vector is labeled indirectly by the label of the output unit in whose cluster it belongs. This will give the same label to codebook vectors from the different modalities that tend to co-occur. While this approach is useful, and we use it for initialization (step 2 in the algorithm description in section 2), the constraining structure in the joint distribution can be used more powerfully if it is used for better placement of the codebook vectors themselves.

¹ While eventually the hidden-layer activation patterns will be of the 1-of-n or winner-take-all form, for this learning stage we activate the k closest units and anneal k to 1 throughout learning.

2 Minimizing Disagreement

The core idea of our codebook placement algorithm is to minimize the disagreement error: the fraction of patterns classified differently by the two networks. The idea is that two modalities, representing different but co-occurring information from the same source, teach each other by finding a local minimum in their output disagreement. This section develops the derivation of the rules for moving the codebook vectors to minimize this error measure. The next shows that one can appropriately move the codebook vectors without directly connecting the codebook vectors to each other and without requiring neurobiologically implausible propagation of information as required in algorithms using backpropagation (Rumelhart et al., 1986) of error signals. The disagreement error can be written in terms of the codebook vectors as:

E(\{w_{1i}\}, \{w_{2j}\}) = \sum_i \sum_j \bigl(1 - \delta(L(w_{1i}), L(w_{2j}))\bigr) \Pr\{x_1 \in V_{1i},\; x_2 \in V_{2j}\}. \qquad (2.1)

That is,

E(\{w_{1i}\}, \{w_{2j}\}) = \sum_k \sum_i \sum_j \bigl(1 - \delta(L(w_{1i}), L(w_{2j}))\bigr) P(C_k) \int_{V_{1i}} \int_{V_{2j}} p(x_1, x_2 \mid C_k)\, dx_1\, dx_2. \qquad (2.2)
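Here is a minimal sketch (all function and variable names are our own hypothetical choices; NumPy assumed) of how this error can be estimated from a sample of co-occurring pattern pairs, using the nearest-codebook classification of section 1.1:

    import numpy as np

    def nearest_label(x, codebooks, labels):
        """Label of the codebook vector closest to pattern x."""
        dists = np.linalg.norm(codebooks - x, axis=1)
        return labels[np.argmin(dists)]

    def disagreement_error(X1, X2, cb1, lab1, cb2, lab2):
        """Fraction of co-occurring pattern pairs (X1[n], X2[n]) that the
        two modality classifiers assign to different classes: a sampled
        estimate of equations 2.1/2.2, needing no class labels."""
        disagree = sum(
            nearest_label(x1, cb1, lab1) != nearest_label(x2, cb2, lab2)
            for x1, x2 in zip(X1, X2))
        return disagree / len(X1)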
Note that equation 2.2 does not depend on the class information but only on the joint density of all inputs (over all classes) and thus can be sampled without labels. It is, however, in the same form as the supervised equation (1.1) dealt with in Diamantini and Spalvieri (1995). Differentiating
this equation after their treatment of the supervised version gives

\frac{\partial E}{\partial w_{1p}} = \sum_k \sum_{i,\, i \neq p} \sum_j \frac{\delta(L(w_{1p}), L(w_{2j})) - \delta(L(w_{1i}), L(w_{2j}))}{\|w_{1i} - w_{1p}\|} \times P(C_k) \int_{S_{1i,p}} \int_{V_{2j}} (w_{1p} - x_1)\, p(x_1, x_2 \mid C_k)\, dx_2\, dx_1, \qquad (2.3)

where S_{1i,p} is the boundary surface between V_{1i} and V_{1p}. Similarly,

\frac{\partial E}{\partial w_{2p}} = \sum_k \sum_{i,\, i \neq p} \sum_j \frac{\delta(L(w_{2p}), L(w_{1j})) - \delta(L(w_{2i}), L(w_{1j}))}{\|w_{2i} - w_{2p}\|} \times P(C_k) \int_{S_{2i,p}} \int_{V_{1j}} (w_{2p} - x_2)\, p(x_1, x_2 \mid C_k)\, dx_1\, dx_2. \qquad (2.4)

Using uniform Parzen windows as in Wassel and Sklansky (1972), Sklansky and Wassel (1981), and Diamantini and Spalvieri (1995) to approximate the probability distributions in equations 2.3 and 2.4 and considering, for each data sample X_1(n), X_2(n), only the two nearest codebook vectors in each modality give a particularly simple stochastic estimate of the derivatives. If \|X_1 - X_{1p,q}\| \leq \Delta(t)/2, where X_{1p,q} is the projection of modality 1's sample point on the border between its closest codebook vectors w_{1p} and w_{1q} (and w_{1p} belongs to the same class as the modality 2 codebook vector closest to X_2, and w_{1q} belongs to another class), then a sample estimate of the derivative in equation 2.3 for the codebook vector w_{1p} is

-\frac{X_{1p,q} - w_{1p}}{\Delta(t)\, \|w_{1q} - w_{1p}\|}. \qquad (2.5)

This gives the following simple stochastic approximation (Robbins & Monro, 1951) update rules for modality 1. If \|X_1 - X_{1p,q}\| \leq \Delta(t)/2,

w_{1p}(n+1) = w_{1p}(n) + \epsilon(t)\, \frac{X_{1p,q} - w_{1p}(n)}{\Delta(t)\, \|w_{1q}(n) - w_{1p}(n)\|}, \qquad (2.6)

w_{1q}(n+1) = w_{1q}(n) - \epsilon(t)\, \frac{X_{1p,q} - w_{1q}(n)}{\Delta(t)\, \|w_{1p}(n) - w_{1q}(n)\|}. \qquad (2.7)

If the pattern falls outside a window of width \Delta(t) about the current border of the class output by the other modality, no changes are made. For all other indices, no changes in the codebook vector weights are made. The rules for updating the codebook vectors in modality 2 are exactly analogous.
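For concreteness, the following sketch applies one step of equations 2.6 and 2.7, assuming w1p is the nearest codebook vector whose label agrees with modality 2's output and w1q the nearest one with a different label. The names are hypothetical, and computing the border projection from the perpendicular-bisector geometry of the two codebook vectors is our own (standard) choice:

    import numpy as np

    def md_update(X1, w1p, w1q, eps, delta):
        """One minimizing-disagreement step for modality 1 (eqs. 2.6, 2.7).
        eps and delta play the roles of epsilon(t) and Delta(t)."""
        u = (w1q - w1p) / np.linalg.norm(w1q - w1p)   # unit normal of the border
        m = (w1p + w1q) / 2.0                         # midpoint lies on the border
        offset = np.dot(X1 - m, u)                    # signed distance to border
        if abs(offset) > delta / 2.0:                 # outside window: no change
            return w1p, w1q
        X_proj = X1 - offset * u                      # projection X1p,q onto border
        denom = delta * np.linalg.norm(w1q - w1p)
        w1p_new = w1p + eps * (X_proj - w1p) / denom  # attract the agreeing vector
        w1q_new = w1q - eps * (X_proj - w1q) / denom  # repel the disagreeing vector
        return w1p_new, w1q_new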
These rules amount to: If the pattern received by a modality is close to a current border, move the codebook vector of the class that is output by the other modality toward the pattern and that of the other class away. This rule moves the borders to increase the local area assigned to the class output by the other modality. The minimizing-disagreement (M-D) algorithm applies this rule after each presentation of multimodal stimuli; it can be summarized as follows:

1. Initialize codebook vectors in each modality (unsupervised clustering).

2. Initialize codebook vectors' labels using unsupervised clustering of the activity patterns across the codebook vector units (as described at the end of section 1).

3. Repeat for each presentation of input patterns X_1(n) and X_2(n) to their respective modalities:

• Find the two nearest codebook vectors in each modality to their respective input patterns.

• Find the hypothesized output class in each modality (as given by the label of the closest codebook vector).

• For each modality, update the codebook vectors according to the rule above.

• Update the labels (described below).

The algorithm is moderately sensitive to the initial labeling, so improved results are often obtained by repeating steps 2 and 3 with the codebook vectors resulting from one cycle through the algorithm. Because the algorithm results in codebook vectors that better distinguish between classes, they tend to be easier to label appropriately in the initial labeling stage, which often leads to better performance after the third stage. The appendix and Figure 7 show more quantitatively how minimizing the disagreement is related to the classification goal of minimizing the number of misclassified patterns.

The mapping of E with respect to the labels ({L(w_i)}) is not continuous and thus not differentiable. However, to minimize E with respect to the labels (last point of step 3 above), one should assign the label for w_{1i} to be the label that labels the most co-occurring patterns in the other modality. If we define the mapping W_1(L_1) to be the set of codebook vectors in modality 1 for which L(w_{1i}) = L_1 and let

\mathrm{colabel}_l(w_{1i}) = \sum_{w_{2j} \in W_2(L_l)} \sum_k P(C_k) \int_{V_{1i}} \int_{V_{2j}} p(x_1, x_2 \mid C_k)\, dx_1\, dx_2,
then L(w_{1i}) = \mathrm{argmax}_l(\{\mathrm{colabel}_l(w_{1i})\}). We use an online algorithm for this. Letting v_{1l,i^*} be the weight from codebook vector w_{1i^*} (the winning codebook vector in modality 1) to output unit l (the winning label picked by modality 2),

v_{1l,i^*}(n+1) = v_{1l,i^*}(n) + \alpha(n),

where the weights coming into each output unit are kept normalized:

v_{1l,i} = \frac{v_{1l,i}}{\|v_{1l}\|} \quad \forall\, i, l.
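In code, this online label update is a one-line Hebbian-style increment followed by the row normalization above; the sketch assumes NumPy, and the names are hypothetical:

    import numpy as np

    def update_labels(v1, winner_i, class_l, alpha):
        """Strengthen the weight from the winning codebook unit (winner_i)
        in modality 1 to the output unit class_l chosen by modality 2, then
        keep the incoming weights of every output unit normalized.
        v1 has shape (num_classes, num_codebook_units), nonzero rows assumed."""
        v1[class_l, winner_i] += alpha
        v1 /= np.linalg.norm(v1, axis=1, keepdims=True)  # keep ||v_{1l}|| = 1
        return v1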
This normalization means that the algorithm is not minimizing the disagreement with respect to the output weights but instead clustering the hidden unit representation using the output class given by the other modality. This objective is better for these weights as it balances the goal of agreement with the desire to avoid the trivial solution of all codebook vectors having the same label. Other forms of extra terms to force the output units to output different classes across the pattern set could also be used. This is analogous to the individual entropy terms H(Y1), H(Y2) in the IMAX algorithm (Becker & Hinton, 1992), which force the output units Yi in each modality (i = 1, 2) to make use of the output space, preventing the trivial solution of both modalities outputting a constant. We could also modify the energy function that the codebook vectors are following to prevent the hidden units from coding the input space as one class. However, due to the existence of many reasonable local minima, extra terms were not necessary in the data sets we have encountered, and our foray into adding them with our original data set yielded slightly worse performance (probably because the addition of the terms changes the position of the local minimum; it is no longer minimizing the disagreement). However, it is possible that for problems with more overlap between classes, terms like this might help if the current algorithm does not perform well.

Figure 3 illustrates for an easily visualized two-class problem how, despite the existence of the undesirable global minima in disagreement, for enough segregation in the joint space a local minimum exists between the two classes. An initial border determined by most simple clustering algorithms would start within the basin of attraction of this minimum. The figure shows that an appropriate local minimum exists beyond the case where clusters could be separated given the individual modalities alone, but just short of what could be achieved if one could look for clusters in the joint space. The algorithm is able to extract most of the greater structure in the higher-dimensional joint distribution without requiring the extra parameters for modeling in this large space.
Figure 3: The M-D energy function. (A) The joint probability density for a two-class low-dimensional problem. The two modalities are conditionally independent, with the individual class distributions in both modalities normally distributed with standard deviation .5 (and means of −.5, .5). The dark curve shows the value of the joint distribution for x1 = x2. For this case, this is the direction that gives the most information on separating the classes. (B) The individual modality distribution for one modality (identical for the other one) for varying standard deviations (σ) of the individual class densities. As the classes become more diffuse (larger σ), the dip between the classes gets smaller, and the classes are harder to separate. (C) The plot of the joint density along the plane x1 = x2 (the most informative direction) for varying σ. Note that the classes are more separable in the higher-dimensional joint space for a given σ. (D) The minimizing disagreement energy (proportion of misclassified patterns) as a function of border position and σ. Note that for σ up to almost .6, the correct dividing borders (b1 = b2 = 0) are a local minimum (along b1 = b2 is the limiting direction); however, for more diffuse classes, there is no appropriate local minimum—only the global minima as x1, x2 → ±∞.
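The local-minimum structure described in panel D is easy to verify numerically. The sketch below (assuming SciPy; the setup follows the caption: equal priors, class means ±.5, conditionally independent modalities with identical gaussian marginals) evaluates the disagreement probability along the diagonal b1 = b2 = b:

    import numpy as np
    from scipy.stats import norm

    def disagreement_energy(b, sigma, means=(-0.5, 0.5), priors=(0.5, 0.5)):
        """P(the two modalities output different classes) when both place
        their border at b (each outputs class A iff its input is below b)."""
        energy = 0.0
        for mu, prior in zip(means, priors):
            F = norm.cdf(b, loc=mu, scale=sigma)    # P(input < b | class)
            energy += prior * 2.0 * F * (1.0 - F)   # exactly one side says A
        return energy

    # b = 0 should be a local minimum for small sigma but not for more
    # diffuse classes (cf. the caption's sigma ~ .6 boundary):
    for sigma in (0.4, 0.6, 0.8):
        e = [disagreement_energy(b, sigma) for b in (-0.2, 0.0, 0.2)]
        print(sigma, e[1] < e[0] and e[1] < e[2])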
3 Network Realization

The M-D algorithm can be realized by the three-layer network shown in Figure 1, where the codebook vectors are represented by the weights w from the input to the hidden units (also called codebook vector units) and the class labels are represented by the output units. The codebook labels are given implicitly by the weights v and u. The codebook vector units determine which class is chosen (their label) through the forward weights v and receive agreement information through the complementary back-projecting weights u.

Simple feedforward clustering using competitive learning is used to initialize the codebook vectors w in their respective input pattern spaces and subsequently the weights to the output units v. During this stage, the back-projecting weights u are kept consistent with the forward weights v by setting the back-projecting weights to the active hidden unit to Z—the 1-of-n activity vector over the output units driven by the forward weights from that activated unit. This results in backward weights of magnitude 1 from the output unit activated by the hidden unit and 0 from the others, and can be considered a form of fast Hebbian learning (Hebb, 1949).

Next, using the M-D rule, both modalities teach each other. For each paired pattern presentation, the output units are driven first by one modality and then by the other. The output units are driven by the forward-projecting weights v of the current driving modality. This output then augments the activity in the nondriving modality through the back-projecting weights u, which provide boosted activity to activated units that agree with the output of the driving modality. (For details, see the caption of Figure 1.) Figure 1 shows the case where modality 1 teaches or drives modality 2. The codebook vectors in the driven modality (w2 in the figure) are updated using a simplified version of the M-D rule. (This rule, derived by generalizing the one-dimensional rule, is very slightly different from the rules in equations 2.6 and 2.7 derived from differentiating in the multidimensional space, but is simpler and has performed as well or better.) Weights are updated only if the current pattern falls near the middle between two codebook vectors of different classes. (The specification of this "window" and the decrease in learning step size are as in Kohonen's (1990) supervised algorithm. The window is decreased with time, as in de Sa & Ballard, 1993a.) In this case:

w_{2i}(n+1) = \begin{cases} w_{2i}(n) & \text{if } Y_{2i} < A \\ w_{2i}(n) - \epsilon(n)\, \dfrac{X_2(n) - w_{2i}(n)}{\|X_2(n) - w_{2i}(n)\|} & \text{if } A \le Y_{2i} < B \\ w_{2i}(n) & \text{if } B \le Y_{2i} < C \\ w_{2i}(n) + \epsilon(n)\, \dfrac{X_2(n) - w_{2i}(n)}{\|X_2(n) - w_{2i}(n)\|} & \text{if } Y_{2i} \ge C. \end{cases} \qquad (3.1)
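A minimal sketch of this piecewise, activity-dependent update for one driven codebook unit (hypothetical names; the threshold values follow those quoted below for Figure 1) might look as follows:

    import numpy as np

    def driven_update(w2i, X2, Y2i, eps, A=0.4, B=1.4, C=1.7):
        """Activity-dependent update of equation 3.1: move away from the
        input at intermediate activity, toward it at high activity, and
        otherwise leave the codebook vector unchanged."""
        step = eps * (X2 - w2i) / np.linalg.norm(X2 - w2i)  # unit step toward input
        if A <= Y2i < B:
            return w2i - step
        if Y2i >= C:
            return w2i + step
        return w2i  # Y2i < A, or B <= Y2i < C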
This rule for updating the codebook vectors is a discrete version of the ABS rule (Artola & Singer, 1993) shown in graphical form in the inset in Figure 1. (In Figure 1, A = .4, B = 1.4, C = 1.7.) The nonpropagating forward
(v2) and backward weights (u1) are also updated at this stage. The backward weights of the winning driving unit (u_{1i^*}) are kept consistent with the forward weights (v1), as in the initialization stage, and the forward weights of the driven modality (v2) are updated to decrease the disagreement error by moving toward the output vector (Z) output by the other (driving) modality (already described). The forward weights to each output unit are kept normalized.

4 Results

The algorithm was demonstrated on the problem of learning to recognize consonant-vowel utterances both visually and acoustically. A speaker was recorded using an 8 mm camcorder and directional microphone as he spoke 118 repetitions of /ba/, /va/, /da/, /ga/, and /wa/. The first 98 samples of each utterance class formed the training set and the remaining 20 the test set. Each set of 10 utterances (twice through the set) was preceded by a clap using a clapboard arrangement similar to that used in commercial movie production for matching the visual and auditory signals. The camera recorded 30 frames a second and was positioned to view the tip of the nose through the chin of the speaker.

The acoustic data were low-pass filtered, and utterances were detected using threshold crossings of the smoothed time-domain waveform (using the ESPS (Entropic Signal Processing System) from Entropic Research Laboratory). As some of the consonantal information is low amplitude (before the threshold crossing), each utterance was taken from 50 msec before the automatically detected utterance start to 50 msec after. These utterances were then encoded using a 24-channel mel code² over 20 msec windows overlapped by 10 msec. This is a coarse, short-time frequency encoding, which crudely approximates peripheral auditory processing. Each feature vector was linearly scaled so that all dimensions lie in the range [−1, 1]. The final auditory code is a 216-dimensional (24 × 9) vector for each utterance. Example auditory feature vectors are shown in Figure 4.

The visual data were processed using software designed and written by Ramprasad Polana (Polana, 1994). Visual frames were digitized as 64 × 64 8-bit gray-level images using the Datacube MaxVideo system. The video and auditory tracks were aligned using the clapboard arrangement, and visual detection of the clap was performed manually, which allowed alignment to within one video frame (1/30 second). The frame of the clap was matched to the time of the acoustically detected clap, allowing the automatic segmentation obtained from the acoustic signal to be used to segment the video. Segments were taken as six frames before the acoustically determined utterance offset and four after. The normal flow was computed using
² Linear spacing below 1000 Hz and logarithmic above 1000 Hz.
Figure 4: Example auditory patterns. The x-axis within each feature vector represents frequency channels and the y-axis time. The area of the small squares within the feature vector corresponds to the magnitude, and the color to the sign (white positive, black negative), of the feature dimension.
differential techniques between successive frames. Each pair of frames was then averaged, resulting in five frames of motion over the 64 × 64 pixel grid. The frames were divided into 25 equal areas (5 × 5), and the motion magnitudes within each frame were averaged within each area. The final visual feature vector, of dimension 125 (5 frames × 25 areas), was linearly normalized as for the auditory vectors. Example visual feature vectors are shown in Figure 5.

The results are shown in Figure 6. After training, the visual network achieved a classification performance of 80% on the test set, while the auditory network had a test set performance of 93%. For comparison, the LVQ2.1 algorithm trained on the auditory data with the same architecture as the auditory subnetwork had a test set classification of 96%, and the supervised visual network, again with the same architecture as the corresponding subnet, 83%. The performance after the initial unsupervised clustering was 56%
Figure 5: Example visual patterns. These patterns correspond to the auditory patterns in Figure 4. The x-axis represents spatial positions and the y-axis time. The area of the small squares within the feature vector corresponds to the magnitude, and the color to the sign (white positive, black negative), of the feature dimension.
and 66% for the auditory and visual subnets, respectively, even though we helped this stage by weighting the auditory activity pattern 50% more than the visual pattern.

Slightly better results are obtained with artificially increased pairing. In these experiments, the data set was expanded by matching each auditory pattern of one class of utterances with each visual pattern of that class in the training set (not just the individual pattern with which it co-occurred). These results reflect the expected performance of the M-D algorithm with more data under the assumption that within an utterance class, the exact auditory and visual patterns are independent (and thus each auditory pattern is just as likely to have occurred with each visual pattern in the class).

Results on a preliminary multispeaker task were not as favorable, likely due to the greatly increased difficulty in the visual classification problem
Figure 6: Experimental results in the auditory-visual speech task. The left bar in each set (labeled V) gives the performance of the visual network, and the right bar (labeled A) shows the auditory network's performance. All bars represent categorization performance on the test set averaged over 30 experiments from random initial weights. The error bars represent one standard deviation across the runs. The pair of bars labeled "Initial Labeling" represents the performance after the initialization stage of unsupervised clustering in the respective input spaces and output label space. This gives the categorization ability of unsupervised clustering. The bars labeled "All-Pair Combinations" represent the use of an artificially increased data set obtained by matching each auditory pattern of one class of utterances with each visual pattern of that class in the training set (not just the individual pattern with which it co-occurred). Results for the related supervised algorithm (LVQ2.1) (Kohonen, 1990) using the same number of codebook vectors are also shown for comparison.
(supervised test performance on the visual data was about 60%). It is likely that a different visual encoding could improve this result, or possibly more data were required, since the artificially increased pairing did give results comparable to supervised performance. (However, the argument that this simulates collection of more data is less compelling in the multispeaker case.)

5 Discussion

The fact that both networks are simultaneously learning makes this problem significantly harder than approaches where one modality trains another (Munro, 1988; Carpenter, Grossberg, & Reynolds, 1991; Tan, 1995) or others that combine two already trained networks (Yuhas, Goldstein, & Sejnowski, 1988; Stork, Wolff, & Levine, 1992). The approach taken in this work and that of Becker and Hinton (1992), Becker (1996), Schmidhuber and Prelinger (1993), Phillips et al. (1995), and Kay et al. (1998) is to use the relationships between inputs to different networks to discover features in the individual networks that could not have been discovered simply by unsupervised learning in the individual spaces. This algorithm is more restricted than these other similar algorithms because it is limited to classification problems and uses 1-of-n or winner-take-all output encodings. On the other hand, it easily deals with real problems of many input dimensions. To the best of our knowledge, this is the largest problem attempted with this type of algorithm. It also deals easily, and has improved performance, with the addition of more modalities (for an example of a similar precursor algorithm, see de Sa & Ballard, 1993b).

One limit of the 1-of-n output encoding is that as the number of output units is fixed, the number of output classes must be chosen in advance. In our case, five output units were used because we were looking for five classes. For fewer output classes, the algorithm can simply group classes, and we would expect no change or even a decrease in the number of disagreements. For more output classes, we would expect an increase in the number of disagreements. We have done experiments with two, three, four, six, seven, and ten output classes. On the training data, these experiments show a greater increase in the number of disagreements after five classes, though the same curve on the test set gave only a smooth increase in disagreements with class size. Thus, it is possible that the number of classes might be recoverable from the data. However, this will require further development.

The algorithm is currently limited to hard winner-take-all decisions. Incorporating "soft" output decisions would be an easy modification, but making appropriate use of the resulting extra information to provide better teaching signals to the other modality is an interesting problem deserving more research. The uncertainty in classification in the driven hidden layer is reflected in the windowing; only patterns near a current border are able to
influence the border. This decision can be made softer by using nonuniform Parzen windows.

The M-D algorithm offers a straightforward computational model for why neurons in one sensory modality also respond to inputs to another sensory modality (Morrell, 1972; Fishman & Michael, 1973; Sams et al., 1991; Buser & Borenstein, 1959; Murata, Cramer, & Rita, 1965; Spinelli, Starr, & Barrett, 1968; Haenny, Maunsell, & Schiller, 1989; Maunsell, Sclar, Nealey, & DePriest, 1991). In fact, the algorithm is not limited to different sensory modalities but can also be used for submodal patterns such as color and motion. The key requirement is that there be some decorrelation of the instances of the different pairs of patterns. The model shows that without the huge cost of connecting all neurons to all sensory input, one can still take advantage of the greater structure available in the higher-dimensional multimodal sensory space. We suggest that cortical back-projections and multisensory integration may be doing more than affecting the properties of developed systems but may play an important role in the learning process itself.

Appendix: Minimizing Disagreement as an Approximation to Minimizing Misclassifications

Note that the algorithm to minimize the disagreement corresponds to the LVQ2.1 algorithm except that the "label" for each modality's pattern is the hypothesized output of the other modality. To understand how making use of this label, through minimizing the disagreement between the two outputs, relates to the true goal of minimizing misclassifications in each modality, consider the conditionally independent (within a class) version of the two-modality example illustrated in Figure 7.

In the supervised case (see Figure 7A), the availability of the actual labels allows sampling of the actual marginal distributions. For each modality, the number of misclassifications can be minimized by setting the boundaries for each modality at the crossing points of their marginal distributions. However, in the self-supervised system, the labels are not available. Instead we are given the output of the other modality. Consider the system from the point of view of modality 2. Its patterns are labeled according to the outputs of modality 1. This labels the patterns in class A as shown in Figure 7B. Thus, from the actual class A patterns, the second modality sees the "labeled" distributions shown. Letting a be the fraction of class A patterns that are misclassified by modality 1, the resulting distributions of the real class A patterns seen by modality 2 are (1 − a)P(C_A)p(x_2|C_A) and (a)P(C_A)p(x_2|C_A). Similarly, Figure 7C shows modality 2's view of the patterns from class B (given modality 1's current border). Letting b be the fraction of class B patterns misclassified by modality 1, the distributions are given by (1 − b)P(C_B)p(x_2|C_B) and (b)P(C_B)p(x_2|C_B). Combining the effects on both classes
Figure 7: An example of the joint and marginal distributions for the conditionally independent version of the example problem introduced in Figure 2. The darker gray represents patterns labeled A; the lighter gray are labeled B. (A) The labeling for the supervised case. (B) The labeling of class A patterns as seen by modality 2 given the modality 1 border shown. a represents the fraction of the class A patterns that are misclassified by modality 1. (C) The labeling of class B patterns as seen by modality 2 given the same modality 1 border. b represents the fraction of the class B patterns that are misclassified by modality 1. (D) The total pattern distributions seen by modality 2 given the labels determined by modality 1. These distributions can be considered as the labeled distributions on which modality 2 is performing a form of supervised learning. (However, it is more complicated; modality 1’s border is concurrently influenced by the current position of modality 2’s border.) See the text for details.
results in the "labeled" distributions shown in Figure 7D. The "apparent class A" distribution is given by (1 − a)P(C_A)p(x_2|C_A) + (b)P(C_B)p(x_2|C_B) and the "apparent class B" distribution by (a)P(C_A)p(x_2|C_A) + (1 − b)P(C_B)p(x_2|C_B). The crossing point of these two distributions occurs where the two apparent densities are equal; canceling the common terms shows that this is the value of x_2 for which (1 − 2a)P(C_A)p(x_2|C_A) = (1 − 2b)P(C_B)p(x_2|C_B). Com-
paring this with the crossing point of the actual distributions, which occurs at x_2 satisfying P(C_A)p(x_2|C_A) = P(C_B)p(x_2|C_B), reveals that if the proportion of class A patterns misclassified by modality 1 is the same as the proportion of class B patterns misclassified by modality 1 (i.e., a = b), the crossing points of the distributions will be identical. This is true even though the approximated distributions will be discrepant for all cases where there are any misclassified patterns (a > 0 or b > 0). If a ≈ b, the crossing point will be close. Simultaneously, the second modality is labeling the patterns to the first modality. At each iteration of the algorithm, both borders move according to the samples from the "apparent" marginal distributions.

Acknowledgments

This research was supported by grants from the Human Frontier Science Program, the National Institutes of Health Biotechnology Resource Program, and a Canadian NSERC 1967 Science and Engineering Scholarship. We thank Sue Becker, Peter Dayan, Jeff Schneider, Zoubin Ghahramani, Mary Hayhoe, Chris Brown, David Williams, and the anonymous reviewers for helpful comments and Ramprasad Polana and Ramesh Sarukkai for help with the visual and auditory pattern processing.

References

Artola, A., & Singer, W. (1993). Long-term depression of excitatory synaptic transmission and its relationship to long-term potentiation. Trends in Neurosciences, 16(11), 480–487.
Becker, S. (1996). Mutual information maximization: Models of cortical self-organization. Network: Computation in Neural Systems, 7, 7–31.
Becker, S., & Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355, 161–163.
Buser, P., & Borenstein, P. (1959). Réponses somesthésiques, visuelles et auditives, recueillies au niveau du cortex "associatif" infrasylvien chez le chat curarisé non anesthésié. Electroencephalog. Clin. Neurophysiol., 11, 285–304.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). Artmap: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.
Coultrip, R., Granger, R., & Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5, 47–54.
de Sa, V. R., & Ballard, D. H. (1993a). A note on learning vector quantization. In C. Giles, S. J. Hanson, & J. Cowan (Eds.), Advances in neural information processing systems 5 (pp. 220–227). San Mateo, CA: Morgan Kaufmann.
de Sa, V. R., & Ballard, D. H. (1993b). Self-teaching through correlated input. In F. H. Eeckman & J. M. Bower (Eds.), Computation and neural systems 1992 (pp. 437–441). Needham, MA: Kluwer Academic.
Diamantini, C., & Spalvieri, A. (1995). Pattern classification by the Bayes machine. Electronics Letters, 31(24), 2086–2088.
Durgin, F. H., & Proffitt, D. R. (1996). Visual learning in the perception of texture: Simple and contingent aftereffects of texture density. Spatial Vision, 9(4), 423–474.
Fishman, M. C., & Michael, C. R. (1973). Integration of auditory information in the cat's visual cortex. Vision Research, 13, 1415–1419.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121–134.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biological Cybernetics, 23, 187–202.
Haenny, P., Maunsell, J., & Schiller, P. (1989). State dependent activity in monkey visual cortex II. Retinal and extraretinal factors in V4. Experimental Brain Research, 69(2), 245–259.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Howells, T. (1944). The experimental development of color-tone synesthesia. Journal of Experimental Psychology, 34(2), 87–103.
Kay, J., Floreano, D., & Phillips, W. (1998). Contextually guided unsupervised learning using local multivariate binary processors. Neural Networks, 11(1), 117–140.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kohonen, T. (1990). Improved versions of learning vector quantization. In IJCNN International Joint Conference on Neural Networks (Vol. 1, pp. I545–I550).
MacDonald, J., & McGurk, H. (1978). Visual influences on speech perception processes. Perception and Psychophysics, 24(3), 253–257.
Maunsell, J., Sclar, G., Nealey, T., & DePriest, D. (1991). Extraretinal representations in area V4 of Macaque monkey. Visual Neuroscience, 7(6), 561–573.
McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.
Miikkulainen, R. (1991). Self-organizing process based on lateral inhibition and synaptic resource redistribution. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (pp. 415–420). Amsterdam: Elsevier.
Morrell, F. (1972). Visual system's view of acoustic space. Nature, 238, 44–46.
Munro, P. (1988). Self-supervised learning of concepts by single units and "weakly local" representations (Tech. Rep. No. LIS003/IS88003). Pittsburgh, PA: School of Library and Information Science, University of Pittsburgh.
Murata, K., Cramer, H., & Rita, P. B. (1965). Neuronal convergence of noxious, acoustic and visual stimuli in the visual cortex of the cat. Journal of Neurophysiology, 28, 1233–1239.
Phillips, W., Kay, J., & Smyth, D. (1995). The discovery of structure by multistream networks of local processors with contextual guidance. Network: Computation in Neural Systems, 6, 225–246.
Polana, R. (1994). Temporal texture and activity recognition. Unpublished doctoral dissertation, Department of Computer Science, University of Rochester.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Annals of Math. Stat., 22, 400–407.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–364). Cambridge, MA: MIT Press.
Rumelhart, D. E., & Zipser, D. (1986). Feature discovery by competitive learning. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 151–193). Cambridge, MA: MIT Press.
Sams, M., Aulanko, R., Hämäläinen, M., Hari, R., Lounasmaa, O. V., Lu, S.-T., & Simola, J. (1991). Seeing speech: Visual information from lip movements modifies activity in the human auditory cortex. Neuroscience Letters, 127, 141–145.
Schmidhuber, J., & Prelinger, D. (1993). Discovering predictable classifications. Neural Computation, 5, 625–635.
Sklansky, J., & Wassel, G. N. (1981). Pattern classifiers and trainable machines. Berlin: Springer-Verlag.
Spinelli, D., Starr, A., & Barrett, T. W. (1968). Auditory specificity in unit recordings from cat's visual cortex. Experimental Neurology, 22, 75–84.
Stork, D. G., Wolff, G., & Levine, E. (1992). Neural network lipreading system for improved speech recognition. In IJCNN International Joint Conference on Neural Networks (Vol. 2, pp. II286–II295).
Tan, A.-H. (1995). Adaptive resonance associative map. Neural Networks, 8(3), 437–446.
Wassel, G. N., & Sklansky, J. (1972). Training a one-dimensional classifier to minimize the probability of error. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(4), 533–541.
Yuhas, B., Goldstein, M. W., Jr., & Sejnowski, T. J. (1988). Neural network models of sensory integration for improved vowel recognition. Proceedings of the IEEE, 78(10), 1658–1668.
Zellner, D. A., & Kautz, M. A. (1990). Color affects perceived odor intensity. Journal of Experimental Psychology: Human Perception and Performance, 16(2), 391–397.

Received January 8, 1997; accepted October 8, 1997.
LETTER
Communicated by Sidney Lehky
A Hierarchical Model of Binocular Rivalry Peter Dayan Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
Binocular rivalry is the alternating percept that can result when the two eyes see different scenes. Recent psychophysical evidence supports the notion that some aspects of binocular rivalry bear functional similarities to other bistable percepts. We build a model based on the hypothesis (Logothetis & Schall, 1989; Leopold & Logothetis, 1996; Logothetis, Leopold, & Sheinberg, 1996) that alternation can be generated by competition between top-down cortical explanations for the inputs, rather than by direct competition between the inputs. Recent neurophysiological evidence shows that some binocular neurons are modulated with the changing percept; others are not, even if they are selective between the stimuli presented to the eyes. We extend our model to a hierarchy to address these effects.

Neural Computation 10, 1119–1135 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

If one's eyes are presented with two different but very low-contrast stimuli, as shown in Figure 1, then the overall percept is of the sum or composition of the stimuli (Liu, Tyler, & Schor, 1992). However, as the stimuli are made higher contrast, there comes a point when it appears as if the inputs from the eyes rival. First one dominates, then the other, with stochastic switching between the two. Figure 1 shows the case of horizontal and vertical gratings (the grid lines and boxes A and B are for later descriptive convenience and are not presented), but more complex patterns are also often used. If the stimuli are large, then one single stimulus may not dominate across the entire field; rather, there will be a mosaic of patches, with different stimuli dominating in each patch (Wheatstone, 1838; Levelt, 1965). The dynamics of rivalry are sensitive to the contrast of the stimuli in the eyes if they are different, with such characteristic results as that increasing the contrast of one stimulus decreases the time during which that stimulus is suppressed much more than it increases the time that it is dominant (Levelt, 1965; Fox & Rasche, 1969; Blake, 1977; Mueller & Blake, 1989). There are also effects of the nature of the stimuli—for instance, if two separate patterns are divided up between the two stimuli, then in certain cases, the patterns will rival rather than the stimuli directly (Whittle, Bloor, & Pocock, 1968; Kovacs, Papathomas, Yang, & Feher, 1996), and there is some evidence that familiar
Figure 1: Rivalrous gratings. Rivalrous stimuli for the left and right eyes consisting of horizontal and vertical gratings, respectively. The grid lines and the dotted boxes are for descriptive purposes and are not presented. Boxes A show the basic competitive element in the model, between short horizontal and vertical parts of the gratings; boxes B show the essential repeating unit that the stimulus comprises.
patterns enjoy an advantage over unfamiliar ones during rivalry (see Yu & Blake, 1992).

It is natural to suppose that this rivalry is instantiated in the parts of the visual pathway that are still monocular, that is, the lateral geniculate nucleus (LGN) and layer IV of V1. Indeed, most models of rivalry implicitly or explicitly make this assumption (Matsuoka, 1984; Lehky, 1988; Blake, 1989; Mueller, 1990; Lehky & Blake, 1990), using various forms of reciprocal inhibition between two pathways and thus capturing many of the intricacies of the dynamics of rivalry. It turns out that the activities of neurons in the LGN are not affected by rivalry (Lehky & Maunsell, 1996), leaving layer IV of V1 as the candidate for this class of models. These models could be augmented with some top-down processing to capture the familiarity and pattern-based effects. However, they are directly challenged by psychophysical data from Logothetis et al. (1996) and are hard to reconcile with the neurophysiological data from Logothetis and Schall (1989) and Leopold and Logothetis (1996).

Logothetis et al. (1996) switched rivalrous patterns quickly between the two eyes (see also Blake, Westendorf, & Overton, 1980) while constantly flickering the stimuli. Subjects report that the perceptual switching time is much greater than the actual switching time, which is inconsistent with the hypothesis that there is a dominant eye rather than a dominant pattern. Of course, there could be eye-based competition as well (Wales & Fox, 1970; Fox & Check, 1972; Blake & Fox, 1974).
Leopold and Logothetis (1996) trained monkeys to report their percept during rivalrous and nonrivalrous stimuli while recording from neurons in V1/2 and V4. They found that striate monocular neurons are unaffected by rivalry; that there are binocular neurons in all areas that are selective between the stimuli during binocular presentation and whose activities are not modulated with the monkey's percept; that there are binocular neurons in all areas that are selective between the stimuli during binocular presentation and whose activities are modulated with the monkey's percept; and that there are binocular neurons in all areas that are selective between the stimuli during binocular presentation whose activities are elevated during perceptual suppression of their preferred stimuli; and also binocular neurons that are not selective between the stimuli in binocular viewing, but whose activities are nevertheless modulated during rivalry. Logothetis and his colleagues have long suggested an account of rivalry under which it is cortical explanations of sensory input that compete rather than the inputs themselves. Various recent models of cortical processing are based on the old notion of analysis by synthesis (MacKay, 1956; Grenander, 1976; Mumford, 1994; Carpenter & Grossberg, 1987; Pece, 1992; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Olshausen & Field, 1996; Rao & Ballard, 1997). For these, the synthetic model, which is usually instantiated in top-down connections in cortex, exactly constructs a top-down explanation for input, and an analysis procedure finds which particular synthetic explanation is appropriate for a given input. In this article, we consider one form of analysis-by-synthesis model and show how it can exhibit rivalry between explanations in the case that the eyes receive different input. This model can provide an account for many of the behaviors described above. Section 2 discusses a simplified case of rivalry to illustrate the basic principles of the model, based on the contents of boxes A in Figure 1; section 3 describes a more complete model with three layers of units in a hierarchy, based on boxes B of Figure 1; the implications of the model are discussed in section 4.

2 The Simple Model

Figure 2a shows a simple abstract model illustrating competition between cortical explanations. It is taken from boxes A of Figure 1, representing the minimal competitive unit in that stimulus. The grating consists of pairs of horizontal and vertical bars, to enhance the strength of the signal. For illustrative convenience, the pairs have been separated. There is no special significance to the spatial order of the input units. In Figure 2, w1 and w2 model two binary-valued notional striate units, and layer z models 32 binary-valued geniculate units, 16 each for the left (L) and right (R) eyes. In the generative model, turning w1 on activates two binocular horizontal bars in the input z.
Figure 2: (a) Simple generative model. Units w1 and w2 are shown in the form of their projective fields (e.g., w2 generates two vertical bars binocularly in the 4×4 left (L) and right (R) geniculate units, z), together with a subset of the weights. The other weights follow similarly. bw = −2 and bz = −3 are the generative biases, Jwz = 5.8 are the generative weights from w to z. (b) Rivalrous input pattern. Horizontal input is provided to the geniculate units for the left eye (z1 , . . . , z16 ), with strength l (e.g., P[z1 = 1] = σ (l)); vertical input to those for the right eye (z17 , . . . , z32 ), with strength r. Silent units have input φ such that σ (φ) = 0.01.
We therefore say that w1 explains the input activity, if the input were really to consist of two binocular horizontal bars. Similarly, the activity of w2 explains two binocular vertical bars in the input. More formally, the explanations arise as the analysis or recognition phase of an analysis-by-synthesis model of cortical function. The top-down, synthetic model specifies successively the probabilities P[w] and P[z|w] according to

$$P[w_k = 1] = \sigma(b_w), \qquad P[z_i = 1 \mid w] = \sigma\!\left(b_z + \sum_{k=1}^{2} w_k J_{wz}^{ki}\right), \tag{2.1}$$
where
$$\sigma(x) = \frac{1}{4000}\left\{1 + \frac{3998}{\sqrt{2\pi}} \int_{t=-\infty}^{x} e^{-t^2/2}\, dt\right\}$$

is a normal distribution function, squashed to avoid infinities, and the wk and zi are independent (the latter, given w). The parameters of the generative model, the weights Jwz, and the biases bw and bz are shown in the diagram. They were set by hand such that in the generative model, w1 and w2 are active only rarely (i.e., activity in the w layer is sparse), but are almost sure to produce their favored pattern in z if they do fire. In general, these weights would be learned from experience of horizontal and vertical contours (Hinton et al., 1995; Dayan et al., 1995; Saul, Jaakkola, & Jordan, 1996).
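To make the generative side concrete, here is a small sketch of sampling from this model. The squashed form of σ follows the reconstruction above, and the 4 × 4 bar layout (rows and columns 0 and 2 carrying the bars) is an illustrative assumption of the sketch, not a detail fixed by the text.

```python
import numpy as np
from scipy.special import ndtr   # standard normal distribution function

rng = np.random.default_rng(0)
b_w, b_z, J_wz = -2.0, -3.0, 5.8            # generative parameters of Figure 2

def sigma(x):
    # squashed normal distribution function (one reading of the formula above)
    return (1.0 + 3998.0 * ndtr(x)) / 4000.0

# Projective fields: w1 -> two binocular horizontal bars, w2 -> two binocular
# vertical bars, in the 4x4 left (z[:16]) and right (z[16:]) grids.
J = np.zeros((2, 32))
for eye in (0, 16):
    for k in (0, 2):                        # which rows/columns carry bars
        J[0, [eye + 4 * k + c for c in range(4)]] = J_wz      # horizontal bar, row k
        J[1, [eye + 4 * row + k for row in range(4)]] = J_wz  # vertical bar, column k

def sample():
    w = (rng.random(2) < sigma(b_w)).astype(float)            # P[w_k = 1] = sigma(b_w)
    z = (rng.random(32) < sigma(b_z + w @ J)).astype(float)   # equation 2.1
    return w, z
```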
Since the units in the generative model are binary, we cannot model differing input contrasts directly by changing the level of activity of the zi. Rather, we represent the input to zi as di, where P[zi = 1] = σ(di) and all the zi are independent. Write P[z; d] as the induced probability distribution over the input units. Recognition is formally the statistical inverse to generation. For the network in Figure 2a, it should produce P[w|z] over the four choices for w, given a particular input. In this case, it would be easy to calculate these probabilities exactly. However, in general, this is computationally intractable, since if there are n units, then there are 2^n probabilities. Further, we require a way of representing these 2^n probabilities in terms of just the activities of the n units. Inspired by Saul et al. (1996) and Jaakkola, Saul, and Jordan (1996), we achieve both of these by using a mean-field inversion method. This approximates P[w|z] by the parameterized factorial form

$$Q[w; \mu] = \prod_i \sigma(\mu_i)^{w_i} \left(1 - \sigma(\mu_i)\right)^{1-w_i}. \tag{2.2}$$
This sets the mean activity of w1 to be σ(μ1). Note that the μi are real values that, through equation 2.2, parameterize a distribution over the binary-valued wi. We model the activities of cells as the real-valued σ(μi). Mean-field methods would use a descent method to optimize the parameters μ to minimize the mean Kullback-Leibler divergence between Q[w; μ] and P[w|z]:

$$F[\mu] = \sum_z P[z; d] \sum_w Q[w; \mu] \log \frac{Q[w; \mu]}{P[w \mid z]}.$$

The simplest model of gradient descent has

$$\mu_1(t+1) = \mu_1(t) - \delta \nabla_{\mu_1} F[\mu(t)] = \mu_1(t) - \delta \left( \log\!\left[ \frac{\sigma(\mu_1(t))\,\sigma(-b_w)}{\sigma(-\mu_1(t))\,\sigma(b_w)} \right] + \sigma'(\mu_1(t)) \left( P_{10} - P_{00} + \sigma(\mu_2(t)) \left[ P_{11} - P_{10} - P_{01} + P_{00} \right] \right) \right), \tag{2.3}$$

where

$$P_{ab} = \sum_i \sigma(d_i) \log P[z_i = 1 \mid w_1 = a, w_2 = b] + \sigma(-d_i) \log P[z_i = 0 \mid w_1 = a, w_2 = b],$$

and δ acts like an adaptation rate. In this simple case, calculating these terms requires only operations local to each unit, although the operations are somewhat complicated.
Jaakkola et al. (1996) provide a further approximation that simplifies these calculations. In this, unit zi passes back to w1 and w2 information about how it is incorrectly predicted by w1 and w2:

$$\nabla_w \left( d_i - b_z - \sum_k \sigma(w_k) J_{wz}^{ki} \right)^{2}.$$
We found this model to work slightly less well. Note that the mean-field model affects only the activities in the w layer and does not affect the inputs, even though there are top-down inputs to those units. Lehky and Maunsell (1996) resolved conclusively that the activities of neurons in the LGN of macaque monkeys are not modulated during rivalry (see Varela & Singer, 1987), in clear contrast with the data cited above from cortical cells. In the hierarchical model in the next section, there are top-down influences on the activities of modeled cortical (but not modeled thalamic) cells. If a nonrivalrous input is presented, with just horizontal bars in both channels, then recognition assigns full responsibility to w1. Rivalry results when different inputs are presented to the two eyes. For inputs such as Figure 2b, {w1 = 1; w2 = 0} and {w1 = 0; w2 = 1} are equally good explanations (albeit worse than in the nonrivalrous case). Explanation {w1 = 0; w2 = 0} is poor because it does not account for any input; {w1 = 1; w2 = 1} is poor because activity across w should be sparse, according to the generative model, and w1 = 1 explains away (Pearl, 1988) the need for w2 = 1 for those elements of z that are common between horizontal and vertical bars. Note that w1 and w2 compete even though there are no explicit inhibitory interconnections between them in the generative model. The recognition model of a Helmholtz machine (Hinton et al., 1995) is unsuitable for modeling rivalry, since it acts in a purely bottom-up direction in such a way that it lacks the capacity to capture explaining away (Dayan & Hinton, 1996), on which this model of rivalry crucially depends. This is one reason that we used a mean-field method instead (Saul et al., 1996). If the dynamics were just determined by equation 2.3, then the activities would tend to one of the two equivalently good explanations (which are global minima of F) and stay there. We therefore implemented a simple oscillatory model with auxiliary variables μ′k(t) implementing a form of fatigue process. The full dynamics for μ1(t) and μ′1(t) are

$$\mu_1(t+1) = \mu_1(t) + \delta\left(-\nabla_{\mu_1} F[\mu(t)] + \alpha\big(\beta\mu_1(t) - \mu'_1(t)\big)\right)$$
$$\mu'_1(t+1) = \mu'_1(t) + \delta\left(\mu_1(t) - \beta\mu'_1(t)\right),$$

where β is a decay term. A similar equation applies for μ2(t) and μ′2(t). In all the simulations, α = 0.5, β = 0.1, and δ = 0.01. The factor 1/δ now plays the role of a time constant for the network.
As with most models of rivalry (see Lehky, 1988, for a notable exception), we are modeling data on the mean dominance times and are ignoring the stochasticity of the data. Based on this simple oscillatory process, the model effectively switches between horizontal ({w1 = 1; w2 = 0}) and vertical ({w1 = 0; w2 = 1}) explanations. Figure 3a shows the resulting activities of w1 and w2 for a case in which the input to the horizontal bars (l) is stronger than to the vertical (r). Alternations ensue, with a greater dominance period for w1 than for w2. Figure 3b shows that, as empirically observed, when the input strengths for both patterns are increased together (modeling increasing contrast), the oscillations speed up (Levelt, 1965; Fox & Rasche, 1969), and when just r is varied, it has a significantly greater effect on the period for which the vertical explanation is suppressed (i.e., the horizontal explanation is dominant) than on the period for which it is dominant (Levelt, 1965; Fox & Rasche, 1969; Blake, 1977; Mueller & Blake, 1989; Leopold & Logothetis, 1996). This achieves the effect of mutual inhibition (Fox & Rasche, 1969; Matsuoka, 1984; Lehky, 1988; Mueller, 1990) between w1 and w2, dependent on input contrast (Mueller, 1990), by statistically justifiable means. Furthermore, for very weak inputs, both w1 and w2 are weakly activated, which is the model's account of the psychophysical observation that fusion rather than rivalry occurs for very low-contrast stimuli. Also, if the eyes are provided with binocularly consistent inputs within a reasonable range of contrast, then the system does not oscillate. We have therefore shown that it is possible to get rivalry between cortical explanations for input, using a mean-field inversion method for a top-down generative model. In this case, the final model resembles existing models for rivalry in which there is competition among binocular-oriented units rather than within a monocular system (Grossberg, 1987). Indeed, Sengpiel, Blakemore, and Harrad (1995) studied interocular suppression of activity in binocular cells when the two eyes were presented with gratings of orthogonal orientations. In the mean-field model, this suppression arises as a consequence of explaining away during the process of recognition and has a precise relationship with the underlying top-down generative model.

3 The Hierarchical Model

The simple model is too small to be able to have populations of units that are and are not modulated with rivalry, as in the neurophysiological data. We therefore extended it to a hierarchy of units covering a larger spatial array, incorporating various characteristics of cortical visual processing. The hierarchical version is intended to capture the processing of boxes B in Figure 1. Boxes B were chosen to capture the minimal repeating unit in the stimulus. No smaller box will suffice; for instance, boxes A miss the portions of the stimuli that do not directly compete. No larger box is necessary, since it would only represent copies of boxes B. Since the model operates by constructing explanations, it is, of course, vital to choose appropriately
Figure 3: (a) Development and maintenance of oscillations in the mean activities of w1 = σ(μ1) and w2 = σ(μ2) over time (activity versus iteration). Input strengths l = 1.25 and r = 1.0, so the horizontal bars dominate. (b) Dependence on the input strength (modeling contrast) in l and r of the periods of suppression and dominance (oscillation period versus test vertical "contrast" r). Horizontal patterns are taken as dominant when the mean activity of w1 is greater than the mean activity of w2. There is no switching reaction time. For the equal contrast case, l and r were varied together; for suppression and dominance plots, l = 1.25 was constant; r was varied.
Figure 4: Hierarchical generative model for 8×8 bar patterns across the two eyes. Units are depicted by their net projective (generative) fields, and characteristic weights are shown. Even though the net projective field of x1 is the top horizontal bar in both eyes, note that it generates this by increasing the probability that units y1 and y9 in the y layer will be active, not by having direct connections to the input z. Unit w1 connects to x1 , x2 , . . . , x8 through Jwx = 0.8; x16 connects to y31 , y32 through Jxy = 1.0, and y32 connects to the bottom right half vertical bar through Jyz = 5.8. Biases are bw = −0.75, bx = −1.5, by = −2.7, and bz = −3.3. Each unit in the z layer is really a pair of units (as in Hinton et al., 1995), to increase the strength of the signal.
those portions of the input that are to be explained. Figure 4 shows the full generative model. Units in layers y (crudely modeling V1) and x and w (modeling early and late extrastriate areas) are all binocular and jointly explain successively more complex features in the input z according to a top-down generative model. Apart from the half bars in y, the generative model is similar to that
learned by the Helmholtz machine (Dayan et al., 1995), for which increasing complexity in higher layers rather than increasing input scale is key.1 In this case, for instance, w2 specifies the occurrence of vertical bars anywhere in the 8 × 8 input grids; x16 specifies the right-most vertical bar; and y31 and y32 the top and bottom halves of this vertical bar, respectively. Again, these specifications are provided by a top-down generative model in which, as in equation 2.1, the activations of units are specified by probabilities such as

$$P[y_i = 1 \mid x] = \sigma\!\left(b_y + \sum_k x_k J_{xy}^{ki}\right),$$
where the sum over k runs over all the units in the x layer. In this more complicated model, activities of units in different layers could conflict. For instance, unit w1 could be activated, suggesting that there are horizontal bars in the input; but units x15 and x16 could also be active, suggesting that there are two particular vertical bars. Such patterns of activity are unlikely, since they are inconsistent with the generative model, and we never observed them with the settings of the weights that we adopted. We therefore model the percept of the network as the activity in the w layer. A similar mean-field method is used to perform recognition in this hierarchical model. The equivalent mean-field distribution is
$$Q[w, x, y; \mu, \xi, \psi] = Q[w; \mu]\, Q[x; \xi]\, Q[y; \psi],$$

which renders independent all the units in the model. The equivalent of F now depends on μ, ξ, and ψ:

$$F[\mu, \xi, \psi] = \sum_z P[z; d] \sum_{w,x,y} Q[w, x, y; \mu, \xi, \psi] \log \frac{Q[w, x, y; \mu, \xi, \psi]}{P[w, x, y \mid z]}.$$

We adopted various heuristics to simplify the process of using this rather cumbersome mean-field model. First, fatigue is implemented only for the units in the y layer, and the ψ follow the equivalent of the dynamical equations above. Although adaptation processes can clearly occur at many levels in the system, their exact form is not clear. Bialek and DeWeese (1995) argue that the rate of a switching process should be adaptive to the expected rate of change of the associated signal on the basis of prior observations. This is clearly faster nearer to the input. The second heuristic is that rather than perform gradient descent for the nonfatiguing units, the optimal values of μ and ξ are calculated on each iteration by numerically solving equations such as ∇ξi F[μ, ξ, ψ] = 0.

1 Although the recognition model of the Helmholtz machine is not used, since it does not capture explaining away.
Figure 5: Recognition activity in the network for four different input patterns. The units are arranged in the same order as Figure 4, and white and black squares imply activities for the units whose means are less than and greater than 0.5. (i, ii) Normal binocular stimulation. (iii, iv) The two alternative stable states during rivalrous stimulation, without the fatigue process.
The dearth of connections in the network of Figure 4 allows µ and ξ to be calculated locally at each unit in an efficient manner. Whether this is reasonable depends on the time constants of settling in the mean-field model with respect to the dynamics of switching and, more particularly, on the way that this deterministic model is made appropriately stochastic. Top-down connections are allowed to influence the activities of the units in layers x and y. This is necessary in general to coordinate the explanations for distant parts of the input and to provide a means by which top-down information can influence the course of rivalry. As in the simpler model, and following the data of Lehky and Maunsell (1996), the activities of units in layer z are not affected by top-down influences, although this is not for a principled reason in the model. Figure 5 shows the activities of units in response to binocular horizontal (i) and vertical (ii) bars. In these cases, there are no oscillations. Figure 5 also shows the two equally likely explanations for rivalrous input (iii and iv). For rivalry, there is direct competition in the top left-hand quadrant of z, as in Figure 2, which is reflected in the competition between y1 , y3 and y17 , y21 . However, the input regions (top right of L and bottom left of R), for which there is no competition, require the constant activity of explanations y9 , y11 , y18 , and y22 . Under the generative model, the coactivation of y1 and y9 without x1 is quite unlikely (P [x1 = 0|y1 = 1, y3 = 1] = 0.1), which is why x1 , x3 , and also w1 become active with y1 and y3 . Figure 6a shows the resulting activities during rivalry of units at various levels of the hierarchy, including the fatigue process. Broadly, the competing explanations in Figure 5 (iii and iv)—that is, the horizontal and vertical percepts—alternate, and units without competing inputs, such as y9 ,
Figure 6: (a) Mean activities of units at three levels of the hierarchy in response to rivalrous stimuli with input strengths l = r = 1.75. (b) Contrast dependence of the oscillation periods. The dash-dotted line shows the period when the contrasts in both eyes are varied together. The solid and dashed lines show the periods of dominance of the left and right eyes, respectively, when the contrast in the left eye is fixed (l = 1.25) and the contrast in the right eye r is varied.
are much less modulated than the others, such as y1. The activity of y9 is slightly elevated when horizontal bars are dominant, based on top-down connections. The activities of the units higher up, such as x1 and w1, do not decrease to 0 during the suppression period for horizontal bars, leaving weak activity during suppression. Leopold and Logothetis (1996) observed that many of their modulating cells were not completely silent during their periods of less activity. Figure 6b shows that the hierarchical version of the model also behaves in accordance with experimental results on the effects of varying the input contrast (Levelt, 1965; Fox & Rasche, 1969; Blake, 1977; Mueller & Blake, 1989; Leopold & Logothetis, 1996).

4 Discussion

Following Logothetis and his colleagues (Logothetis & Schall, 1989; Leopold & Logothetis, 1996; Logothetis et al., 1996; see also Grossberg, 1987), we have suggested an account of rivalry based on competing top-down hierarchical explanations. Neurons explain inputs by virtue of being capable of generating their activities through a top-down statistical generative model. Competition arises between higher-level explanations of overlapping active regions (those involving contrast changes) of the input rather than between inputs themselves. The overall model mechanistically has much in common with models that place the competition in rivalry at the level of binocular-oriented cells
rather than between monocular cells (see Grossberg, 1987; Blake, 1989). Indeed, the model is based on an explanation-driven account for normal binocular processing, so this is to be expected. The advantage of couching rivalry in terms of explanations is that this provides a natural way of accounting for top-down influences, which are clear in such phenomena as the influence of perceptual organization on rivalry (Whittle et al., 1968; Kovacs et al., 1996). In fact, one can hope to study top-down control through studying its effects on the behavior of cells during rivalry. The model would also explain other sorts of alternation phenomena (such as those that arise with the Necker cube) in terms of competition between top-down explanations. The top-down model governs which units should compete with each other. The model correctly captures a number of the experimental characteristics of rivalry. If the input stimuli are weak, then there is no alternation, and instead both horizontal and vertical representing neurons are weakly activated (as in Liu et al., 1992). If input stimuli are stronger, then alternation ensues. The period of the alternation increases as the contrast of both the stimuli decreases, and if the contrast of only one stimulus increases, then the dominance period of the other stimulus decreases substantially more than the dominance period of the given stimulus increases (as in Levelt, 1965; Fox & Rasche, 1969; Blake, 1977; Mueller & Blake, 1989). There are two classes of binocular units activated by the rivalrous stimulus. The activity of one class is substantially modulated during rivalry; the activity of the other is not (as in Leopold & Logothetis, 1996). Alternating the input between the two eyes has absolutely no effect on this behavior of the model (as in Logothetis et al., 1996). The last effect arises since, apart from the input layer, on which there are no top-down influences, all the units are binocular, and there is no static or dynamic difference in the connections from the two eyes. Although it captures these phenomena, the model is simplified and incomplete. In particular, it does not exhibit two of the phenomena that Leopold and Logothetis (1996) observed. The first is that there is no opportunity in the model for monocular cells to be unmodulated during rivalry, as they found. Given redundant inputs and an extra layer of monocular units between layers z and y, this behavior would be expected. These units would explain away the redundancy in the input and, like unit y9 in Figure 6, would have to be activated consistently during rivalry. The second lacuna is that there are no units in the model that are selective between the stimuli when presented binocularly and are preferentially activated during suppression of their preferred stimuli during rivalry, or are not selective during binocular presentation but are selective during rivalry. In a model with more complicated stimulus contingencies, such units would emerge to account for the parts of the stimulus in the suppressed eye that are not accounted for by the explanation of the overlying parts of the dominant explanation, at least provided that this residual between the true
monocular stimulus and the current explanation is sufficiently complex as to require explaining itself. This suggests the experimental test of presenting binocularly a putative form of the residual (e.g., dotted lines for competing horizontal and vertical gratings). We predict that these cells should be activated. One might expect some of these cells to participate in the explanation of the patterns when presented binocularly, whereas the activity of others would be explained away during binocular presentation, only to emerge during suppression. Other extensions are also desirable. Foremost, it is necessary to model the stochasticity of switching between explanations (Fox & Herrmann, 1967; Levelt, 1965). The distributions of dominance times for both humans and monkeys have traditionally been characterized in terms of a gamma distribution and, more recently, in terms of a log normal distribution (Lehky, 1995), with independence between successive dominance periods. Our mean-field recognition process is deterministic. The stochastic analog would be some form of Markov chain Monte Carlo method such as Gibbs sampling (see Neal, 1993). However, it is not obvious how to incorporate the equivalent of fatigue in a computationally reasonable way. In any case, the nature of neuronal randomness is subject to significant debate at present. We have adopted a very simple mean-field approach to recognition, giving up neurobiological plausibility for convenience. The determinism of the mean-field model in any case rules it out as a complete explanation, but it at least shows clearly the nature of competition between explanations. The architecture of the model is also incomplete. The cortex is replete with what we would model as lateral connections between units within a single layer. We have constructed generative models in which there are no such direct connections, because they significantly complicate the mean-field recognition method. These connections are certainly important for the recognition process (Dayan & Hinton, 1996), but modeling their effect would require representing them explicitly. This would also allow modeling of the apparent diffusive process by which patches of dominance spread and alter. In a complete model, it would also be necessary to account for competition between eyes in addition to competition between explanations (Wales & Fox, 1970; Fox & Check, 1972; Blake & Fox, 1974). Another extension is some form of contrast gain control (Carandini & Heeger, 1994). The model is quite sensitive to input contrast, which is obviously important for the effects shown in Figures 3 and 6. However, the range of contrasts over which it works should be larger. Achieving this will likely require a statistical model with real-valued rather than binaryvalued activities. It would be particularly revealing to explore the effects of changing the contrast in some parts of images and examine the consequent effects on the spreading of dominance, particularly in images as large as the full Figure 1 rather than just the portion in boxes B that the existing model addresses.
Acknowledgments

I am grateful to Bart Anderson, Adam Elga, Geoff Goodhill, Geoff Hinton, David Leopold, Nikos Logothetis, Earl Miller, Read Montague, Bruno Olshausen, Pawan Sinha, and particularly Zhaoping Li, Tommi Jaakkola, and Rich Zemel for discussion and comments on earlier drafts. This work was supported by NIMH grant 1 R29 MH 55541-01.

References

Bialek, W., & DeWeese, M. (1995). Random switching and optimal processing in the perception of ambiguous signals. Physical Review Letters, 74, 3077–3080.
Blake, R. (1977). Threshold conditions for binocular rivalry. Journal of Experimental Psychology: Human Perception and Performance, 3, 251–257.
Blake, R. (1989). A neural theory of binocular rivalry. Psychological Review, 96, 145–167.
Blake, R., & Fox, R. (1974). Binocular rivalry suppression: Insensitive to spatial frequency and orientation change. Vision Research, 14, 687–692.
Blake, R., Westendorf, D. H., & Overton, R. (1980). What is suppressed during binocular rivalry? Perception, 9, 223–231.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Carpenter, G., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics and Image Processing, 37, 54–115.
Dayan, P., & Hinton, G. E. (1996). Varieties of Helmholtz machine. Neural Networks, 9, 1385–1403.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Fox, R., & Check, R. (1972). Independence between binocular rivalry suppression duration and magnitude of suppression. Journal of Experimental Psychology, 93, 283–289.
Fox, R., & Herrmann, J. (1967). Stochastic properties of binocular rivalry alternations. Perception and Psychophysics, 2, 432–436.
Fox, R., & Rasche, F. (1969). Binocular rivalry and reciprocal inhibition. Perception and Psychophysics, 5, 215–217.
Grenander, U. (1976–1981). Lectures in pattern theory I, II and III: Pattern analysis, pattern synthesis and regular structures. Berlin: Springer-Verlag.
Grossberg, S. (1987). Cortical dynamics of three-dimensional form, color and brightness perception: 2. Binocular theory. Perception and Psychophysics, 41, 117–158.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1160.
Jaakkola, T., Saul, L. K., & Jordan, M. I. (1996). Fast learning by bounding likelihoods in sigmoid type belief networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems 8. Cambridge, MA: MIT Press.
Kovacs, I., Papathomas, T. V., Yang, M., & Feher, A. (1996). When the brain changes its mind: Interocular grouping during binocular rivalry. Proceedings of the National Academy of Sciences of the United States of America, 93, 15508–15511.
Lehky, S. R. (1988). An astable multivibrator model of binocular rivalry. Perception, 17, 215–228.
Lehky, S. R. (1995). Binocular rivalry is not chaotic. Proceedings of the Royal Society of London, Series B: Biological Sciences, 259, 71–76.
Lehky, S. R., & Blake, R. (1991). Organization of binocular pathways: Modeling and data related to rivalry. Neural Computation, 3, 44–53.
Lehky, S. R., & Maunsell, J. H. (1996). No binocular rivalry in the LGN of alert macaque monkeys. Vision Research, 36, 1225–1234.
Leopold, D. A., & Logothetis, N. K. (1996). Activity changes in early visual cortex reflect monkeys' percepts during binocular rivalry. Nature, 379, 549–554.
Levelt, W. J. M. (1965). On binocular rivalry. Soesterberg, Netherlands: Institute for Perception RVO-TNO.
Liu, L., Tyler, C. W., & Schor, C. (1992). Failure of rivalry at low contrast: Evidence of a suprathreshold binocular summation. Vision Research, 32, 1471–1479.
Logothetis, N. K., Leopold, D. A., & Sheinberg, D. L. (1996). What is rivalling during binocular rivalry? Nature, 380, 621–624.
Logothetis, N. K., & Schall, J. D. (1989). Neuronal correlates of subjective visual perception. Science, 245, 761–763.
MacKay, D. M. (1956). The epistemological problem for automata. In C. E. Shannon & J. McCarthy (Eds.), Automata studies (pp. 235–251). Princeton, NJ: Princeton University Press.
Matsuoka, K. (1984). The dynamic model of binocular rivalry. Biological Cybernetics, 49, 201–208.
Mueller, T. J. (1990). A physiological model of binocular rivalry. Visual Neuroscience, 4, 63–73.
Mueller, T. J., & Blake, R. (1989). A fresh look at the temporal dynamics of binocular rivalry. Biological Cybernetics, 61, 223–232.
Mumford, D. (1994). Neuronal architectures for pattern-theoretic problems. In C. Koch & J. Davis (Eds.), Large-scale theories of the cortex (pp. 125–152). Cambridge, MA: MIT Press.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Pece, A. E. C. (1992). Redundancy reduction of a Gabor representation: A possible computational role for feedback from primary visual cortex to lateral geniculate nucleus. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks, 2 (pp. 865–868). Amsterdam: Elsevier.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation, 9, 721–764.
Saul, L. K., Jaakkola, T., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Sengpiel, F., Blakemore, C., & Harrad, R. (1995). Interocular suppression in the primary visual cortex: A possible neural basis of binocular rivalry. Vision Research, 35, 179–195.
Varela, F. J., & Singer, W. (1987). Neuronal dynamics in the visual corticothalamic pathway revealed through binocular rivalry. Experimental Brain Research, 66, 10–20.
Wales, R., & Fox, R. (1970). Increment detection thresholds during binocular rivalry suppression. Perception and Psychophysics, 8, 90–94.
Wheatstone, C. (1838). Contributions to the theory of vision. I: On some remarkable and hitherto unobserved phenomena of binocular vision. Philosophical Transactions of the Royal Society of London, 128, 371–394.
Whittle, P., Bloor, D. C., & Pocock (1968). Some experiments on figural effects in binocular rivalry. Perception and Psychophysics, 4, 183–188.
Yu, K., & Blake, R. (1992). Do recognizable figures enjoy an advantage in binocular rivalry? Journal of Experimental Psychology: Human Perception and Performance, 18, 1158–1173.

Received January 2, 1997; accepted October 27, 1997.
LETTER
Communicated by Carsten Peterson
Efficient Learning in Boltzmann Machines Using Linear Response Theory

H. J. Kappen
RWCP SNN Laboratory, Department of Biophysics, University of Nijmegen, NL 6525 EZ Nijmegen, The Netherlands
F. B. Rodríguez
Instituto de Ingeniería del Conocimiento y Departamento de Ingeniería Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain
The learning process in Boltzmann machines is computationally very expensive. The computational complexity of the exact algorithm is exponential in the number of neurons. We present a new approximate learning algorithm for Boltzmann machines, based on mean-field theory and the linear response theorem. The computational complexity of the algorithm is cubic in the number of neurons. In the absence of hidden units, we show how the weights can be directly computed from the fixed-point equation of the learning rules. Thus, in this case we do not need to use a gradient descent procedure for the learning process. We show that the solutions of this method are close to the optimal solutions and give a significant improvement when correlations play a significant role. Finally, we apply the method to a pattern completion task and show good performance for networks up to 100 neurons.

1 Introduction

Boltzmann machines (Ackley, Hinton, & Sejnowski, 1985) are networks of binary neurons with a stochastic neuron dynamics, known as Glauber dynamics. Assuming symmetric connections between neurons, the probability distribution over neuron states s will become stationary and will be given by the Boltzmann-Gibbs distribution P(s). The Boltzmann distribution is a known function of the weights and thresholds of the network. However, exact computation of P(s) or any statistics involving P(s), such as mean firing rates or correlations, requires exponential time in the number of neurons. This is due to the fact that P(s) contains a normalization term Z, which involves a sum over all states in the network, of which there are exponentially many. This problem is particularly important for Boltzmann machine learning because the Boltzmann machine learning rule requires the computation of correlations between neurons. Thus, learning in Boltzmann machines requires exponential time.

Neural Computation 10, 1137–1156 (1998) © 1998 Massachusetts Institute of Technology
For specific architectures, learning can be dramatically accelerated. For instance, Saul and Jordan (1994) discuss how learning times become linear in the number of neurons for treelike architectures. Kappen (1995) shows how strong inhibition between hidden neurons reduces the computation time to polynomial in the number of neurons. A well-known approximate method to compute correlations is the Monte Carlo method (Itzykson & Drouffe, 1989), which is a stochastic sampling of the state-space. Glauber dynamics is an example of such a method. The terms in the sum over states are proportional to a "Boltzmann factor" exp(−E). Monte Carlo methods can be more effective than the summation of all terms because the sampling is biased toward states with lower E. These terms will give the dominant contribution to the sum over states. This is the approach chosen for learning in the original Boltzmann machine (Ackley et al., 1985). Practical use requires that the Markov process converge sufficiently fast (in polynomial time) to the equilibrium distribution. This property is known as rapid mixing and probably does not hold in general for Glauber dynamics (Sinclair, 1993). Useful results can be obtained with Glauber dynamics when the network is not too large and has small weights. In Peterson and Anderson (1987), an acceleration method for learning in Boltzmann machines is proposed. They suggest replacing the correlations in the Boltzmann machine learning rule by the naive mean-field approximation: ⟨si sj⟩ = mi mj, where mi is the mean-field activity of neuron i. The mean fields are given by the solution of a set of n coupled mean-field equations, with n the number of neurons. The solution can be efficiently obtained by fixed-point iteration. The method was further elaborated in Hinton (1989). In this article, we will show that the naive mean-field approximation of the learning rules does not converge in general and explain why. Another way to speed up learning is to observe that the Kullback-Leibler divergence is bounded from above by an effective free energy expression using Jensen's inequality. Such an approach can be applied to architectures whose probability distribution does not contain a sum over all states for normalization, such as the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995) and the sigmoid belief network (Saul, Jaakkola, & Jordan, 1996). The application of such an approach to Boltzmann machines is not as simple because it requires in addition an upper bound on Z, which is computationally more complex (Jaakkola & Jordan, 1996). We will argue that in the correct treatment of mean-field theory for Boltzmann machines, the correlations can be computed using the linear response theorem (Parisi, 1988). In the context of neural networks, this approach was first introduced by Ginzburg and Sompolinsky (1994) for the computation of time-delayed correlations and later by Kappen (1997) for the computation of stimulus-dependent correlations. We will show that this approximation can be used successfully to approximate the gradients in the Boltzmann machine. In section 2, we introduce learning in Boltzmann machines and show
why the naive mean-field approximation of the gradients does not work. In section 3, we derive the mean-field approximation for the correlations based on the linear response theory. We argue that an effective self-coupling term can be included to obtain better results. In the absence of hidden units, the fixed-point equations for the learning rules can be solved directly in terms of the weights and thresholds of the network. In section 4, we show results of simulations. We compare our methods with the exact computation of the optimal weights and with a factorized probability model that assumes absence of correlations. We use the Kullback-Leibler divergence as a criterion for comparison on small networks. For large networks, this criterion can no longer be computed, because it requires exponential time. We propose an approximate criterion for comparison on large networks and show that it correlates well with the Kullback-Leibler divergence for small problems. Subsequently we show good performance of our method for increasing problem size.

2 Boltzmann Machine Learning

2.1 General Dynamics of Boltzmann Machines. The Boltzmann machine is defined as follows. The possible configurations of the network can be characterized by a vector s = (s1, ..., si, ..., sn), where si is the state of neuron i and n is the total number of neurons. Each neuron can be in two states (si = ±1), and its dynamics is governed by the following stochastic rule. At each time step, a neuron is selected at random. Its new value is determined as

$$s_i = \begin{cases} +1 & \text{with probability } g(h_i) \\ -1 & \text{with probability } 1 - g(h_i), \end{cases} \tag{2.1}$$

with g(hi) and the local field hi defined by

$$g(h_i) = \frac{1}{1 + \exp(-2\beta h_i)}, \qquad h_i = \sum_{j \neq i} w_{ij} s_j + \theta_i. \tag{2.2}$$
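A minimal sketch of this sequential update rule follows; the small random symmetric weight matrix is illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10
w = rng.normal(0.0, 0.3, (n, n))
w = (w + w.T) / 2.0                       # symmetric connections, w_ij = w_ji
np.fill_diagonal(w, 0.0)
theta = rng.normal(0.0, 0.1, n)
beta = 1.0                                # noise parameter; set to 1 as in the text

def g(h):
    return 1.0 / (1.0 + np.exp(-2.0 * beta * h))

s = rng.choice([-1.0, 1.0], size=n)
for step in range(20000):
    i = rng.integers(n)                   # pick a neuron at random
    h = w[i] @ s + theta[i]               # local field (diagonal of w is zero)
    s[i] = 1.0 if rng.random() < g(h) else -1.0
# after many updates, s is (approximately) distributed according to the
# Boltzmann distribution of equation 2.4
```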
The magnitude wij (weight) refers to the connection strength between neuron i and neuron j, and θi is the threshold for neuron i. The weights are chosen symmetric, wij = wji. The parameter β controls the noise in the neuron dynamics. β is often interpreted as β = 1/T, where T acts like the temperature of a physical system. Since β is just a scaling of the weights and the thresholds, and the latter are optimized through learning, we can set β = 1 without loss of generality. Let us define the energy of the system for a certain configuration s as

$$-E(\vec{s}) = \sum_{i<j} w_{ij} s_i s_j + \sum_i s_i \theta_i. \tag{2.3}$$
After long times, the probability of finding the network in a state s becomes independent of time (thermal equilibrium) and is given by the Boltzmann distribution

$$p(\vec{s}) = \frac{1}{Z} \exp\{-E(\vec{s})\}. \tag{2.4}$$
Here Z = Σs exp{−E(s)} is the partition function that normalizes the probability distribution.

2.2 Slow Learning in Boltzmann Machines. A learning rule for Boltzmann machines was introduced by Ackley et al. (1985). Let us partition the neurons into a set of nv visible units and nh hidden units (nv + nh = n). Let α and β label the 2^nv visible and 2^nh hidden states of the network, respectively. Thus, every state s is uniquely described by a tuple αβ. Learning consists of adjusting the weights and thresholds in such a way that the Boltzmann distribution on the visible units, pα = Σβ pαβ, approximates a target distribution qα as closely as possible. A suitable measure for the difference between the distributions pα and qα is the Kullback divergence (Kullback, 1959),

$$K = \sum_\alpha q_\alpha \log \frac{q_\alpha}{p_\alpha}. \tag{2.5}$$

It is easy to show that K ≥ 0 for all distributions pα and K = 0 iff pα = qα for all α. Therefore, learning consists of minimizing K using gradient descent, and the learning rules are given by Ackley et al. (1985) and Hertz, Krogh, and Palmer (1991):

$$\Delta\theta_i = \eta\left(\langle s_i \rangle_c - \langle s_i \rangle\right), \qquad \Delta w_{ij} = \eta\left(\langle s_i s_j \rangle_c - \langle s_i s_j \rangle\right), \quad i \neq j. \tag{2.6}$$

The parameter η is the learning rate. The brackets ⟨·⟩ and ⟨·⟩c denote the "free" and "clamped" expectation values, respectively. The "free" expectation values are defined as usual:

$$\langle s_i \rangle = \sum_{\alpha\beta} s_i^{\alpha\beta}\, p_{\alpha\beta}, \qquad \langle s_i s_j \rangle = \sum_{\alpha\beta} s_i^{\alpha\beta} s_j^{\alpha\beta}\, p_{\alpha\beta}. \tag{2.7}$$

The "clamped" expectation values are obtained by clamping the visible units in a state α and taking the expectation value with respect to qα:

$$\langle s_i \rangle_c = \sum_{\alpha\beta} s_i^{\alpha\beta}\, q_\alpha\, p_{\beta|\alpha}, \qquad \langle s_i s_j \rangle_c = \sum_{\alpha\beta} s_i^{\alpha\beta} s_j^{\alpha\beta}\, q_\alpha\, p_{\beta|\alpha}. \tag{2.8}$$
Here si^{αβ} is the value of neuron i when the network is in state αβ, and pβ|α is the conditional probability of observing hidden state β given that the visible state is α. Note that in equations 2.6 through 2.8, i and j run over both visible and hidden units. Thus, the Boltzmann machine learning rules contain clamped and free expectation values of the Boltzmann distribution. The computation of the free expectation values is intractable, because the sums in equations 2.7 consist of 2^n terms. If qα is given in the form of a training set of p patterns, the computation of the clamped expectation values (see equations 2.8) contains p·2^nh terms. This is intractable as well, but usually less expensive than the free expectation values. As a result, the Boltzmann machine learning algorithm cannot be applied to practical problems.

2.3 The Naive Mean-Field Approximation. Peterson and Anderson (1987) proposed an approximation to calculate the expectation values based on mean-field theory. In their approach, the free and clamped expectation values in equation 2.6 are approximated by their mean-field values,

$$\langle s_i \rangle \approx m_i, \qquad \langle s_i s_j \rangle \approx m_i m_j, \quad i \neq j, \tag{2.9}$$
where mi is the solution to the set of coupled mean-field equations,

$$m_i = \tanh\left(\sum_{j \neq i} w_{ij} m_j + \theta_i\right). \tag{2.10}$$
We will refer to this method as the naive mean-field approximation. In each step of the gradient descent procedure, one must solve the mean-field equations given by equation 2.10. This can be done quite easily using fixed-point iteration. In section 3, we will give more details about mean-field theory. Peterson and Anderson found that this method was 10 to 30 times faster than the Monte Carlo method. However, there are many data sets for which the naive mean-field approximation does not work. Here, we show the consequences of their approach in the case that there are no hidden units. Consider a network with only two visible neurons and no hidden neurons. We want to learn the probability distribution given by two patterns, (1, 1) and (−1, −1), with equal probability. Thus, ⟨s1⟩c = ⟨s2⟩c = 0 and ⟨s1 s2⟩c = 1. On this particular problem, the gradient descent procedure combined with the naive mean-field computation does not converge. The reason is simple. If we assume that the learning process converges to a fixed point (Δwij = 0 and Δθi = 0), then we obtain from equations 2.6 and 2.9

$$\langle s_i \rangle_c = m_i, \qquad \langle s_i s_j \rangle_c = m_i m_j, \quad i \neq j.$$
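The two-neuron example is easy to simulate. The sketch below runs plain gradient descent with exact expectation values (by enumerating all four states) and with the naive mean-field values; it omits the momentum term used for Figure 1, and the run length is illustrative.

```python
import numpy as np
from itertools import product

states = np.array(list(product([-1.0, 1.0], repeat=2)))    # the four joint states

def exact_stats(w12, th):
    # free expectation values from the Boltzmann distribution, equations 2.7
    E = -(w12 * states[:, 0] * states[:, 1] + states @ th)
    p = np.exp(-E)
    p /= p.sum()
    return p @ states, np.sum(p * states[:, 0] * states[:, 1])

def naive_mf_stats(w12, th):
    m = np.zeros(2)
    for _ in range(200):                  # fixed-point iteration of equation 2.10
        m = np.tanh(w12 * m[::-1] + th)
    return m, m[0] * m[1]

m_c, c_c = np.zeros(2), 1.0               # clamped statistics of the data set
for stats in (exact_stats, naive_mf_stats):
    w12, th, eta = 0.0, np.zeros(2), 0.1
    for t in range(500):
        m, c = stats(w12, th)
        th = th + eta * (m_c - m)         # learning rules, equation 2.6
        w12 = w12 + eta * (c_c - c)
    print(stats.__name__, w12, th)
# with the exact gradients the Kullback divergence decreases steadily; with
# the naive mean-field gradients the weight grows without the gradient ever
# vanishing, as in Figure 1
```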
Figure 1: Gradient descent learning. The network consists of two visible neurons and no hidden neurons. The target distribution q is given by two patterns, (1, 1) and (−1, −1), with equal probability. The solid line shows the evolution of the Kullback divergence and the different network parameters when the exact gradient descent method is used. The dotted line shows the evolution of the different network parameters when the naive mean-field approximation gradient descent procedure is used. Learning rate η = 0.1, momentum α = 0.9.
Thus, the fixed-point equations of the learning process combined with the naive mean-field approximation imply that the data set has no nontrivial correlations. In our example, this condition is clearly violated, since 0 = ⟨s1⟩c ⟨s2⟩c ≠ ⟨s1 s2⟩c = 1. Thus, we expect that if we use the naive mean-field approximation for the computation of the gradients, the resulting learning process will not converge. This is illustrated in Figure 1. We compare the exact gradient descent method, where the correlations are calculated using equations 2.7, and gradient descent using the naive mean-field approximation. Although the mean-field method sometimes reaches close to optimal solutions, the gradients in equations 2.6 are not zero at these points, and therefore the solution does not remain there. From this example, we conclude that the naive mean-field approximation leads to a converging gradient descent algorithm only when the data are such
that

$$\langle s_i s_j \rangle_c = \langle s_i \rangle_c \langle s_j \rangle_c, \quad i \neq j. \tag{2.11}$$
For i and j visible units, this is simply a property of the data. It is equivalent to the statement that the target probability distribution qα is factorized in all its variables: q(s) = ∏i qi(si). The quality of the naive mean-field approximation will depend on the extent to which equation 2.11 is violated. This conclusion holds regardless of whether the network has hidden units.

3 The Mean-Field Method and the Linear Response Correction

In this section we introduce an improved method to compute correlations within the mean-field framework. We will consider the mean-field approximation and its formulation in the first subsection. Then we will derive our main result based on the linear response theory. In the special case that the network has no hidden units, the optimal weights and thresholds can be computed directly from the fixed-point equations; that is, no gradient procedure needs to be applied.

3.1 Mean-Field Formulation. The basic idea of mean-field theory is to replace the quadratic term wij si sj in the energy of equation 2.3 by a term linear in si. Such a linearized form allows for efficient computation of the sums over all states, such as equations 2.7 and 2.8 and the partition function Z. We define the mean-field energy

$$-E_{mf}(\vec{s}) = \sum_i s_i \left(W_i + \theta_i\right), \tag{3.1}$$
where we introduce n mean fields Wi. The mean fields approximate the lateral interaction between neurons. The values of Wi must be chosen such that this approximation is as good as possible. How to do this will be shown below. We define the mean-field probability distribution as

$$p_{mf}(\vec{s}) = \frac{\exp\{-E_{mf}(\vec{s})\}}{Z_{mf}}, \tag{3.2}$$
with

$$Z_{mf} = \sum_{\vec{s}} \exp\{-E_{mf}(\vec{s})\} = \prod_i 2\cosh(\theta_i + W_i) \tag{3.3}$$

the mean-field partition function.
The expectation values for si and si sj in the mean-field approximation are given by

$$\langle s_i \rangle_{mf} \equiv \sum_{\vec{s}} s_i\, p_{mf}(\vec{s}) = \tanh(W_i + \theta_i) \equiv m_i, \tag{3.4}$$

$$\langle s_i s_j \rangle_{mf} \equiv \sum_{\vec{s}} s_i s_j\, p_{mf}(\vec{s}) = m_i m_j, \quad i \neq j, \tag{3.5}$$
where we have introduced the parameters mi, which are still to be fixed because of their dependence on Wi. The real partition function Z (see equation 2.4) can be computed in the mean-field approximation (Itzykson & Drouffe, 1989):

$$Z = \sum_{\vec{s}} \exp(-E) = \sum_{\vec{s}} \exp(-E_{mf} + E_{mf} - E) = Z_{mf} \left\langle \exp(E_{mf} - E) \right\rangle_{mf} \approx Z_{mf} \exp\!\left(\langle E_{mf} - E \rangle_{mf}\right) = Z_0. \tag{3.6}$$
The mean-field approximation is in the last step and is related to the convexity of the exponential function, ⟨exp f⟩ ≥ exp⟨f⟩ (Itzykson & Drouffe, 1989). Note that ⟨·⟩mf denotes expectation with respect to the mean-field distribution in equation 3.2 and not with respect to the Boltzmann distribution in equation 2.4. Therefore, the free energy in the mean-field approximation can be easily computed and is given by

$$-F = \log Z_0 = \sum_i \log\!\left(2\cosh(\theta_i + W_i)\right) - \sum_i W_i m_i + \sum_{i<j} w_{ij} m_i m_j. \tag{3.7}$$
We can calculate the mean fields Wi by minimizing the free energy:

$$\frac{\partial F}{\partial W_i} = \left(1 - m_i^2\right)\left(W_i - \sum_{j \neq i} w_{ij} m_j\right) = 0. \tag{3.8}$$
It can be shown that the solutions mi² = 1 maximize F. The required minima are therefore given by Wi = Σj≠i wij mj, which, combined with equation 3.4, gives the mean-field equations in equation 2.10. These equations can be solved for mi in terms of wij and θi using fixed-point iteration. The mean fields Wi can then be directly computed using equation 3.8.
3.2 Derivation of Linear Response Correction. We can go beyond the naive mean-field prediction ⟨si sj⟩mf = mi mj of equation 3.5 in the following way. First, observe that the mean firing rates and correlations are

$$\langle s_i \rangle = \frac{1}{Z}\frac{dZ}{d\theta_i} \approx \frac{1}{Z_0}\frac{dZ_0}{d\theta_i}, \qquad \langle s_i s_j \rangle \approx \frac{1}{Z_0}\frac{d^2 Z_0}{d\theta_i\, d\theta_j}. \tag{3.9}$$
We will compute these quantities using the approximation in equation 3.6. While computing dZ/dθi using equation 3.7, we must be aware that the mean fields Wi depend on θi through equations 2.10 and 3.8:

$$\langle s_i \rangle \approx \frac{d \log Z_0}{d\theta_i} = \left(\frac{\partial}{\partial \theta_i} + \sum_j \frac{\partial W_j}{\partial \theta_i}\frac{\partial}{\partial W_j}\right) \log Z_0 = m_i, \tag{3.10}$$

$$\langle s_i s_j \rangle \approx \frac{1}{Z_0} \frac{d}{d\theta_j}\left(Z_0\, m_i\right) = m_i m_j + A_{ij}, \tag{3.11}$$
with Aij = dmi/dθj. The last step in equation 3.10 follows when we use the mean-field equations in equation 3.8. Thus, there are no linear response corrections to the mean firing rate. Equation 3.11 is known as the linear response theorem (Parisi, 1988). The inverse of the matrix A can be directly obtained by differentiating equation 2.10 with respect to θj. The result is

$$\left(A^{-1}\right)_{ij} = \frac{\delta_{ij}}{1 - m_i^2} - w_{ij}. \tag{3.12}$$
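In code, the free-phase estimate amounts to a few lines; the random symmetric weights below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
w = rng.normal(0.0, 0.1, (n, n))
w = (w + w.T) / 2.0
np.fill_diagonal(w, 0.0)
theta = rng.normal(0.0, 0.1, n)

m = np.zeros(n)
for _ in range(500):                       # fixed-point iteration of equation 2.10
    m = np.tanh(w @ m + theta)

A_inv = np.diag(1.0 / (1.0 - m ** 2)) - w  # equation 3.12
A = np.linalg.inv(A_inv)                   # the O(n^3) step
corr = np.outer(m, m) + A                  # <s_i s_j> estimate, equation 3.11
```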
When the network is divided into visible and hidden units, the above approximation can be directly applied to the computation of the free expectation values in equations 2.7. When the visible units are clamped, the above derivation can be repeated to compute the expectation values for the hidden units. The only difference is that the thresholds θi for the hidden units receive an extra contribution from the clamped visible neurons. Let us assume that the visible units are clamped in state α. The mean firing rates of the hidden units are denoted by ⟨si⟩α = mi^α, i ∈ H, where the mi^α satisfy the mean-field equations

$$m_i^\alpha = \tanh\left(\sum_{j \in H} w_{ij} m_j^\alpha + \sum_{j \in V} w_{ij} s_j^\alpha + \theta_i\right), \quad i \in H. \tag{3.13}$$
V and H denote the subsets of visible and hidden units, respectively. Note that mi^α depends on the clamped state α. The correlations ⟨si sj⟩α are given as follows:

$$i, j \in H: \quad \langle s_i s_j \rangle^\alpha = m_i^\alpha m_j^\alpha + A_{ij}^\alpha \tag{3.14}$$
$$i \in V,\ j \in H: \quad \langle s_i s_j \rangle^\alpha = s_i^\alpha m_j^\alpha \tag{3.15}$$
$$i, j \in V: \quad \langle s_i s_j \rangle^\alpha = s_i^\alpha s_j^\alpha \tag{3.16}$$
$$\left(A^{\alpha,-1}\right)_{ij} = \frac{\delta_{ij}}{1 - (m_i^\alpha)^2} - w_{ij}. \tag{3.17}$$

Finally, the clamped expectation values are given by taking the expectation value over qα: ⟨si⟩c = Σα ⟨si⟩α qα and ⟨si sj⟩c = Σα ⟨si sj⟩α qα.
Thus, our approximation consists of replacing the clamped and free expectation values in equations 2.6 by their linear response approximations. Equations 2.10 and 3.10 through 3.12 and equations 3.13 through 3.17 define the linear response approximations in the free phase and the clamped phase, respectively. The complexity of the method is dominated by the computations in the free phase. The computation of the linear response correlations involves the inversion of the matrix A, which requires O(n³) operations. The computation of the mean firing rates through fixed-point iteration of equation 2.10 requires O(n²) or O(n² log n) operations, depending on whether fixed precision is required in the components of mi or in the vector norm Σi mi². Thus, the full mean-field approximation, including the linear response correction, computes the gradients in O(n³) operations.

3.3 TAP Correction to the Mean-Field Equations. It is well known that the standard mean-field description (see equation 3.7) is inadequate for the description of frustrated systems. In general, terms involving higher powers of the coupling matrix wij must be included. For example, for the Sherrington-Kirkpatrick (SK) model, the appropriate mean-field free energy becomes (Thouless, Anderson, & Palmer [TAP], 1977)

$$-F = \sum_i \log\!\left(2\cosh(\theta_i + W_i)\right) - \sum_i W_i m_i + \frac{1}{2}\sum_{i,j} w_{ij} m_i m_j + \frac{1}{4}\sum_{i,j} w_{ij}^2 \left(1 - m_i^2\right)\left(1 - m_j^2\right), \tag{3.18}$$
and the corresponding mean-field equations become the TAP equations:

$$m_i = \tanh\left(\sum_{j \neq i} w_{ij} m_j + \theta_i - m_i \sum_{j \neq i} w_{ij}^2 \left(1 - m_j^2\right)\right). \tag{3.19}$$
The additional term is called the Onsager reaction term (Onsager, 1936). It describes how the mean firing of neuron i affects the polarization of the surrounding spins and thus affects the local field of spin i. The effect of this additional term, but in the absence of the linear response correction, was studied by Galland (1993). In general, there is an infinite sum of terms, each involving a higher power of the couplings wij (Fischer & Hertz, 1991). It is interesting to note that all higher-order terms in the fixed-point equation are proportional to mi and thus represent corrections to the self-coupling term. In the case of the SK model, it can be shown that all terms beyond the Onsager term are negligible (Plefka, 1982). (For unfrustrated systems, like the Ising model, the Onsager term itself is negligible.) One can obtain the linear response corrections for TAP and higher-order mean-field corrections in a similar way, as described (by variation around
Efficient Learning Using Linear Response Theory
1147
the TAP equations). These extensions will be explored in a future publication. In this article, we restrict ourselves to the linear response corrections to the lowest-order mean-field equations and ignore higher-order corrections. However, we will consider the effect of an effective self-coupling term wii mi . The mean-field equations (see equation 2.10) become X wij mj + θi , mi = tanh
(3.20)
j
where the sum now includes the diagonal term. The derivation of the linear response correction is unaltered, except that wij now has nonzero diagonal terms (e.g., in equation 3.12). We propose to fix the value of wii through learning. We will demonstrate that the inclusion of the self-coupling term is (1) beneficial to obtain a closed-form solution for the learning problem in the absence of hidden units and (2) gives significantly better results than without the self-coupling term. 3.4 No Hidden Units. For the special case of a network without hidden units and with the effective self-coupling, we can make significant simplifications. In this case, the gradients in equations 2.6 can be set equal to zero and can be solved directly in terms of the weights and thresholds; no gradientbased learning is required. First note that hsi ic and hsi sj ic can be computed exactly from the data for all i and j. Let us define Cij = hsi sj ic − hsi ic hsj ic . The fixed-point equation for 1θi gives 1θi = 0 ⇔ mi = hsi ic .
(3.21)
The fixed-point equation for 1wij , using equation 3.21, gives 1wij = 0 ⇔ Aij = Cij i 6= j.
(3.22)
Because we have introduced n self-coupling parameters, we must specify n additional constraints. An obvious choice is to ensure that hs2i i = 1 is also true in the linear response approximation: 1 = hs2i ilr = m2i + Aii ⇔ Aii = Cii . Then equation 3.22 is equivalent to (A−1 )ij = (C−1 )ij if C is invertible. Using equation 3.12, we obtain wij =
δij − (C−1 )ij . 1 − m2i
(3.23)
In this way we have solved mi and wij directly from the fixed-point equations. The thresholds θi can now be computed from equation 2.10: X wij mj . (3.24) θi = tanh−1 (mi ) − j
1148
H. J. Kappen and F. B. Rodr´ıguez
Note that this method does not require fixed-point iterations to obtain mean firing rates mi in terms of wij and θi . Instead, the “inverse” computation of θi given mi and wij is required in equation 3.24. Note also that the thresholds depend on the diagonal weights. The solution of the example task of two neurons discussed in section 2.3 is computed in the appendix. Although the choice of constraint is particularly convenient, keep in mind that in principle other choices could be made, leading to other solutions. The justification for our choice is that it gives a closed-form solution of high quality, as we will show. 4 Results In this section we will compare the accuracy of the linear response correction with and without self-coupling with the exact method and with a factorized model that ignores correlations. We restrict ourselves to networks without hidden units. Of course, there are many probability estimation problems, for which the Boltzmann machine without hidden units is a poor model. Our main concern is whether the linear response approximation will give a solution sufficiently close to the optimal solution, not whether the optimal solution is good or bad. The correct way to compare our method to the exact method is by means of the Kullback divergence. However, this comparison can be done only for small networks. The reason is that the computation of the Kullback divergence requires the computation of the Boltzmann distribution, (see equation 2.4), which requires exponential time due to the partition function Z. In addition, the exact learning method requires exponential time. The comparison by Kullback divergence on small problems is the subject of section 4.1. For networks with a large number of units, we will demonstrate the quality of the linear response method by means of a pattern completion task; the network must be able to generate the rest of a pattern when part of the pattern is shown. The comparison of pattern completion on larger problems is the subject of section 4.2. 4.1 Comparison Using Kullback Divergence. In order to show the performance of the linear response correction, we have compared it with the results obtained with a factorized model and with the exact method. For the exact method (ex) we have used conjugate gradient. The mean firing rates and correlations are computed using equations 2.7. For the linear response method without self-coupling term (lr0) we have solved the fixed-point equations 3.22 for i 6= j using least squares and the LevenbergMarquardt method. The matrix A is given by equation 3.12 with wii = 0. For the linear response method with self-coupling (lr) we obtain the weights and thresholds from equations 3.23 and 3.24. This method can be applied when det(C) > 0. When det(C)=0, we have solved the fixed-point equations 3.22
Efficient Learning Using Linear Response Theory
1149
3.5 K
ex
Klr
3
Klr0 K
mf
Kullback
2.5
2
1.5
1
0.5
0
1
7
10
20
30
40
50
64
Number of Patterns
Figure 2: Average Kullback divergence over five random training sets as a function of the number of patterns in the training set. The network consists of six neurons.
for all i, j using least squares and the Levenberg-Marquardt method. The matrix A is given by equation 3.12 with wii free parameters. In the case of the factorized model, we assume pmf (Es) =
Y1 i
2
(1 + si mi ).
(4.1)
The mean firing rates are given by mi = hsi ic . The four methods are compared by computing the Kullback divergence, using equation 2.5. In Figure 2, we present the results for a network of six neurons. The number of patterns in the training set is varied from p = 1 until p = 64. For each p, five data sets were randomly generated. Each of the p patterns in the data set is assigned a random probability such that the total probability on the p patterns sums to 1. The lr method used least-squares minimization for 2 ≤ p ≤ 6. For the methods lr0 and lr, we observed for 2 ≤ p ≤ 6 in approximately 10% of the cases that the fixed-point equations could not be solved. This can happen because the equations are approximations to the true gradients and therefore do not need to have a fixed-point solution. These cases were deleted from the computation of the average Kullbacks in Figure 2.
1150
H. J. Kappen and F. B. Rodr´ıguez 2 1.8 1.6
,
1.4 1.2 1 0.8 0.6 0.4 0.2 0 3
4
5
6 7 Number of neurons
8
9
10
Figure 3: Kullback divergence relative to exact method, for mean-field approximation (open circles) and linear response method with self-coupling (closed circles). The number of patterns p = 2n. Results are averaged over four data sets. The error bars indicate the variance over the data sets.
We see that the exact method approaches the target distribution (K = 0) for very small number of patterns and for p → 2n . For p = 1, the correlations in the target distribution are absent, and all methods yield Kullback zero. For p → 2n the factorized model approaches the exact model. This is because the target distribution becomes more or less constant over all patterns, and correlations are absent in the constant distribution. The most difficult learning tasks are for low and intermediate values of p. The difference between Kmf and Kex shows that correlations play a significant role. The linear response solutions with and without the self-coupling term give a significant improvement. Linear response with a self-coupling term gives the best approximation. In the remaining numerical studies, we will consider only the linear response method with self-coupling. We compare the performance of the various methods on networks with 3 to 10 neurons in Figure 3. For each problem size, training data were ranµ domly generated with p = 2n. Each neuron value si = ±1, i = 1, . . . , n, µ = 1, . . . , p is generated randomly and independently with equal probability. For each data set, we compute Klr −Kex and Kmf −Kex . In the figure, we show these values averaged over all data sets, as well as their variances. From the difference between Kex and Kmf , we see that correlations play an increasingly important role. The linear response approximation is often quite close to the exact result. The quality of the approximation does not deteriorate with increasing problem size.
Efficient Learning Using Linear Response Theory
1151
4.2 Comparison on Pattern Completion. In this subsection, we demonstrate the quality of the linear response method for larger networks. As we mentioned above, this cannot be done by comparison of the Kullback divergence. Therefore, we propose to compare the different methods on n pattern completion tasks. We first train the networks as before, as if the problem were a joint probability estimation problem—with no distinction between input and output. Subsequently, we measure the quality of the different solutions by computing Q=−
1 X µ µ µ µ µ µ µ log(p(si |˜si )), s˜i = (s1 , . . . , si−1 , si+1 , . . . , sn ) np iµ
(4.2)
µ µ
The quantity p(si |˜si ) is the conditional probability of finding neuron i in µ µ the state si , given that the rest of the state is s˜i . We can do this for the exact method (for small networks) for the linear response method and for the factorized model. Note that the computation of Q is fast because it does not require the computation of the partition function. In order to use Q to assess the quality of the various methods, we must establish that low Q implies low Kullback divergence K, and vice versa. This is shown in Figure 4. The left graph shows for the linear response solutions and for the factorized model solutions separately that there is a more or less linear relation between the quality in terms of K and in terms of Q. In the right graph, we show for the same data sets the difference in pattern completion quality, Qmf − Qlr , versus the difference in Kullback divergence, Kmf − Klr . From this we see that if one method has a lower Q than another method, we can expect that its Kullback divergence is lower as well. Thus, one can use the more or less linear relation between Q and K to test the performance of the linear response method for problems with a large number of neurons. In Figure 5, we show the pattern completion quality for the different methods as a function of the network size. The exact method was computed only up to 10 neurons because of the time required. (Depending on the stop criterion, the exact method requires approximately 10 to 30 minutes on a network of 10 neurons on a SPARC 5.) We can see that the linear response method is very close to the exact method. The much higher value of the factorized model indicates the obvious fact that correlations play an important role in this task. Note that the mean-field method approaches q = log 2 for large n, which is due to the fact that the mean-field µ method assigns p(si ) ≈ 12 (mi ≈ 0) for all i and µ. 5 Discussion We have proposed a new, efficient method for learning in Boltzmann machines. The method is generally applicable to networks with or without
H. J. Kappen and F. B. Rodr´ıguez 0.7
0.7
0.6
0.6
0.5
0.5
0.4
0.4
Qmf−Qlr
Qlr, Qmf
1152
0.3
0.3
0.2
0.2
0.1
0.1
0 0
0.5
1
1.5
Kullback Distance (Kmf, Klr)
2
0 0
0.5
1
1.5
2
Kullback Distance (Kmf−Klr)
Figure 4: Variation of the pattern completion quality Q with respect to the Kullback divergence K, for 200 data sets on six neurons. Each data set consists of 10 patterns. In the left graph, the plus signs represent the linear response method and the open circles represent the factorized model. In the right graph we plot the difference between the two pattern completion qualities (Qmf − Qlr ) versus the difference of the Kullback divergence (Kmf − Klr ) for the same data sets.
hidden units. It makes use of the linear response theorem for the computation of the correlations within the mean-field framework. In our numerical experiments, we restricted ourselves to networks without hidden units. We believe that this is sufficient to show the advantage of the method, since the free expectation values are the most time-consuming part of the computation. We have observed numerically that the inclusion of self-coupling is important for good results. This is probably also true in the presence of hidden units. In that case, a gradient-based procedure is required, and no closedform solution exists. The presence of self-coupling was motivated from the TAP equations. A full treatment of the linear response correction in this case is the subject of a future publication. In the presence of hidden units, both the exact method and the linear response method require a gradient descent algorithm. The advantage of our method is that the gradients can be computed in O(n3 ) instead of in O(2n ), time. The required number of iterations may be somewhat more for the linear response method, because the gradients are computed only approximately.
Efficient Learning Using Linear Response Theory
1153
0.7 0.65 0.6
Qex, Qlr, Qmf
0.55 0.5 0.45 0.4 0.35 0.3 0.25
2
3
4
5
6 7 8 9 10
20
50
100
Number of neurons
Figure 5: Prediction quality for 27 different random problems with different number of neurons. In every problem, the number of patterns p = 2n. The plus signs represent the linear response correction (Qlr ). The open circles represent the factorized model (Qmf ). The closed circles represent the exact method (Qex ).
This brings us to an interesting point, which is the convergence of the gradient descent algorithm in the linear response approximation. Convergence requires the existence of a Lyapunov function. The Kullback divergence is clearly a Lyapunov function for the exact method, but we were not able to find a Lyapunov function for the linear response approximation. In fact, we would like to construct a cost function such that its gradients are equal to the gradients of K in the linear response approximation. Whether such a function exists is unknown to our knowledge. In addition to probability estimation, Boltzmann machines have been proposed for combinatoric optimization (Hopfield & Tank, 1985; Durbin & Willshaw, 1987; Yuille & Kosowsky, 1994). For optimization the naive mean-field framework can be successfully applied to combinatoric optimization problems (Yuille, Geiger, & Bulthoff, ¨ 1991; Kosowsky & Yuille, 1994). This method is known as deterministic annealing. Clearly the situation is different here, since one is mainly concerned with the quality of the solution at the end of the annealing schedule—when T → 0. Correlations vanish in this limit in unfrustrated systems but can be quite complex in spin glasses (see, for instance, Young, 1983, for numerical results). Whether the linear response correction can improve deterministic annealing is an open question. The naive mean-field approach arises as a special case of the variational techniques that have been recently proposed. Whether the linear response
1154
H. J. Kappen and F. B. Rodr´ıguez
correction can be applied in this context as well should be investigated further. Appendix In this appendix we illustrate the consequences of the linear response method for the simple case of two neurons, considered numerically in section 2.3. The general probability distribution in two neurons is parameterized by three numbers. Consider the symmetric case where hs1 i = hs2 i. Then only two parameters are needed, which we choose such that 1 (1 + m) − a 2 p(+, −) = p(−, +) = a 1 p(−, −) = (1 − m) − a. 2
p(+, +) =
We must require that 0 < a < 12 and 2a − 1 < m < 1 − 2a to ensure that all probabilities are positive. In this parameterization hs1 s2 i = 1 − 4a and hs1 i = hs2 i = m. The special case of section 2.3 is obtained for m = a = 0. The matrix C as defined in section 3.4 is given as ¶ µ 1 − 4a − m2 1 − m2 . C= 1 − 4a − m2 1 − m2 Equation 3.23 gives directly à −1 + 1 1 − m2 − 4a w= 1 8a 1 − m2 − 2a
4a 1−m2
1 −1 +
! 4a 1−m2
,
and the thresholds are computed using equation 3.24. Note that the diagonal weights play an important role in the computation of the thresholds. One can also compute the optimal weights and thresholds using the exact method. Setting 1wij = 0 and 1θi = 0 in equation 2.6, we obtain µ
¶ (1 − 2a)2 − m2 . w12 = log 4a2 ¶ µ 1 m . θi = tanh−1 2 1 − 2a The differences are illustrated for m = 0.1 and m = 0.5 for all allowed values of a in Figure 6. Note that the linear response approximation is very good when the optimal weights are small. For larger weights, the difference between the two methods increases.
Efficient Learning Using Linear Response Theory
1155
Figure 6: Examples of lateral connection and threshold(s) obtained by exact method and linear response method (LR) for a network of two neurons with m = 0.1 and m = 0.5.
Acknowledgments We thank the anonymous referees for valuable suggestions for improvement on earlier versions of this article. References Ackley, D., Hinton, G., & Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169. Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904. Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689–691. Fischer, K., & Hertz, J. (1991). Spin glasses. Cambridge: Cambridge University Press.
1156
H. J. Kappen and F. B. Rodr´ıguez
Galland, C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4, 355–380. Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Physical Review E, 50, 3171–3191. Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley. Hinton, G. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150. Hopfield, J., & Tank, D. (1985). Neural computation of decision in optimization problems. Biological Cybernetics, 52, 141–152. Itzykson, C., & Drouffe, J.-M. (1989). Statistical field theory. Cambridge: Cambridge University Press. Jaakkola, T., & Jordan, M. (1996). Recursive algorithms for approximating probabilities in graphical models. (MIT Computational Cognitive Science Tech. Rep. No. 9604). Cambridge, MA: MIT. Kappen, H. (1995). Deterministic learning rules for Boltzmann machines. Neural Networks, 8, 537–548. Kappen, H. (1997). Stimulus dependent correlations in stochastic networks. Physical Review E, 55, 5849–5858. Kosowsky, J., & Yuille, A. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 3, 477–490. Kullback, S. (1959). Information theory and statistics. New York: Wiley. Onsager, L. (1936). Electric moments of molecules in liquids. Journal of the American Chemical Society, 58, 1486–1493. Parisi, G. (1988). Statistical field theory. Reading, MA: Addison-Wesley. Peterson, C., & Anderson, J. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019. Plefka, T. (1982). Convergence condition of the TAP equation for the infiniterange Ising spin glass model. Journal of Physics A, 24, 2173. Saul, L., Jaakkola, T., & Jordan, M. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76. Saul, L., & Jordan, M. (1994). Learning in Boltzmann trees. Neural Computation, 6, 1174–1184. Sinclair, A. (1993). Algorithms for random generation and counting: A Markov chain approach. Basel: Birkh¨auser. Thouless, D., Anderson, P., & Palmer, R. (1977). Solution of “solvable model of a spin glass.” Philosophical Magazine, 35, 593–601. Young, A. (1983). Direct determination of the probability distribution for the spin-glass order parameter. Physical Review Letters, 51, 1206–1209. Yuille, A., Geiger, D., & Bulthoff, ¨ H. (1991). Stereo integration, mean field theory and psychophysics. Network, 2, 423–442. Yuille, A., & Kosowsky, J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356. Received March 27, 1997; accepted October 23, 1997.
LETTER
Communicated by Donald Geman
A Learning Theorem for Networks at Detailed Stochastic Equilibrium Javier R. Movellan Cognitive Science Department, University of California at San Diego, La Jolla, CA 92093, U.S.A.
This article analyzes learning in continuous stochastic neural networks defined by stochastic differential equations (SDE). In particular, it studies gradient descent learning rules to train the equilibrium solutions of these networks. A theorem is given that specifies sufficient conditions for the gradient descent learning rules to be local covariance statistics between two random variables: (1) an evaluator that is the same for all the network parameters and (2) a system variable that is independent of the learning objective. While this article focuses on continuous stochastic neural networks, the theorem applies to any other system with Boltzmann-like equilibrium distributions. The generality of the theorem suggests that instead of suppressing noise present in physical devices, a natural alternative is to use it to simplify the credit assignment problem. In deterministic networks, credit assignment requires an evaluation signal that is different for each node in the network. Surprisingly, when noise is not suppressed, all that is needed is an evaluator that is the same for the entire network and a local Hebbian signal. This modularization of signals greatly simplifies hardware and software implementations. The article shows how the theorem applies to four different learning objectives that span supervised, reinforcement, and unsupervised problems: (1) regression, (2) density estimation, (3) risk minimization, and (4) information maximization. Simulations, implementation issues, and implications for computational neuroscience are discussed. 1 Introduction This article studies how to train equilibrium solutions of continuous stochastic neural networks. The networks proposed are specified by stochastic differential equations (SDE), an extension of ordinary differential equations that incorporates probabilistic dynamics. The article illustrates how these networks can be optimized so that their equilibrium distributions exhibit desired properties. From a more general point of view, the article is also a theoretical analysis of learning in systems that exhibit Boltzmann equilibrium distributions, regardless of whether this distribution is obtained using SDE models or c 1998 Massachusetts Institute of Technology Neural Computation 10, 1157–1178 (1998) °
1158
Javier R. Movellan
other methods (see Neal, 1993, and Gidas, 1986, for a review of methods to generate Boltzmann-type equilibrium distributions). As such, the article generalizes the original Boltzmann machine learning algorithm (Ackley, Hinton, & Sejnowski, 1985) to a very wide variety of architectures and learning criteria. The article proposes sufficient conditions for the gradients of the cost functions minimized during learning to be local covariance statistics between two random variables: (1) an evaluator variable that is the same for all the network parameters and (2) a local system variable that is independent of the particular cost function being minimized. This factorization avoids backpropagation of error signals specific to each network parameter, greatly simplifying hardware and software implementations. The analysis presented here suggests that probabilistic dynamics may play an integral part of learning in natural nervous systems by simplifying the solution to the credit assignment problem. Learning with stochastic networks has played an important theoretical role in the neural network literature (Cowan, 1968; Geman & Geman, 1984; Ackley et al., 1985; Smolensky, 1986), but curiously, even though most neural network applications use continuous representations, learning in the continuous stochastic case has seldom been studied. Analyzing the continuous stochastic case is important for the following reasons: • Many natural signals, like pixel gray-level object positions and orientations, are well described as continuous random processes. Experience shows that some practical applications benefit from the use of a continuous stochastic framework (Isard & Blake, 1996). • Randomness is essential when modeling natural computation because of the intrinsic variability of natural hardware. It has been proposed that a unified theory of cognition and neural computation should be based on models that are random, continuous, and interactive (McClelland, 1993). • Current digital technology suppresses structural and thermodynamic noise in hardware devices by creating high-energy barriers between states, thus requiring relatively high power supplies (Andreou, 1994; Landauer, 1992; Mead & Conway, 1980). This approach does not work with supplies on the order of 0.1 volt, characteristic of natural computers. As we move toward low-voltage systems, the computing environment becomes analog and stochastic, the realm of stochastic diffusion models. • There are existing VLSI implementations of continuous stochastic neural networks, and it is important to have a formal framework to understand the kind of things that we can do with them (Alspector, Jayakumar, & Luna, 1992).
A Learning Theorem
1159
This article focuses on the problem of optimizing networks for tasks where what matters is the stable distribution of responses (e.g., image completion) rather than the paths leading to those responses. For generality, the framework and the results are presented in an abstract manner. For concreteness, we show how to construct and train an SDE version of the continuous Hopfield model (Hopfield, 1984). Sections 2 and 3 introduce SDEs and discuss analytical solutions for their equilibrium distributions. Section 4 presents a theorem applicable to a general class of models and cost functions. When this theorem applies, gradient descent learning rules take the form of local covariance statistics between evaluators and system variables. Section 5 applies the theorem to derive evaluator variables for the following supervised, reinforcement and unsupervised problems: (1) regression, (2) density estimation, (3) risk minimization, and (4) information maximization. The theorem applies to any dynamical system that exhibits Boltzmann equilibrium distributions; however, the main focus of this article is on the application of the theorem to neural networks specified via SDEs. As such, section 6 derives the system variables for a stochastic version of the continuous Hopfield model. Sections 7 and 8 present simulations and discuss implications of this approach. 2 Introduction to SDEs The theory of SDEs is a well-known formalism for describing continuous strong Markov processes (Karatzas & Shreve, 1988, p. 81). SDEs are commonly used to model the effects of noise in electric circuits, computer networks, and control systems (Oksendal, 1992; Borkar, 1989). In the cognitive modeling literature, SDEs have been used to model human reaction time distributions (Ratcliff, 1979) and to illustrate the principles of random, graded, and interactive propagation of information (McClelland, 1993). In the neural network literature, SDE models have been used explicitly or implicitly to describe single neuron activity (Gerstein & Mandelbrot, 1964; Ricciardi, 1977; Hanson & Tuckwell, 1983), small pools of neurons (Matsuyama, Shirai & Akizuki, 1974) and noisy neural networks (Zipser, 1991; Ohira & Cowan, 1995). The solutions of SDE equations are known as “diffusion processes” because they can be thought of as the mathematical description of the motion of small particles in a moving fluid (Karatzas & Shreve, 1988; Oksendal, 1992). Hereafter we will refer to the general class of neural network models specified by SDEs as diffusion networks. From a formal point of view, diffusion networks are continuous hidden Markov models (HMM). In discrete-state HMMs, standard in current automatic speech recognition systems (Rabiner, 1989), the system dynamics are explicitly defined by a matrix of state transition probabilities and a matrix of output probabilities conditional on each hidden state. In diffusion networks, the state and observation probabilities are implicitly defined by a drift function, which is the deterministic kernel
1160
Javier R. Movellan
of the system (e.g., as for a deterministic neural network) and a dispersion function, which controls the level of uncertainty in the system. More specifically, diffusion networks are specified by the following SDE dYtλ = µ(λ, Ytλ , X) dt + σ (Ytλ ) dWt , Y0 = ν ,
(2.1) (2.2)
where {Ytλ ; λ ∈ Rp }t∈[0,∞) is a continuous Rn -valued random process representing the state of the n nodes in the network. X is an Rm valued random vector representing the input. It has a known probability density fX fixed by the environment. ν is an Rn valued random variable representing the initial conditions. σ : Rn → Rn ⊗ Rn is a matrix function called the dispersion. λ ∈ Rp is a vector of adaptive parameters (e.g., coupling weights between units). µ: Rp × Rn × Rm → Rn is a function known as the drift. The drift can be interpreted as the deterministic kernel of a neural network with vector parameter λ (see section 6). {Wt }t∈[0,∞) is an N-dimensional Wiener process, a mathematical model of Brownian motion (Papoulis, 1991, p. 346). The process is assumed independent of {Y0λ }λ∈Rp and X. To simplify the presentation, we will assume that our formal objects are mathematically well behaved. For example, if we take a partial derivative of a function, we implicitly assume that such a derivative exists. 3 Stochastic Equilibrium This article focuses on the probability densities of diffusion networks at stochastic equilibrium, the densities in Rn induced by the random variables λ , lim Ytλ }λ∈Rp , {Y∞ t→∞
(3.1)
where , stands for “defined.” Under conditions to be detailed later, such a limit exists in distribution and is independent of the initial conditions. In this article, we care about the diffusion dynamics only to the extent that they lead to an equilibrium solution; thus we will ease the notation by dropping the λ , and time index in the underlying process. For example, Yλ will stand for Y∞ λ Yi for its ith component. Moreover, we will denote the space of continuous λ represent the densities on Rn w.r.t the Lebesgue measure as Dn . Let fY|X
A Learning Theorem
1161
density induced by Yλ in response to input X. It is well known that this density uniquely satisfies the following equilibrium condition:1 ∇y · Jλ (y, x) = 0 , λ
J (y, x) ,
(3.2)
λ fY|X (y
| x) V (y, x) ,
Viλ (y, x) , µi (λ, y, x) − 2 (y) σi,j
λ
1 2
n X j=1
´ , σ (y) σ (y) , ³
2 σi,j (y)
(3.3) ³ ´ ∂ λ 2 log fY|X (y | x) σi,j (y) , (3.4) ∂yj
T
i,j
(3.5)
where Jλ : Rn × Rm → Rn is the equilibrium current, V λ : Rn × Rm → Rn the equilibrium velocity, ∇y · the divergence with respect to y, and ∇y the gradient with respect to y. For the purposes of this article, a diffusion network with parameter λ is a deterministic mapping Nλ : Rm → Dn , from input space into the space of densities on Rn . Equation 3.2 defines the mapping implicitly. Making it explicit is difficult in general; however, there is a special case with a wellknown q solution, the focus of this article. In particular, let β > 0 such that σ (y) = β2 In , where In is the n×n identity matrix, and let {U: Rp ×Rn ×Rm → R} be a function such that µ(λ, y, x) = −∇y U(λ, y, x) .
(3.6)
In this case it is easy to verify that the Boltzmann density 1 exp [ −β U(λ, y, x) ] , Zλ (x) Z exp [ −β U(λ, y, x) ]dy, Zλ (x) ,
λ (y | x) = fY|X
Rn
(3.7) (3.8)
satisfies equation 3.2. For equation 3.7 to be well defined, the integral in equation 3.8 needs to exist. Gidas (1986, p. 190) and Geman and Hwang (1986) specify sufficient conditions on U(λ, y, x) for this. Note that in fact equation 3.7 makes the velocity V λ (y, x) zero everywhere, a condition sufficient but not necessary to satisfy equation 3.2 and which is known as detailed balance (Poggio & Girosi, 1994). The function U(λ, y, x) is sometimes known as the potential, the energy, or the dissonance. When its sign is changed, it is known as the harmony (Smolensky, 1986). 1 This condition easily follows from the Kolmogorov forward equation (Oksendal, 1992, p. 127), assuming nonzero equilibrium density everywhere and positive definite dispersion. This simple transformation of the forward Kolmogorov equation ties up nicely with mathematical physics and emphasizes the idea of probability currents moving throughout the state-space.
1162
Javier R. Movellan
4 Learning Equation 3.7 defines a family of mappings {Nλ : Rm → Dn }λ∈Rp , and equation 2.1 tells us how to implement these mappings using diffusion networks. The learning task is that of finding members of the family that are optimal in some specific sense. In general we may care about only the distribution of a subset of d < n system variables, which we call the observables. In such a case, we divide the state random variables Yλ into observable and hidden components: Yλ = (Oλ , Hλ ). • {Oλ : Ä → Rd }λ∈Rp is called the observable component. • {Hλ : Ä → Rn−d }λ∈Rp is called the hidden component. Hidden variables are important to allow the marginal density of observables to be non-Boltzmann. In this section we propose a learning theorem that applies to two different cases: one typical in the neural network literature and the other in the stochastic filtering literature. Case 1: The joint state process (i.e., the observable process and the hidden processes) is defined by the SDE s 2 dWt , (4.1) dYtλ = −∇y U(λ, Ytλ , X)dt + β with an input density fX fixed by the environment. For each input x and parameter vector λ, the joint state equilibrium distribution is Boltzmann, as in equation 3.7. Case 2: In this case, the hidden process is defined by the SDE s 2 λ λ dWt , dHt = −∇h U(λ, Ht , X)dt + β
(4.2)
with Uλ : Rn−d × Rm → R. As in case 1, the input density is fixed by the environment. Moreover, the equilibrium densities of observables conditioned on inputs and hidden states2 fO|H,X do not depend on λ. Note in case 2 that the drift for the hidden process does not depend on the observable process. From a neural net point of view, this says that there is no feedback connection from the observables back into the hidden nodes, a classic constraint in stochastic filtering (Oksendal, 1992, p. 58) Let {(Ä, F , mλ )}λ∈Rp be an indexed family of probability spaces where Ä = Rd × Rn−d × Rm , F = B (Ä), the Borel sigma-algebra of Ä. The probability measures mλ are defined by densities induced by the random variables 2
These conditional densities could be optimized, but that is not the focus of this article.
A Learning Theorem
1163
{Oλ , Hλ }λ∈Rp and X. We represent these density functions with the symbol f and appropriate superscript and subscripts. Next, we define a series of random variables needed in the derivation of the learning theorem. We define these random variables in terms of auxiliary functions whose only role is to facilitate the derivations by making explicit dependencies between different variables. • Let {Sλi , i = 1, . . . , p}λ∈Rp , be a family of random variables named the system covariates and defined by auxiliary functions {S˜ λi : Rd × Rn−d × Rm → R}λ∈Rp , ¡ ¢ Sλi , S˜ λi Oλ , Hλ , X ,
(4.3)
∂U(λ, o, h, x) . S˜ λi (o, h, x) , − ∂λi
(4.4)
• Let {Qλ : Rd × Rm → R}λ∈Rp be a family of functions called the state ˜ Rd × Rm × R × R → R, cost and defined by auxiliary function Q: ´ ³ λ Qλ (u, v),Q˜ u, v, fOλ (u) , fO|X (u | v) .
(4.5)
• Let {Rλ }λ∈Rp be a family of random variables named the evaluators, defined by auxiliary functions {R˜ λ : Rd × Rm → R}λ∈Rp , Rλ , R˜ λ (O, X) , ´ ³ λ ∂ Q˜ u, v, fOλ (u) , fO|X (u | v) . R˜ λ (u, v) , λ ∂ fO|X (u | v)
(4.6) (4.7)
• Let {T˜ λ : Rd × Rm → R}λ∈Rp be a family of functions defined as follows:
T˜ λ (u, v),
´ ³ λ ∂ Q˜ u, v, fOλ (u) , fO|X (u | v) ∂ fOλ (u)
.
(4.8)
• Let C: Rp → R be an overall cost function defined as Z
Z C(λ),
Rm
dx g(x)
Rd
do Qλ (o, x) ,
(4.9)
where g: Rm → R is a function for which equation 4.9 is well defined
1164
Javier R. Movellan
and such that Z dx g(x) T˜ λ (o, x) = 0 . ∇o
(4.10)
Rm
This condition tells us that the integral in equation 4.10 ought to be a constant with respect to o, (i.e., its gradient with respect to o is the zero vector). The reasons for this condition will become apparent later. Our goal is to find values of λ that minimize C(λ) by using gradient-descent approaches. We will show that these gradients are linear combinations of covariance statistics. We refer to this fact as the Boltzmann covariance theorem (BCT). Boltzmann covariance theorem. For systems defined as in cases 1 and 2, and satisfying equation 4.10, the gradient of the overall cost with respect to the network parameters is a linear combination of covariance statistics between the evaluator and the system covariates, and it has the following form: Z ∂C(λ) =β dx g(x) Covλ (Rλ , Sλi | X = x) , (4.11) ∂λi Rm where Covλ is a covariance with respect to measure mλ . Proof. The proof consists of two steps. In the first step, we study the gradients of output probabilities. In the second step, we study the gradient of the cost function. Step 1. We explore two different cases as defined at the beginning of this section. In case 1, the joint states have Boltzmann density λ (o, h | x) = fO,H|X
1 Zλ (x)
exp [ −β U(λ, o, h, x) ].
(4.12)
Thus, λ (o, h | x) ∂ fO,H|X
∂λi
λ = β fO,H|X (o, h | x)
[Eλ (Sλi | Oλ = o, Hλ = h, X = x) − Eλ (Sλi | X = x)] ,
(4.13)
where Eλ represents expected values with respect to measure mλ . Therefore, λ (o | x) ∂ fO|X
∂λi
Z = =
dh
λ ∂ fO,H|X (o, h | x)
∂λi Rn−d λ β fO|X (o | x) [ Eλ (Sλi | Oλ = o, X
= x) − E(Sλi | X = x) ].
(4.14)
A Learning Theorem
1165
In case 2, the hidden states are Boltzmann and the outputs have a fixed conditional density model fO|H,X . Thus, λ (h | x) = fH|X
λ (h | x) ∂ fH|X
1 Zλ (x)
exp [ −β U(λ, h, x) ] ,
(4.15)
λ = β fH|X (h | x) ¤ £ λ λ E (Si | Hλ = h, X = x) − Eλ (Sλi | X = x) ,
∂λi
(4.16)
and Z
λ (o | x) ∂ fO|X
=
∂λi
=
λ dh fO|H,X (o | h, x)
Rn−d λ β fO|X (o | x) λ λ [E (Si | Oλ =
λ (h | x) ∂ fH|X
∂λi (4.17) λ
o, X = x) − E
(Sλi
| X = x)],
(4.18)
which is equivalent to equation 4.14. Thus, from now on cases 1 and 2 behave identically. Step 2. Applying the chain rule on equation 4.9, Z ∇λ C(λ) =
Z
Rm
Z
+
λ do R˜ λ (o, x)∇λ fO|X (o | x) Z dx g(x) do T˜ λ (o, x) ∇λ fOλ (o),
dx g(x)
Rm
Rd
Rd
(4.19)
where R˜ λ and T˜ λ are defined in equations 4.7 and 4.8. The last term in equation 4.19 vanishes, Z dx g(x) do T˜ λ (o, x) ∇λ fOλ (o) Rm Rd Z Z do ∇λ fOλ (o) dx g(x) T˜ λ (o, x) = Rd Rm Z Z do ∇λ fOλ (o) = ∇λ dxfOλ (o) = 0 ∝
Z
Rd
(4.20) (4.21) (4.22)
Rd
R since Rd do fOλ (o) = 1 and we assume the equation 4.10 holds. Moreover, the components of the second term have the desired covariance form, ∂C(λ) = ∂λi
Z Rm
Z dx g(x)
Rd
do R˜ λ (o, x)
λ (o | x) ∂ fO|X
∂λi
(4.23)
1166
Javier R. Movellan
Z
Z
=β
dx g(x)
Rm λ λ [E (Si
Z
=β
ZR
×
m
Rd
λ do R˜ λ (o, x) fO|X (o | x)
| Oλ = o, X = x) − Eλ (Sλi | X = x)] Z λ dx g(x) do R˜ λ (o, x) fO|X (o | x) Rd
Rn−d
λ dh S˜ λi (o, h, x) fO,H|X (o, h | x)
Z dx g(x) Eλ (Rλ | X = x)Eλ (Sλi | X = x) −β Rm Z dx g(x) =β Rm λ λ λ [E (R Si
Z
=β
Rm
(4.24)
(4.25)
| X = x) − Eλ (Rλ | X = x)Eλ (Sλi | X = x)]
dx g(x) Covλ (Rλ , Sλi | X = x) .
Let an overall cost function C` have the form Z ` , dx g(x) q (x, C(λ, x)) , C(λ) m ZR ´ ³ λ do Q˜ o, x, fOλ (o) , fO|X C(λ, x) , (o | x) ,
(4.26)
Corollary.
Rd
(4.27) (4.28)
where q: Rd × R → R is a well-behaved function and Q˜ is defined in equation 4.5, satisfying equation 4.10. In this case the gradient of the overall cost with respect to the network parameters is also a linear combination of covariance statistics between an evaluator and system random variables Z ` ∂ C(λ) =β dx g(x) Covλ (R` λ , Sλi | X = x) , m ∂λi R ∂q (x, C(λ, X(ω)) λ λ ` ;ω ∈ Ä, R (ω) , R (ω) ∂C(λ, X(ω))
(4.29) (4.30)
where Rλ is defined in equation 4.6. Proof.
Applying the chain rule,
` ∂ C(λ) = ∂λi
Z Rm
dx g(x)
∂q (x, C(λ, x)) ∂C(λ, x) . ∂C(λ, x) ∂λi
Note that C can be expressed as an overall cost function, Z Z ´ ³ λ dx δ(x − u) do Q˜ o, x, fOλ (o) , fO|X C(λ, u) = (o | x) , Rm
Rd
(4.31)
(4.32)
A Learning Theorem
1167
where δ is the Dirac delta function. Thus, applying the BCT, Z ∂C(λ, u) =β dx δ(x − u) Covλ (Rλ , Sλi | X = x) ∂λi Rm = β Covλ (Rλ , Sλi | X = u) ,
(4.33) (4.34)
and ` ∂ C(λ) =β ∂λi
Z Rm
dx g(x)
∂q (x, C(λ, x)) Covλ (Rλ , Sλi | X = x) . ∂C(λ, x)
(4.35)
Moreover, since the partial derivative in equation 4.35 is a function of λ and x, it can be moved inside the covariance. 5 Evaluators for Common Optimization Problems An interesting aspect of the BCT is that the evaluators Rλ can be derived directly from the cost function without specifying the system being optimized. Moreover, the system covariates Sλ can be derived directly from the system dynamics, regardless of the cost function. This modularization of learning signals greatly simplifies software and hardware implementations. In this section, we derive the evaluators for four different learning problems that span supervised, unsupervised, and reinforcement situations: (1) regression, (2) density estimation, (3) risk minimization, and (4) information maximization. In section 6 we derive the system variables for continuous stochastic neural networks. 5.1 Regression. In regression problems, the goal is to learn the expected values of a random vector O conditional on an input vector X with respect to a probability measure P of input and outputs. In other words, the goal of regression is to approximate the function ζ (x) = EP (O | X = x), from inputs to conditional expectation of the output (Papoulis, 1991, p. 179). Thus, in regression problems we care only about the expected value of the distribution of outputs and disregard its higher-order statistics. Most of the applications on supervised neural network learning, and classical signal filtering can be seen as regression problems. A popular cost function for such problems is the expected Euclidean distance, or sum of squares. To simplify the presentation, we focus on the case with only one observable node (d = 1), Z 1 ` dx fX (x) k ζ (x) − Eλ (Oλ | X = x) k2 , (5.1) C(λ) = 2 Rm which has the form studied in equation 4.27, C(λ, u) = Eλ (Oλ | X = u) ,
(5.2)
1168
Javier R. Movellan
q (x, C(λ, u)) = 12 k ζ (x) − C(λ, u) k2 , ´ λ λ (o | x) o . Q˜ o, x, fOλ (o) , fO|X (o | x) = fO|X ³
(5.3) (5.4)
Moreover, C(λ, x) satisfies equation 4.10 since ´ ³ λ ∂ Q˜ o, x, fOλ (o) , fO|X (o | x) ∂ fOλ (o)
= 0.
(5.5)
Applying equation 4.29 the evaluator random variable follows, Rλ = Oλ , ¢ ¡ R` λ = −Oλ ζ (X) − Eλ (Oλ | X) .
(5.6) (5.7)
In general, if there is more than one observable node, d > 1, the evaluator is as follows: R` λ = −
d X
Oj
¡
¢ ζj (X) − Eλ (Oj | X) .
(5.8)
j=1
Note that this evaluator corresponds to the classic backpropagation delta signal (Rumelhart, Hinton, & Williams, 1986) evaluated at the outputs. In diffusion networks, however, the very same evaluator is sent to all the adaptive parameters (e.g., the weights), with no need of further transformation as we move to hidden layers. 5.2 Density Estimation. In this case the problem is that of approximating an entire mapping from inputs into probability densities on Rd . This is a much harder problem than regression, since we care about the entire density of outputs, not just the expected value. In general, density estimation is important when unimodal uncertainty models are not appropriate (Movellan & McClelland, 1993). A popular cost function for density estimation is the Kullback-Leibler information criterion (KLIC). In our case we need the KLIC between the desired and obtained conditional distributions averaged with respect to the input density (Haykin, 1994, p. 447), Z
Z C(λ) =
Rm
dx fX (x)
Rd
do p(o | x) log
p(o | x) , | x)
λ (o fO|X
(5.9)
where p(o | x) is the desired conditional output density. In this case, ´ ³ p(o | x) λ . Q˜ o, x, fOλ (o) , fO|X (o | x) = p(o | x) log λ fO|X (o | x)
(5.10)
A Learning Theorem
1169
The partial derivative of Q˜ with respect to fOλ (o) is zero, and thus equation 4.10 holds. The evaluator random variable, defined in equation 4.6, easily follows: Rλ = −
p(Oλ | X) . λ (Oλ | X) fO|X
(5.11)
This evaluator measures to what extent the desired density is larger than the obtained density (i.e, whether state regions are visited at the desired rate). It requires computing densities of individual states. Good results can be obtained by discretizing the states into a finite number of regions. Although the number of states to keep track of in principle grows exponentially with the number of units, in practice only a few regions with nonnegligible measure are visited. Those are the only ones we need to care for when computing our covariance statistic. Note that this method avoids the two different learning phases of the Boltzmann machine learning algorithm (Ackley et al., 1985). In any case, it is easy to show that the gradient obtained using the BCT can also be expressed as a generalized form of the standard Boltzmann learning algorithm: ∂C(λ) =β ∂λi
Z
dxfX (x) Covλ (Rλ , Sλi | X = x) Z Z dx fX (x) do p(o | x) = −β Rm
Rm
Rd
[Eλ ( Sλi | Oλ = o, X = x ) − Eλ ( Sλi | X = x )] .
(5.12)
where Eλ ( Sλi | Oλ = o, X = x ) is estimated by clamping the observable and input units, and Eλ ( Sλi | X = x ) is estimated by clamping the input units. For applications of diffusion networks to density estimation problems, see Movellan and McClelland (1993). Methods to accelerate learning are discussed in Stark and McClelland (1994). 5.3 Risk Minimization. The objective in this case is to minimize the expected loss, which in Bayesian decision theory is known as the risk (Duda & Hart, 1973, p. 14), C(λ) = Eλ [ ρ λ ],
(5.13)
where {ρ λ : Ä → R}λ∈Rp is the loss random variable, defined by an auxiliary loss function ρ: ˜ Rd × Rm → R, ˜ λ , X). ρ λ = ρ(O
(5.14)
Risk minimization is at the heart of most reinforcement problems. The definition of the loss function, ρ, ˜ is entirely general (e.g., it may be discrete
1170
Javier R. Movellan
or continuous; it may be based on the entire output state or on just a few dimensions of the state). In this case, ´ ³ λ λ (o | x)ρ(o, ˜ x) , Q˜ o, x, fOλ (o) , fO|X (o | x) = fO|X
(5.15)
which satisfies equation 4.10 since the partial derivative of Q˜ with respect to fOλ (o) is zero. Moreover, the evaluator is the loss itself, Rλ = ρ λ .
(5.16)
5.4 Information Maximization. Information maximization (infomax) is a classic criterion for unsupervised learning problems. Information maximization has been studied by Linsker (1988), and more recently by Nadal and Parga (1994) and by Bell and Sejnowski (1995), among others. It turns out that the infomax criterion satisfies the constraints of the BCT, and thus infomax learning can be performed using covariance statistics. An appropriate cost function for this problem is the negative mutual information between input and outputs, which is defined as follows (Haykin, 1994, p. 451): C(λ) = −Eλ [ Iλ ] ,
(5.17)
where Iλ : Ä → R is the mutual differential information random variable, defined by auxiliary functions {I˜λ : Rm × Rd → R}λ∈Rp , Iλ = I˜λ (O, X) , λ (o | x) fO|X . I˜λ (o, x) , log fOλ (o)
(5.18) (5.19)
Note that equation 5.17 is not a special case of risk minimization because I˜λ varies with λ whereas ρ˜ does not. In this case, λ (o | x) ´ ³ fO|X λ λ , (o | x) log Q˜ o, x, fOλ (o) , fO|X (o | x) = fO|X fOλ (o)
(5.20)
which satisfies equation 4.10, since Z ∇o
Rm
dx fX (x) T˜ λ (o, x) = −∇o
Z Rm
dx
λ (o | x) fX (x) fO|X
fOλ (o)
=0,
(5.21)
where T˜ is defined in equation 4.8. Therefore, the BCT applies, and the evaluator random variable easily follows, Rλ = −Iλ (Oλ , X) − 1.
(5.22)
A Learning Theorem
1171
Since constants do not affect covariance statistics, we can drop the −1. Equation 5.22 tells us that the smaller the mutual information in an outcome, the less we like that outcome. This evaluator requires computing mutual information of individual states. Good results can be obtained by discretizing the states into a finite number of regions. Although the number of states to keep track of in principle grows exponentially with the number of units, in practice only a few regions have nonnegligible measure and thus an influence on the covariance. 5.5 Remarks. Since covariances are linear operators, it is possible to optimize a weighted sum of learning criteria by using an evaluator that is a weighted sum of the evaluators for each criterion. Second-order optimization methods (Gill, Murray, & Wright, 1981, p. 105) would also be based on covariance statistics. The reason is that covariances are linear operators that satisfy the conditions of the BCT. Therefore, if the gradient is a linear combination of covariances, the gradient of gradients (the Hessian matrix) is also a linear combination of covariances. 6 System Covariates Up to now, we have derived evaluators for a variety of cost functions. In this section, we construct diffusion neural networks and derive their system covariates. This completes all that is needed to apply the BCT. There are many ways to construct diffusion neural networks with detailed equilibrium solutions. The one we worked with in our simulations is based on the continuous Hopfield model (Hopfield, 1984) and is constructed as follows. We are given: • An n × n symmetric matrix, w, whose elements represent coupling strengths between nodes in the network. • An n × m matrix, v, representing the coupling strengths between input lines and the nodes in the network. • A vector α ∈ Rn , of activation gains. • An invertible activation function g: R → R. • A constant β that controls the level of uncertainty in the system. The adaptive parameter vector λ is the elements of w, v, and α organized as a vector. We then define the following potential function, U(λ, y, x) =
n X i=1
−
Z αi
g(yi )
g−1 (s)ds −
g(0)
m n X X i=1 j=1
g(yi )vij xj ,
n 1X g(yi )wij g(yj ) 2 i,j=1
(6.1)
1172
Javier R. Movellan
where g−1 (g(y)),y. We then define the drift as the negative gradient of the potential, ∂U(λ, y, x) µi (λ, y, x) = −g0 (yi ) ∂g(yi ) n m X X wij g(yj ) + vij xj , = g0 (yi ) −αi yi + j=1
(6.2)
j=1
where g0 is the derivative of g. The state variables Yiλ are commonly interpreted as presynaptic activations and g(Yiλ ) as postsynaptic activation. The network dynamics are described by equation 4.1 defining a stochastic variation of the continuous Hopfield model (Hopfield, 1984). In this case, the Wiener process represents presynaptic noise. The system covariates (see equation 4.3) easily follow: g(Yj ) g(Yk ) g(Yj ) Xk Si = − R g(Yj ) g−1 (s) ds g(0)
if λi is wj,k , if λi is vj,k ,
(6.3)
if λi is αj .
Note that since the potential function is separable into additive secondorder terms, the system covariates are local. Moreover, the Si for the weight parameters are products of activations, and thus the necessary gradients can be computed with Hebbian-like operations. 7 Simulation: Image Reconstruction The purpose of this simulation was to test whether the BCT could be used in practice to train the equilibrium distribution of diffusion networks. To do so, we need to substitute the covariances called for by the BCT by estimates of these covariances based on discrete time approximations to SDEs. For this reason, it is unclear whether the learning algorithms would work at all in computer simulations. We chose a problem for which the true population covariances of the continuous system can be obtained analytically and which has potential applications for hardware implementations of diffusion networks (Alspector et al., 1992). The technique employed for the simulation was a simple forward-Euler approach. Equation 2.1 is replaced with the discrete-time stochastic difference equation, p λ = Ytλ + µ(λ, Ytλ , X)1t + σ 1t Zt , Yt+1t
(7.1)
where 1t is a small constant, and for each {t = k1t; k = 1, 2, . . .}, the random vectors Zt are independent identically distributed N-dimensional gaussian
A Learning Theorem
1173
with zero mean and identity covariance matrix. Moreover, we fixed σ = In , the n×n identity matrix. When the sampled points are linearly interpolated, this defines a stochastic process that converges in distribution to the solution of the original SDE as 1t → 0 (Gillespie, 1992, p. 193). 7.1 Task and Network Architecture. The task was to do optimal reconstruction3 of noise-contaminated samples from TULIPS1 (Movellan, 1995), a database of 935 human lip images.4 Each image consists of 100 × 75 pixels with gray-level values ranging from 0 to 255. The images were contaminated on a pixel-by-pixel basis with i.i.d. zero mean Gaussian noise and standard deviation ranging from 10 to 80. The network consisted of 100 × 75 observable units, one per pixel, an equal number of input units, and no hidden units. Thus, in our notation, n = m = d = 7500. There was a one-to-one correspondence of input units, output units, and pixels. Each input unit represented a pixel value in a noisy version of an image. The corresponding output unit represented the clean value of the same pixel. To do the reconstruction, each output unit used a 15 × 15 receptive field: each output unit received input from the corresponding input unit and from 224 surrounding input units arranged as a square patch. Image borders were treated using a toroidal wrap-around of the input image. All receptive fields were constrained to share the same kernel of weights, effectively performing a convolution operation. Thus, the total number of free parameters was p = 225. The activation function was linear g(y) = y since for this case the optimal solution can be shown to be a classical Wiener filter (Jain, 1989), which can be calculated analytically. This allowed us to compare the solution found by the network with the optimal solution. The BCT works for equilibrium solutions, the limiting density as t → ∞. In practice, the initial state of each output unit was set equal to the inner product between the input weight vector of that unit and the input image being processed. Then we cycled 10 times using equation 7.1, with 1 t = 0.1. After these 10 settling cycles, the system was considered sufficiently closed to stochastic equilibrium. Equilibrium statistics were then calculated by running the network for 10 additional cycles and estimating expected values and covariances based on the states obtained during those 10 cycles. 7.2 Training Sample. Each training sample consisted of a noisy input image, which was used as input (100 × 75 noisy pixel values), and the corresponding clean image, which was used as a teacher for the output units. Training was performed with respect to the sum of squares criterion
3
We are using optimal in the mean square sense. It is well known that other filters may perform better with respect to other criteria. 4 Available at http://cogsci.ucsd.edu.
1174
Javier R. Movellan
Figure 1: (Left) Weight kernels analytically derived. Each image has 15 × 15 pixels, with each pixel representing a weight. Large weights appear white, small weights dark. (Right) The weight kernels learned using a diffusion network. The four kernels reflect four different noise conditions. From top to bottom and left to right, the standard deviation of noise is 10, 20, 40, and 80, respectively, where each pixel can take values from 0 to 255.
presented in section 5.1. Thus, the goal was for the expected value of each output unit to approximate as closely as possible the value of the clean pixel corresponding to that output unit. Learning was done using Newton’s second-order method (Gill et al., 1981, p. 105), with a single pass over the entire image database. As mentioned in section 5.5, computation of the Hessian matrix is also accomplished through the use of covariance statistics. We had to resort to second-order methods because first-order methods were too slow for this task, a problem shared by deterministic linear systems with a large number of correlated inputs. Figure 1 shows the impulse response of the optimal Wiener filter, analytically obtained, and the weight kernels obtained using the discrete time approximation to the diffusion network. Each image in Figure 1 has 15 × 15 pixels, with each pixel representing a weight. Large weights appear white, small weights dark. The weights were obtained for four different levels of noise in the input images. As the figure shows, the solutions obtained by sampling in discrete time were very close to the analytical solutions. Note how the kernels increase in size as the noise power increases. Note also that the kernels are elongated horizontally, capturing the fact that lips are mostly a horizontal structure and, thus, on average, pixel values correlate more with horizontal than with vertical neighbors. Figure 2 shows example reconstructions of three images performed by the simulated diffusion network. The first column shows the original images, the second column shows the images contaminated with gaussian noise (mean = 0, SD = 80), and the third column is the reconstruction. Each pixel of the reconstruction image is the conditional average activation of an output unit given the noisy input image.
A Learning Theorem
1175
Figure 2: (Left to right) Clean images, contaminated images (SD = 80), and reconstructed images.
8 Discussion We explored the problem of learning equilibrium solutions in diffusion networks, a stochastic extension of continuous neural networks. A learning theorem is proposed that specifies sufficient conditions for the gradients of cost functions to be computable by simple covariance statistics of a uniform evaluator and a set of local system variables. The conditions proposed by the theorem apply to a variety of cost functions that span common supervised, unsupervised, and reinforcement problems. Although our focus was on training diffusion networks, the learning theorem applies to any system whose equilibrium solution is Boltzmann (see Neal, 1993, and Gidas, 1986, for a review of such systems). The article suggests an approach to learning that may serve as inspiration for hardware design and computational neuroscience. The approach is consistent with von Neumann’s views of the brain as a system in which “error . . . is not [seen] as an extraneous and misdirected or misdirecting accident, but as an essential part of the process under consideration” (Neumann, 1956). Instead of suppressing noise present in physical devices, natural computers may use it to simplify the credit assignment problem. In deterministic approaches, like backpropagation, proper credit assignment requires an evaluation signal that is different for each node in the network. Surprisingly, when noise is not suppressed, all that is needed is an evaluator that is the same for the entire network. The covariance between this uniform evaluator and Hebbian signals is sufficient for the proper distribution of credit throughout the entire network. Acknowledgments This article emerged through interactions with James McClelland, who sparked my interest in noisy natural computers. Ongoing interactions with Ruth Williams and Paul Mineiro helped formalize my thinking about diffu-
1176
Javier R. Movellan
sion networks. Comments from an anonymous reviewer provided a missing link for the BCT. I am grateful to David Zipser for his guidance. References Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9(2), 147–169. Alspector, J., Jayakumar, A., & Luna, S. (1992). Experimental evaluation of learning in a neural mycrosystem. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems (Vol. 4, pp. 871–878). San Mateo, CA: Morgan Kaufman. Andreou, A. (1994). On physical models of neural computation and their analog VLSI implementation. In Workshop on physics and computation, Phys-comp 94 conference (pp. 255–264). Los Alamitos, CA: IEEE Computer Society. Bell, T., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Borkar, S. V. (1989). Optimal control of diffusion processes. New York: Longman. Cowan, J. D. (1968). Statistical mechanics of nervous nets. In E. R. Caianiello (Ed.), Neural networks (p. 181). Berlin: Springer-Verlag. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions of Pattern Analysis and Machine Intelligence, PAMI-6, 721–741. Geman, S., & Hwang, C.-R. (1986). Diffusions for global optmization. SIAM J. Control and Optimization, 24, 1031–1043. Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models of the spike activity of a single neuron. Biophysics J., 4, 41–68. Gidas, B. (1986). Metropolis-type Monte Carlo simulation algorithms and simulated annealing. In J. L. Snell (Ed.), Topics in contermporary probability and its applications (pp. 159–232). Boca Raton, FL: CRC Press. Gill, E. P., Murray, W., & Wright, M. H. (1981). Practical optimization. London: Academic Press. Gillespie, D. T. (1992). Markov processes: An introduction for physical scientists. San Diego: Academic Press. Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximations for neural activity including synaptic reversal potentials. J. Theoretical Neurobiology, 2, 127–153. Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan. Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Science, 81, 3088–3092. Isard, M., & Blake, A. (1996). Contour tracking by stochastic propagation of conditional density. In Proc. European Conf. Computer Vision (Vol. 58, pp. 343– 356). Cambridge, UK.
A Learning Theorem
1177
Jain, A. K. (1989). Fundamentals of digital image processing. Englewood Cliffs, NJ: Prentice Hall. Karatzas, I., & Shreve, S. E. (1988). Brownian motion and stochastic calculus. New York: Springer-Verlag. Landauer, R. (1992). Information is physical. In Proceedings of the 1992 Physics of Computation Workshop (pp. 1–4). Los Alamitos, CA: IEEE Computer Society. Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117. Matsuyama, M., Shirai, K., & Akizuki, K. (1974). On some properties of stochastic information processes in neurons and neuron populations. Kybernetik, 15, 127–145. McClelland, J. L. (1993). Toward a theory of information processing in graded, random, and interactive networks. In D. E. Meyer & S. Kornblum (Eds.), Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience (pp. 655–688). Cambridge, MA: MIT Press. Mead, C. A., & Conway, L. (1980). Introduction to VLSI. Reading, MA: AddisonWesley. Movellan, J. R. (1995). Visual speech recognition with stochastic neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press. Movellan, J. R., & McClelland, J. L. (1993). Learning continuous probability distributions with symmetric diffusion networks. Cognitive Science, 17(4), 463– 496. Nadal, J. P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. Network: Computation in Neural Systems, 5, 565–581. Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto. Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In C. E. Shannon & J. McCarthy (Eds.), Automata studies (pp. 43–98). Princeton: Princeton University Press. Ohira, T., & Cowan, J. D. (1995). Stochastic single neurons. Neural Computation, 7, 518–528. Oksendal, B. (1992). Stochastic differential equations. Berlin: Springer-Verlag. Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill. Poggio, T., & Girosi, F. (1994). Continuous stochastic cellular automata that have a stationary distribution and no detailed balance (Tech. Rep. A.I. Memo No. 1168). Cambridge, MA: MIT, AI Lab. Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), 257–286. Ratcliff, R. (1979). Group reaction time distributions and an analysis of distribution statistics. Psychological Bulletin, 86, 446–461. Ricciardi, L. M. (1977). Diffusion processes and related topics in biology: Lecture notes in biomathematics. Berlin: Springer-Verlag.
1178
Javier R. Movellan
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press. Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 194–281). Cambridge, MA: MIT Press. Stark, C., & McClelland, J. L. (1994). Tractable learning of probability distributions using the contrastive Hebbian algorithm. In Proceedings of CogSci94 (pp. 818–823). Hillsdale, NJ: Erlbaum. Zipser, D. (1991). Recurrent network model of the neural mechanism of shortterm active memory. Neural Computation, 3, 179.
Received June 21, 1996; accepted September 9, 1997.
LETTER
Communicated by Marcus Meister
Asymmetric Dynamics in Optimal Variance Adaptation Michael DeWeese Sloan Center, Salk Institute, La Jolla, CA 92037, U.S.A.
Anthony Zador Salk Institute MNL/S, La Jolla, CA 92037, U.S.A.
It has long been recognized that sensory systems adapt to their inputs. Here we formulate the problem of optimal variance estimation for a broad class of nonstationary signals. We show that under weak assumptions, the Bayesian optimal causal variance estimate shows asymmetric dynamics: an abrupt increase in variance is more readily detectable than an abrupt decrease. By contrast, optimal adaptation to the mean displays symmetric dynamics when the variance is held fixed. After providing several empirical examples and a simple intuitive argument for our main result, we prove that optimal adaptation is asymmetrical in a broad class of model environments. This observation makes specific and falsifiable predictions about the time course of adaptation in neurons probed with certain stimulus ensembles. 1 Introduction Many real-world signals of interest to both biological and synthetic systems are characterized by their large dynamic range. This dynamic range poses a challenge to both wetware and hardware, which are often constrained to operate within a much more limited dynamic range. For example, photoreceptors in the retina saturate over 2 orders of magnitude of light intensity, yet the retina can operate over 10 orders of magnitude (Barlow & Mollon, 1982). How can a device operating over only 2 orders of magnitude deal with signals that span 10? The retina exploits the nonstationary statistics that characterize light intensity in the real world; on short time scales, light intensity tends to fluctuate over a much smaller range. That is, the retina adapts to the mean light intensity. Adaptation is a basic strategy used by other sensory modalities as well, including the auditory and somatosensory systems (Barlow & Mollon, 1982). The mean is the simplest statistical characteristic, but it is not the only one. A well-designed system might also be expected to adapt to the variance of a signal. For example, if the input-output response function of a system were sigmoidal, the mean might be used to fix the midpoint of the sigmoid, while the variance might determine the slope at the midpoint. Adaptation to c 1998 Massachusetts Institute of Technology Neural Computation 10, 1179–1202 (1998) °
1180
Michael DeWeese and Anthony Zador
variance (contrast) is well established in the visual system (Shapley, EnrothCugell, Bonds, & Kirby, 1972; Shapley & Enroth-Cugell, 1984; Enroth-Cugell & Robson, 1966; Shapley & Victor, 1978, 1979; Kelly, 1961; deLange, 1958; Giaschi, Douglas, Marlin, & Cynader, 1993; Bonds, 1991; Shapley, 1997). If a system is to adapt to a signal’s nonstationary statistics, it must first estimate those statistics. Optimal estimation of a nonstationary mean is a well-understood problem (Papoulis, 1991). Optimal estimation of a nonstationary variance has received much less attention. Here we consider the problem of estimating a nonstationary variance. First we define a broad class of processes generated by hidden Markov models that have a well-defined nonstationary variance. We then derive the Bayesian optimal causal1 estimator of the instantaneous variance. Finally, we show that for many processes within this class, the dynamics of the optimal estimator show an asymmetry: an abrupt increase in variance is more readily detectable than an abrupt decrease. This asymmetry offers a falsifiable test of the hypothesis that sensory systems adapt optimally to nonstationary variance. 2 General Framework Our goal in this section is to define a class of discrete-time processes si with an instantaneous time-varying variance σi2 , and then derive the Bayesian optimal causal estimate of the standard deviation at time ti given its a priori statistics and a realization2 sj≤i up to ti . In order to isolate the features of optimal adaptation to variance, we will consider time series with fixed means, and whose third and higher moments are completely determined by the time-varying second moment. In addition, we will study optimal estimation for cases where the mean and variance are covarying. Under both of these conditions, knowing the current value of the variance is tantamount to knowing the whole distribution. 2.1 Markov Generating Process. We begin by writing a discrete-time description of a hidden Markov model with one internal state variable, σ , and one observable output, s: σi = F1 [σi−1 , zi , ui , . . .],
(2.1)
si = F2 [σi , yi ],
(2.2)
1 We call our estimator causal since its estimate for the standard deviation at (discrete) time ti does not depend on its input at any later times tj , j > i. 2 We use s j
Asymmetric Dynamics in Optimal Variance Adaptation
1181
where subscripts index time steps and ui , yi , zi , and all other variables appearing after σi−1 in the argument of F1 are independent and identically distributed (i.i.d.) random variables drawn from their a priori distributions P(u), P(y), P(z), and so forth.3 The signal si will be the input to our estimator. This formulation includes a broad class of models, and it can be generalized to describe arbitrary nonstationary processes by adding more hidden variables. We will consider particular choices of F2 and P(yi ) in which σi2 can be interpreted as the “instantaneous variance” of the signal, si . For example, we will sometimes define si = σi × yi , where P(y) is a gaussian with unit variance, so that the variance of si is given by σi2 . Our task is to use all observations s1 , s2 , . . . , si up to the present time ti to estimate the current value of σi . This would be a trivial problem if si were a deterministic function of σ , since then σ = F−1 (s). We therefore focus on the nontrivial case, where u, y, z, and so forth are stochastic. 2.2 Optimal Bayesian Estimator. We now derive the optimal Bayesian estimator of the standard deviation. Because the generating process (see equation 2.1) is Markovian, at every point in time our knowledge about σi is completely summarized by the probability distribution, P(σi |sj≤i ), of σi given all previously observed data sj≤i . From this conditional distribution, we can compute specific estimators such as the mean, the mode (most likely value), or any other estimator with whatever properties we choose, but we emphasize that the fundamental object for any estimation task is the distribution P(σi |sj≤i ) itself. First, we take advantage of the Markovian nature of σ to write an expression for the distribution for the current value of σ given all data up to the last time step as a functional of P(σi−1 |sj
dσi−1 P(σi |σi−1 )P(σi−1 |sj
(2.3)
where P(σi |σi−1 ) is the distribution of the current value of σ given its previous value. Next we use Bayes’s rule to combine P(σi |sj
P(σi |sj≤i ) =
(2.4)
where we have explicitly rewritten the distribution in the denominator as a 3 We use the convention that all probability distributions are written as P(. . .); the argument of P determines which probability distribution we mean.
1182
Michael DeWeese and Anthony Zador
normalization constant, which can be obtained by integrating the numerator: Z (2.5) Ä = dσi P(si |σi )P(σi |sj
(3.1)
si = σi yi .
(3.2)
Asymmetric Dynamics in Optimal Variance Adaptation
1183
Here both zi and yi are gaussian i.i.d. variables with zero mean and unit p 2 2 2 variance, so that for each time step P(s√ i |σi ) = exp(−si /2σi )/ 2π σi , −∞ < 2 si < ∞, and P(σi ) = 2 × exp(−σi /2)/ 2π , 0 < σi < ∞. On average, then, the signal, s, is distributed according to Z
∞
P(si ) =
Z dσi P(si |σi )P(σi ) =
0
∞
K0 (|si |) e−si /2σi −σi /2 , = π σi π 2
dσi 0
2
2
(3.3)
where K0 is the modified Bessel function of the second kind. Note that by adding a constant to the right-hand side of equation 3.2, we could give si a nonzero mean value, but this would change the optimal estimation strategy only in a superficial way, so we will absorb any nonzero mean into our definition for s. We use Bayes’s theorem to compute the conditional distribution for σi given the observation si , P(si |σi )P(σi ) P(si ) 2 2 2 e−si /2σi e−σi /2 . = σi K0 (|si |)
P(σi |sj≤i ) = P(σi |si ) =
(3.4)
This is just equation 2.6 for the special case where σi is drawn afresh at each step independent of σj
∞
σ¯ ≡
dσi σi P(σi |si ) =
0
p e−|si | . π/2 K0 (|si |)
(3.5)
For comparison, the maximum likelihood estimator for σ is obtained by solving dP(σi |si )/dσi = 0, r σm.l. =
−1/2 +
q 1/4 + s2i .
(3.6)
Despite the rather different forms of these expressions, these two estimators agree well for s > ∼ 1 where the distribution is not too asymmetrical about the peak. (However, as s → 0, σm.l. → 0, while σ¯ remains finite.) The corresponding the variance give essentially the same result: p estimators involving p ¯ 2 2 σ¯ ≈ σ and σ ≈ (σ ) . Thus, our results reflect the statistics of the m.l.
m.l.
input, but are not sensitive to the exact form of our estimator, as we will discover in the upcoming sections. In this simple example, we were able to find closed-form analytic expressions for a variety of estimates of σi at each time step. In the next few
1184
Michael DeWeese and Anthony Zador
sections, we consider dynamics that require the optimal estimator to make use of the history of the process. For these cases, we will obtain analytical results for only the first time step. Fortunately, it will be possible to compute the time course of our optimal estimator σ¯ (and σm.l. ) numerically for any prior distribution on σ , and any distribution for the signal being used to probe the estimator. 3.2 Variance Switching. 3.2.1 Two Values. We will now derive the optimal estimator for a simple nonstationary environment. Consider a world where the variance switches 2 and a low-value σ 2 . We will draw the switching between a high-value σhigh low times from a homogeneous Poisson process so that they are totally uncorrelated with each other; knowing exactly when switches have occurred in the past gives no information about when they will occur in the future. We will then construct the optimal causal estimator, σ¯ i , for the time-dependent standard deviation of this process, which minimizes the root mean squared (rms) error. Note that our estimator has access only to the signal, s, not to the underlying switching times of σ , so even though the switching times of the true σ are uncorrelated, our estimator must incorporate the entire past time course of s to do the optimal job. Even in this impoverished environment, we will find that the optimal estimator will behave in a subtle way: it will respond to an increase in the variance more quickly than a decrease. The process we are interested in can be written as σi = zi σi−1 + (1 − zi )(σhigh + σlow − σi−1 ),
(3.7)
si = σi yi .
(3.8)
Here zi is a binary variable that assumes values 0 with probability x and 1 with probability 1 − x, and yi is gaussian with zero mean and unit variance. The probability of switching per time step is thus x. The standard deviation 1 : σi of the signal si now has a memory, with a correlation time τc ∼ 2x hσi σi+t iP(sj ) =
(σhigh − σlow )2 −2xt (σhigh + σlow )2 + e , 4 4
(3.9)
for times long compared to the time step. Sample realizations of σ and s are shown in Figure 1. The optimal estimator for this simple problem can make use of the fact that σ takes on only two values by expressing the entire distribution P(σi |sj≤i ) ≡ P(σi = σlow |sj≤i ). 1 − Plow as a single parameter: Plow i i is then the probability that σi = σhigh . Since x is the a priori probability that a switch occurred in the last time step, low P(σi = σlow |sj
(3.10)
Asymmetric Dynamics in Optimal Variance Adaptation
1185
5 Standard deviation of signal
0 0 10
20
40
60
80
100
120
140
160
80
100
120
140
160
120
140
160
Instantiation of signal
0 −10 0
20
40
60
5 Estimate of standard deviation
0 0
20
40
60 80 100 Time (iteration number)
Figure 1: Dynamics of the optimal estimator for two-state variance switching. (Top) The standard deviation during one cycle of the “probe” signal. (Center) A specific instantiation of this signal over the same period of time, generated according to equation 3.7 from the time-varying standard deviation shown at the top; this is the signal s available to the estimator. (Bottom) The optimal causal estimate (obtained from s) of the standard deviation, averaged over 1000 cycles of the signal. For this example, the prior assumes that the signal, s, is gaussian with a standard deviation that randomly switches with probability .001 per time step between σlow = 1 and σhigh = 4, the same two values used to generate the “probe” (top curve). Notice that the estimate responds more quickly to an increase in the signal’s standard deviation (solid portion of curve) than it does after a decrease (dashed curve). This difference in adaptation rates is more apparent in Figure 2.
and P(σi = σhigh |sj
= Plow i
where the normalization is Ä=
+ x(1 − Plow (1 − x)Plow i−1 ) −s2i /2(σlow )2 pi−1 e + 2π(σlow )2
(3.11)
1186
Michael DeWeese and Anthony Zador low xPlow i−1 + (1 − x)(1 − Pi−1 ) −s2i /2(σhigh )2 q e . 2π(σhigh )2
(3.12)
Thus, at each time step, equations 3.10 and 3.11 combine the new observation low si with Plow i−1 to obtain the updated probability Pi . Note that for x = 1/2, the current estimate for the standard deviation, σi , given by equation 3.10, depends on only the current value of the signal si , as in the previous section. The prior in this case assumes that the variance of s switches between two known values at randomly chosen times. The variance spends equal time at the high and low values, and switches between these values are instantaneous in both directions, so the underlying dynamics of σ is totally symmetric under the exchange σlow ↔ σhigh . Despite this, the optimal estimator behaves asymmetrically to increases and decreases in variance. This can be shown analytically for the first time step after abrupt switches in the variance (see the appendix). To study the entire time course of the adaptation, we use equation 3.11 numerically at every time step while probing the estimato compute Plow i tor with a square wave in standard deviation. The estimator is optimized for an ensemble of different waveforms of σ (here, the random telegraph signal), but to illustrate the behavior of the estimator, we use a single such signal (the square wave). In this case, the probe stimulus is not too unlikely in the estimator’s prior distribution, which will not always be the case in later sections. We calculate the estimate for the standard deviation, which minimizes the rms error via ³ ´ low (3.13) σhigh . σ¯ i = Plow i σlow + 1 − Pi Figure 1 shows the trajectory of this estimate averaged over many periods of the standard deviation square wave. Figure 2 redisplays the portions of the curve immediately following switches in the input standard deviation in a way that makes it easier to compare the response times to abrupt increases and decreases in standard deviation. The observations s are drawn independently at every time step, so successive presentations of the up and down jumps in standard deviation result in new instantiations of s. It is clear from the figure that the optimal estimate for the standard deviation tracks the upward step faster than the downward step. In the next section, we confirm that this is true for a more complex prior. 3.2.2 Many Values. We now consider a prior in which σ jumps from one value to the next at random times, but unlike the previous example, σ will now assume a spectrum of values rather than just two: σi = zi σi−1 + (1 − zi )ui ,
(3.14)
si = σi yi ,
(3.15)
1187
1
4 After upward switch
3
After downward switch (inverted)
3
2
1 0
2
5 10 Time after last switch (iterations)
4 15
Estimate of standard deviation after downward switch
Estimate of standard deviation after upward switch
Asymmetric Dynamics in Optimal Variance Adaptation
Figure 2: Dynamics of optimal adaptation are faster following an increase than a decrease in variance. Here we have replotted the optimal causal estimate of the standard deviation σ¯ from the bottom plot of Figure 1 so that the rates of adaptation after increases and decreases can be more easily compared. The solid curve is the average σ¯ following upward jumps in the signal’s standard deviation, as before. The dashed curve is the average σ¯ following downward jumps, but it has been inverted to make the comparison easier. For both curves, the time of the last switch is set to zero. Clearly, optimal adaptation in this case is faster following abrupt increases in standard deviation than it is after decreases.
where u is uniformly distributed over some finite range of nonnegative values, [a, b], and y and z are gaussian and binary i.i.d., respectively, as in the last section. In other words: P(σi |σi−1 ) = xFlat(σi ) + (1 − x)δ(σi − σi−1 ),
(3.16)
where Flat(σ ) is 1/(b − a) whenever a ≤ σ ≤ b, and 0 otherwise so that all allowed values for the standard deviation are equally likely. From equation 3.15, si is gaussian with variance σi2 , so equation 2.6 becomes µ ¶ 1 1 s2i exp − 2 [xFlat(σi )+(1−x)P(σi−1 = σi |sj
Michael DeWeese and Anthony Zador
1
4 After upward switch
2
3
After downward switch (inverted) 3
2
1 0
5
10 15 20 Time after last switch (iterations)
4 25
Estimate of standard deviation after downward switch
Estimate of standard deviation after upward switch
1188
Figure 3: Dynamics of optimal adaptation for a uniform, “jumping” prior probed with a step. As in the plot in Figure 2, the solid curve shows the dynamics of average σ¯ following an upward jump in standard deviation from 1 to 4, while the dashed curve (inverted) shows the behavior following a downward jump from 4 to 1. Here, the estimator is optimized for a gaussian distributed signal with a standard deviation that jumps with probability per time step (x) of .001 to a new value drawn at random from all values between 0.1 and 8. Note that following an upward jump, the initial response is more rapid, the asymptotic behavior is slower, and the estimate overshoots the correct value.
The estimate for the standard deviation that minimizes the rms error is given by the mean of this distribution: Z (3.18) σ¯ i ≡ dσi σi P(σi |sj≤i ). Figure 3 illustrates the dynamics of σ¯ for x = 1/1000 and σhigh /σlow = 4. Once again, for times immediately following a jump in variance, the initial upward adaptation is faster than downward. For longer times, however, the upward adaptation asymptotes more slowly. This is related to the fact that the number of independent examples of the signal needed to get an estimate of either the mean or the standard deviation to some fixed level of accuracy grows quadratically with standard deviation. Curiously, the upward adaptation tends to overshoot before asymptoting to its final value, provided that the larger of the two standard deviation values we are probing with is not too close to the upper cutoff (8 in this
Asymmetric Dynamics in Optimal Variance Adaptation
1189
case). There is a simple explanation. After receiving an unexpectedly large value for si , the estimator infers that the variance must have jumped to a higher value. Initially, it weights all standard deviation values between si and the upper cutoff about equally, since its prior on the standard deviation is flat. After measuring the signal for many time steps, the estimator homes in on the true value. Neither of the two curves in Figure 3 is well fit by a single exponential, nor are they obviously related by some other simply parameterizable family of curves. For this reason, the effective rate of adaptation for either curve is a function of the delay since the last jump in the probe variance, which makes quantitative comparisons between different parameter settings difficult. Qualitatively, the differences between the curves are more pronounced as σhigh /σlow is increased or x is decreased. In the other extreme, if we set σlow = σhigh , then no adaptation is required. In the limit of long delays, the behavior is predictable from simple statistics. Intermediate delays give mixed results, which depend on the details of the prior. In the limit of short delays, we observe universal behavior: the asymmetrical dynamics, which we have emphasized. If we minimize the error of our estimate of the variance rather than the standard deviation, we arrive at essentially the same results, whether we use a flat a priori distribution for the standard deviation or the variance. In addition, the dynamics of the mode, σm.l. , of P(σi |sj≤i ) are essentially the same as for the mean and display the same asymmetry. So far we have probed our estimator with a signal that is not too unlikely in its expected input ensemble. This will not be the case in the next section. 3.3 Smoothly Changing Variance. We now consider a prior in which the variance changes smoothly with time according to diffusive dynamics (or a random walk in the discrete version shown here) with reflecting boundaries at a and b: σi−1 + 2D1tzi , 2b − σi−1 − 2D1tzi , σi = 2a − σi−1 − 2D1tzi ,
if a ≤ σi−1 + 2D1tzi ≤ b if σi−1 + 2D1tzi > b, if σi−1 + 2D1tzi < a,
si = σi yi ,
(3.19)
(3.20)
where b > a > 0, 1t is the duration of each time step, D is the (one dimensional) diffusion constant, and y and z are both gaussian distributed i.i.d. processes with unit variance and zero mean. Far from the boundaries, this implies that µ
(σi − σi−1 )2 exp − P(σi |σi−1 ) = √ 4D1t 4π D1t 1
¶ (3.21)
1190
Michael DeWeese and Anthony Zador
and e−si /2σi . P(si |σi ) = p 2πσi2 2
2
(3.22)
D is defined as the inverse of the correlation time constant, D = 1/τc . Like the example in the previous section, the underlying dynamics of σ are symmetric with respect to increases and decreases in standard deviation. By imposing reflecting boundary conditions, we ensure that the time-averaged distribution for the standard deviation is flat, so that any asymmetry in adaptation time is not due to the relative likelihood of big and small standard deviation values. Combining equations 3.21 and 3.22 with equation 2.6 as before, we obtain an expression for updating the conditional distribution for the standard deviation after receiving si : P(σi |sj≤i ) =
2 2 Z b 1 1 e−si /2σi p dσi−1 √ Ä 2πσi2 a 4π D1t ¶ µ (σi − σi−1 )2 P(σi−1 |sj
(3.23)
where4 we have again introduced a and b, the lower and upper cutoffs for σ . We will again use the mean of this distribution as our estimate for the current standard deviation, σ¯ i (see equation 3.18). For the first time step after an abrupt change in variance, we can derive a compact expression for the rate of adaptation following an upward jump in variance σlow → σhigh (see the appendix): rateup ≡
up σˆ i
− σlow = D1t
"µ
2 σlow
σhigh σlow
¶2
# −1 .
(3.24)
Dividing this by the corresponding expression for a downward jump yields: up
rateup σˆ i − σlow ≡ down = ratedown σˆ i − σhigh
µ
σhigh σlow
¶3 .
(3.25)
This is certainly asymmetric, with the upward adaptation faster than the downward adaptation for any pair of variances. 4 For brevity, we dropped some terms on the right-hand side of equation 3.23 which we included in our algorithm to enforce our reflecting boundary conditions. Since we always placed our boundaries far from the extreme values of our probe stimulus compared to D1t, our results were independent of the boundary conditions.
Asymmetric Dynamics in Optimal Variance Adaptation
1191
Figure 4: Dynamics of optimal adaptation for a diffusive prior probed with a step. As in Figures 2 and 3, the solid curve shows the dynamics of the average σ¯ following an upward jump in standard deviation from 1 to 4, while the dashed curve (inverted) shows the behavior following a downward jump from 4 to 1. In this case, the estimator is optimized for a gaussian distributed signal with a standard deviation that diffuses (i.e., takes a random walk) between 0.05 and 5.0 with reflecting walls at the boundaries. The correlation time (the inverse of the one-dimensional diffusion constant) of the diffusion process is 300 time steps. Note that following an upward jump, the initial response is more rapid and the asymptotic behavior is slower.
For the complete dynamics, we must compute the solution numerically. To do this, we discretize the probability distribution P(σi |sj≤i ) between the reflecting boundaries at .05 to 4.95 in bins of width .025. The results for D1t = 1/300 are plotted in Figure 4. For the sake of comparison between the different models, we can relate the correlation time of this diffusing prior to that of the switching prior of the last section by estimating the expected time for the standard deviation to diffuse from σlow to σhigh . For σhigh = 4, σlow = 1 and D1t = 1t/τc = 1/300, the average time for the rms displacement of the standard deviation to reach σhigh − σlow = 3 is
3=
p 2t/τ ⇒ t = 4.5τ ∼ 1400 time steps,
(3.26)
1192
Michael DeWeese and Anthony Zador
so the correlation time for these parameter settings roughly corresponds to that of the switching prior with the probability of switching per time step (x) near 1/1400. As in the previous examples, adaptation immediately following an upward jump is faster than it is after a downward jump. As before, the upward adaptation is slower to asymptote to its final value due to the larger value of the final variance. Once again, we find that neither of the two curves is well fit by a single exponential, which makes quantitative comparisons between different parameter settings complicated for intermediate to long times after jumps in standard deviation. Qualitatively, the differences between the two curves become more pronounced with increasing σhigh /σlow and decreasing D1t. Unlike the previous example, the diffusion prior produces monotonic behavior with no overshoot, so the curves cross at some intermediate time during the adaptation. Another difference is that in this example, the curves are independent of the boundary conditions, whereas the dynamics of the last example were strongly cutoff dependent. Clearly, qualitatively universal behavior is present immediately after abrupt changes in the variance, while the dynamics are prior dependent at later times. As in the previous section, we repeated the analysis of this section for a diffusing variance rather than standard deviation and found no significant difference in our results whether we minimized the error of the standard deviation or the variance. Again, the mode, σm.l. , of P(σi |sj≤i ) displayed the same asymmetrical dynamics as did the mean of that distribution. To summarize our results so far, we computed the optimal estimator σ¯ for three processes with different dynamics for their respective time-varying variances. At any single moment in time, these processes were all gaussian with zero mean. In each case, when dynamics were probed with a square wave of standard deviation, the adaptation to an upward jump was faster than for a downward jump. We now consider nongaussian signal distributions and environments where the mean and variance fluctuate together. 3.4 Other Signal Distributions. 3.4.1 Nongaussian Signal Distributions and Nonflat Priors on the Standard Deviation. In all of the previous examples, we considered nonstationary signals that were gaussian distributed at every moment in time. We repeated each of these examples for exponentially distributed signals, and the results were qualitatively the same as for gaussian signals in every case. In fact, the asymmetry in the dynamics persisted even if we biased the prior distribution on the standard deviation to favor smaller values. Specifically, we sometimes used a flat prior for 1/σ rather than for σ so that P(σ ) = dσ/σ 2 . As we explain in section 4, the adaptation dynamics are largely determined by the occurrence of outliers in the signal distribution immediately after abrupt changes in the probe stimulus. For that reason, the dynamics are roughly independent of the exact shape of the distribution everywhere
Asymmetric Dynamics in Optimal Variance Adaptation
1193
except in the tails. In the appendix, we prove that under weak assumptions, the dynamics are asymmetric immediately following jumps in the probe stimulus for the binary switching standard deviation model, provided that the tails fall off like exp(−asni ) for all positive a and n. This is a rich set of functions for the shape of the tails for which the distribution has well-defined moments at all orders. The proof is valid for a distribution of nearly any shape, so long as it does not get arbitrarily small anywhere except in the tails. For the diffusing standard deviation model with P(si |σi ) ∝ exp(−asni ), we can derive a compact expression for the ratio of the rate of adaptation immediately following an upward jump in variance σlow → σhigh versus the corresponding downward jump (see the appendix), up
rateup σˆ i − σlow ≡ down = ratedown σˆ i − σhigh
µ
σhigh σlow
¶n+1 ,
(3.27)
which is clearly asymmetric for all positive a and n, though the exact ratio depends on n. 3.4.2 Simultaneously Adapting to the Mean and Variance. In each of the examples above, the optimal estimate of the standard deviation σ¯ is a function of the absolute value of the signal s since P(−si |σi ) = P(si |σi ) in each case. Therefore, our results would be the same for a different signal s0 , which is positive definite:5 ½ 2P(si |σi ) if s0i > 0 (3.28) P(s0i |σi ) = 0 otherwise. A gaussian, nonnegative signal distributed according to s µ ¶ (s0i )2 2 0 exp − 2 P(si |σi ) = πσi2 2σi
(3.29)
has an “instantaneous” mean, µ0i , and standard deviation, σi0 , given by r 2 0 , (3.30) µi = σi π r 2 (3.31) σi0 = σi 1 − . π For an exponentially distributed nonnegative signal, µ 0¶ 1 s P(s0i |σi ) = exp − i , σi σi 5
(3.32)
Here we assume that the probability of si being exactly zero is of measure zero.
1194
Michael DeWeese and Anthony Zador
the “instantaneous” mean and standard deviation are the same: µ0i = σi0 = σi .
(3.33)
For all the models we have studied, the dynamics of optimal adaptation look the same whether the mean is constant or tightly coupled to the fluctuating standard deviation. 4 Discussion 4.1 Main Result. A system whose inputs span a large dynamic range can use adaptation to exploit their nonstationary statistics. Adaptation requires an estimate of the present statistics of the signal based on its recent past. Optimal estimation of the mean of a nonstationary signal is a well-studied problem. Here we have considered optimal estimation of the second-order statistics. We have shown that optimal adaptation to variance leads to asymmetrical dynamics. In particular, the optimal estimate for the variance tracks an abrupt increase in variance more closely than an abrupt decrease in environments, where the true dynamics of the variance is symmetric. This is true whether the mean is fixed or allowed to vary with the standard deviation. It is easy to show that this is not a feature of optimal adaptation to a time-varying mean in environments where the variance is fixed and the mean fluctuates in a symmetric fashion. 4.2 Intuition Behind the Main Result. Our basic result can be readily understood with the following intuitive argument. Consider the generic case for the instantaneous signal distribution, where P(si |σi ) has a single maximum that roughly coincides with its mean, and parameterize the distribution so that changing σi merely rescales si while leaving the mean unchanged (e.g., P(si |σi ) could be gaussian with zero mean and a standard deviation of σi ). When we probe with an upward jump in variance, the estimator expects to see an s that is not much larger than the old standard deviation, but instead measures a value that is on average equal to the new, larger standard deviation. After receiving this unlikely outlier, the estimator immediately infers that the standard deviation has increased. On the other hand, if the standard deviation jumps down, then the estimator will most likely receive an s that is much smaller than the old standard deviation, but this is near the peak of the distribution of s for any standard deviation, so it will wait for more data before lowering its estimate for the current standard deviation. This argument is based on the occurrence of outliers in the likelihood P(si |σi ) of the observed signal si given the current estimate for the standard deviation σi . If many independent presentations of the signal can be observed in the average time it takes for the standard deviation to fluc-
Asymmetric Dynamics in Optimal Variance Adaptation
1195
tuate between the high and low values of the probe stimulus, we should expect outliers in the likelihood distribution to dominate the behavior of our estimator immediately following jumps in the true standard deviation. This separation between the time scales of signal detection and changes in the environment is what makes the problem of adaptation interesting. If the environment completely changes between observations, then the optimal estimation strategy does not require dynamic adaptation, as we saw in section 3.1. As long as the prior distribution for the dynamics of the standard deviation is not too asymmetrical, our intuition about the likelihood should hold for the full Bayesian treatment. This is true no matter what form the likelihood takes—it can be asymmetric, multimodal, finite for positive values only, or something else—provided that it approaches zero only out in the tail(s). In the appendix, we prove that optimal adaptation is indeed faster after abrupt increases in the standard deviation for a large class of model environments, just as we saw for the previous examples. 4.3 Connection with Experiment. It is common to identify the firing rate of a cell in response to a stimulus of given intensity with its state of adaptation. Our formulation of the variance estimation problem does not specify what aspect of the output of any given cell type will reflect its state of adaptation, so we are not restricting ourselves to cells whose firing rates encode the mean or variance of a sensory signal. Implicit in our framework is the idea that adaptation is a useful property that is not due solely to fatigue of cellular mechanisms or saturation of their inputs. It is well documented that adaptation occurs even when saturation is not present (Shapley & Victor, 1978). We have stated our results in terms of ratios of adaptation times rather than absolute times, partly because we found a very general behavior for this ratio. In order to convert discrete time into experimental units, one would need to measure the “integration time” of the biological system being studied. In other words, observations of the signal at consecutive time steps in our framework represent effectively independent measurements, so our discrete time steps reflect both the filtering of the sensory system and the effective noise at the input. If the system is visual, then the spatial statistics of scenes from the creature’s environment can be convolved into the temporal statistics of the signal impinging on local regions of the retina through saccades and body motion. In qualitative agreement with our findings, recent experiments by Smirnakis, Berry, Warland, Bialek, and Meister (1997) have shown that ganglion cells in the salamander and rabbit retinas adapt faster to abrupt increases than decreases in variance. In these experiments, the activity of individual ganglion cells was monitored while the retina was stimulated with light whose contrast switched between a high and low value while the mean was fixed, as in Figure 1 (top). Between the jumps, in contrast, the trial-averaged
1196
Michael DeWeese and Anthony Zador
firing rate followed an exponential path, the time constant of which was always faster after upward jumps in contrast. In section 3.4.2, we showed that optimal adaptation is asymmetrical in several environments in which the mean fluctuates with the standard deviation. In natural scenes, the mean and standard deviation of light intensity are probably correlated. Naively, one might imagine that the mean and standard deviation are proportional since they have the same units. This would be the case in a room with spatially fixed light sources all controlled by a single dimmer switch and walls that have a reflectance that does not depend on the intensity of the light. Natural environments could in principle be much more complex than this.6 Unfortunately, the true relationship between the mean and standard deviation of natural scenes is not yet fully understood, but it is reasonable to assume that the standard deviation will be roughly proportional to the mean light level in many environments. For example, for one data set, the local standard deviation is roughly three times the mean with a correlation coefficient of about 1/2 (Ruderman, personal communication). It is well known that sensory cells (photoreceptors and ganglion cells in the retina) adapt more quickly to sudden increases than decreases in the mean light intensity (Barlow & Mollon, 1982). This is usually attributed to limitations of the biological machinery, but our results indicate that this behavior could reflect the optimal strategy for the environment they evolved in. These ideas give parameter-free predictions once the relevant biological constraints and input statistics are known. Appendix: A Proof and Some Exact Results A.1 Proof of Asymmetric Dynamics for Binary Variance Switching. We will now prove that under weak assumptions, optimal adaptation is faster immediately following upward jumps in the standard deviation σ of its input signal s than it is for downward jumps for a system that expects the standard deviation to switch between two values, σlow and σhigh , at random times (as in section 3.2.1). For clarity, we will prove our result for a specific signal distribution and then state a more general result, which can be proved following the same steps.
6 As a toy example of an environment where the mean and standard deviation are not proportional, consider a bird’s-eye view of a plowed field with parallel furrows running from north to south. Early in the day, when the sun is low on the horizon, the mean light level is low, and the standard deviation is comparable to the mean due to the shadows cast into the furrows. At noon, when the sun is high in the sky, there are few shadows since the sun can illuminate the bottoms and shallow sides of the furrows, so the entire scene is equally illuminated, resulting in a standard deviation much smaller than the (high) mean light intensity.
Asymmetric Dynamics in Optimal Variance Adaptation
1197
We make the following assumptions: 1. The standard deviation switches between a low value, σlow , and a high value, σhigh > 2σlow , at (discrete) times drawn from a homogeneous Poisson process; x is the probability of switching per time step. 2. The probability distribution of the (nonnegative) signal for fixed standard deviation is bounded everywhere except in the tail, y ≤ P(si |σi ) ≤ z for 0 ≤ si ≤ 3σi , and falls off exponentially in the tail, P(si |σi ) ∝ exp(−a(si /σi )) for si > 3σi . Note that P(si |σi ) need not be monotonically decreasing for 0 ≤ si ≤ 3σi . 3. The estimator makes many observations of the signal in the average time required for the standard deviation to drift between σlow and σhigh : x ¿ 1. All other parameters are of order unity so that x ¿ {1, y/z, yσlow , σlow /σhigh , 3}. Consider the probability that the optimal estimator will correctly detect an upward jump from σlow to σhigh . More precisely, we want the probability that σ¯ i = σhigh one time step after our estimator was “sure” that the standard deviation was at a low value, σlow , given that the true standard deviation is currently at the high value, σhigh : Z Pup ≡ dsi P(si |σi = σhigh )P(σi = σhigh |si , σi−1 = σlow ) Z P(si |σi = σhigh )P(σi = σhigh |σi−1 = σlow ) = dsi P(si |σi = σhigh ) P(si |σi−1 = σlow ) Z P(si |σi = σhigh ) , (A.1) = dsi P(si |σi =σlow ) 1 + 1−x x P(si |σi =σhigh ) where we have used Bayes’s theorem between the first and second lines, and reintroduced x, the probability of switching per time step. Our claim is that this is greater than the probability that the estimator will correctly detect a downward jump in the reverse situation: Z P(si |σi = σlow ) . (A.2) Pdown = dsi P(si |σi =σhigh ) 1 + 1−x x P(si |σi =σlow ) Under the conditions described above, P(si |σi = σhigh )/P(si |σi = σlow ) is never less than y/z for all allowed values of si , so we can always set x sufficiently small in equation A.2 to guarantee that Pdown is proportional to x: Z ∞ h i P2 (si |σi = σlow ) + O x2 dsi Pdown = x P(si |σi = σhigh ) 0 Z 3σhigh P2 (si |σi = σlow ) + dsi =x P(si |σi = σhigh ) 0
1198
Michael DeWeese and Anthony Zador
Z x
∞ 3σhigh
dsi
aσhigh 2 σlow
" exp −a
Ã
2 σlow #
"µ ¶ 2 σhigh z 2 + ≤x 2 y 2σhigh σlow − σlow
−
1 σhigh
! # si + O[x2 ]
+ O[x2 ].
We have made use of the fact that 0 < 3σhigh ≤ 1/y in the last line. Clearly, Pdown is proportional to our small parameter x as long as x ¿ σlow /σhigh since z/y is of order one. However, P(si |σi = σhigh )/P(si |σi = σlow ) grows arbitrarily large as we move further out in the tail, so Pup can remain finite as x → 0. If we restrict our attention to si > 3σhigh so we are in the tail of the distribution for either value of the standard deviation, the denominator of the integrand in the last line of equation A.1 is much greater than one only for h ln
s i ¿ s0 ≡
σhigh 1 σlow x
i
¢. ¡ a 1/σlow − 1/σhigh
(A.3)
If s0 ≥ σhigh 3, then we can exploit the fact that the integrand in equation A.1 is nowhere negative to rigorously bound Pup from below by isolating the contribution to the integral from the region si > s0 : Z Pup ≥
s0
≥N ≥
∞
dsi Z
∞
s0
yσhigh 2
P(si |σi = σhigh ) 1+
1−x P(si |σi =σlow ) x P(si |σi =σhigh )
£ ¤ exp −asi /σhigh dsi 2 ³ ´ 1 " # σ /σ −1 high low ea3 σlow x , a σhigh
(A.4)
where we have replaced the normalization factor N with its smallest possible value yea3 for fixed a and 3. By choosing a probe stimulus with σhigh > 2σlow , we can make the first term in the final line of equation A.4 sublinear in x, so that Pup is larger than Pdown for x sufficiently small. On the other hand, if s0 is less than σhigh 3, then we can safely integrate over the entire tail region of P(si |σhigh ) to get a different bound: Z Pup ≥
∞
σhigh 3
dsi
P(si |σi = σhigh ) 1+
1−x P(si |σi =σlow ) x P(si |σi =σhigh )
¤ £ exp −asi /σhigh ≥N dsi 2 3σhigh yσhigh , ≥ 2a Z
∞
(A.5)
Asymmetric Dynamics in Optimal Variance Adaptation
1199
which is greater than Pdown for all a such that yσhigh a< 2x
"µ ¶ #−1 2 σhigh z 2 + . 2 y 2σhigh σlow − σlow
(A.6)
If a is too large to satisfy this inequality, then we focus on the region in the tail of P(si |σlow ) but not of P(si |σhigh ): Z Pup ≥
3σlow
Z ≥
3σhigh
dsi
3σhigh
dsi
P(si |σi = σhigh ) 1+
1−x P(si |σi =σlow ) x P(si |σi =σhigh )
y 1 z exp(a3−asi /σlow ) x y
3σlow 1+ · ¸ xy exp[a3(σhigh /σlow − 1)] + z σlow y ln ≥ a µ xy + z · ¸¶ xy σlow y a3(σhigh /σlow − 1) + ln ≥ a xy + z ≥ y3(σhigh − σlow ) − O[x ln(x)],
(A.7)
where we have assumed that a is at least of order 1/x in the last line, which is the one case not covered by the two previous bounds on Pup . Again, 3, yσlow , and yσhigh are all much greater than x, so this bound is greater than our upper bound for Pdown . We have shown that the optimal estimator adapts more quickly to abrupt increases than decreases in the standard deviation of its input for the conditions enumerated above. By following the same steps, one can prove the same result by relaxing condition 2 to include all signal distributions with either one or two tails that decrease monotonically with |si |; each tail must take the same form for any value of the standard deviation, so that changing the standard deviation amounts to rescaling s and renormalizing: ¯ à ! σlow σlow ¯¯ P s¯ σlow P(s|σhigh ) = σhigh σhigh ¯ ¶[µ ¶ µ si si < −3L > 3R . for σi σi
(A.8)
The tails must fall off faster than a power law to ensure that the variance and all higher moments are well defined. We can also permit a region about zero that is less than y and even zero so long as the form of P(si |σi ) in this region remains unchanged up to a factor of order one (i.e., s does not scale with σ in this region), as would be the case for a detector whose sensitivity falls off for small signal strengths placed in an environment where low signal values always occur with some finite probability. In other words,
1200
Michael DeWeese and Anthony Zador
y < P(si |σi ) < z for −σi 3L ≤ si ≤ −σhigh λL and σhigh λR ≤ si ≤ σi 3R ; but P(si |σi = σhigh ) = P(si |σi = σlow ) for −σhigh λL ≤ si ≤ σi λR . In each case one has to check that the ratio σhigh /σlow is sufficiently large to complete the proof. For example, if the tail with the slowest rate of decrease falls off like exp[−a|si /σi |n ], then our result holds provided that σhigh /σlow > (n + 1)/n. This is a very general set of functions that give well-defined moments. As one might expect, the faster the tail(s) fall off, the more inclusive is the range of values of the probe’s standard deviation. A.2 Analytic Expressions for Smoothly Changing Variance. We can derive an exact expression for the rate of adaptation to the standard deviation σ of a stochastic variable s immediately after an abrupt jump in the standard deviation for a system optimized for a diffusing standard deviation. The quantity we want to calculate is the average value of σ¯ one time step after our estimator was “sure” that the standard deviation was at a low value, σlow , given that the standard deviation is currently at a high value, σhigh . We will assume that s is gaussian for our derivation, and then state the general solution for P(s|σ ) ∝ exp(−a|s|n ) for all positive a and n. We begin our derivation by replacing P(σi−1 |sj
σi
Z ≡
Z dσi σi
dsi P(si |σi = σhigh )P(σi |si , σi−1 = σlow ) ³ 2 ´ R s −σlow )2 Z 2 2 dσi exp − 2σi 2 − (σi4D1t e−si /2(σhigh ) ³ i2 ´. = dsi q R s −σlow )2 2π(σhigh )2 dσi σ1i exp − 2σi 2 − (σi4D1t
(A.9)
i
The amount that this differs from σlow is proportional to the effective rate of adaptation immediately after the upward jump in variance. The effective upward adaptation rate can then be compared to the corresponding difference for a switch from σlow to σhigh . The main trick for solving the integrals in equation A.9 is to make use of the fact that each time step is short compared to the correlation time of the diffusion process, which allows us to expand about D1t = 0: µ ¶ (σi − σlow )2 1 exp − → δ(σi − σlow ) √ 4D1t 4D1t +
2D1t ∂ 2 δ(σi − σlow ) . 2! ∂σi2
(A.10)
We do not write this as an equality since it is valid to make this substitution only when the gaussian appears inside an integral over its argument σi . To
Asymmetric Dynamics in Optimal Variance Adaptation
1201
derive this substitution, consider a gaussian with a small variance v multiplied by a doubly differentiable function f inside an integral. Expanding f in a Taylor series gives: Z
¯ 2 v ∂ 2 f ¯¯ e−x /2v f (x) ≈ f (0) + . dx √ 2! ∂x2 ¯x=0 2π v
(A.11)
This is exactly what we would have gotten by making the above substitution for the gaussian within the integral and integrating by parts. With this substitution, we expand both integrals over σi to first order in D1t, then expand the full si integrand to first order and perform the integral to obtain: # "µ ¶ σhigh 2 2 up −1 . (A.12) σi = σlow + D1t σlow σlow Finally, we repeat this procedure for a downward jump from σhigh to σlow and find a simple form for the ratio of upward-to-downward adaptation rates immediately after a jump in the standard deviation: up
σi − σlow = σhigh − σidown
µ
σhigh σlow
¶3 .
(A.13)
This result is valid for gaussian s; by repeating the calculation for P(si |σi ) = exp(−a|si |n ), we can derive a more general expression: up
σi − σlow = σhigh − σidown
µ
σhigh σlow
¶n+1 ,
(A.14)
which is valid for all positive a and n. Again we see that optimal adaptation is faster following abrupt increases in standard deviation for signal distributions with well-defined moments at all orders. Acknowledgments We thank W. Bialek, E. J. Chichilnisky, B. Pearlmutter, and R. Shapley for many useful discussions. We are especially grateful to M. Berry, M. Meister, S. Smirnakis, and D. Warland for sharing their findings with us before publication. This work was supported by the Sloan Foundation for Theoretical Neuroscience. References Barlow, H., & Mollon, J. (1982). The senses (2nd ed.). Cambridge: Cambridge University Press.
1202
Michael DeWeese and Anthony Zador
Bonds, A. (1991). Temporal dynamics of contrast gain in single cells of the cat striate cortex. Visual Neuroscience, 6, 239–255. deLange, H. (1958). Research into the dynamic nature of the human fovea— cortex systems with intermittent and modulated light. I. Attenuation characteristics with white and colored light. J. Opt. Soc. Am., 48, 777–784. Enroth-Cugell, C., & Robson, J. (1996). The contrast sensitivity of retinal ganglion cells of the cat. J. Physiol., 187, 517–552. Giaschi, D., Douglas, R., Marlin, S., & Cynader, M. (1993). The time course of direction-selective adaption in simple and complex cells in cat striate cortex. J. Neurophysiol., 70, 2024–2034. Kelly, D. (1961). Visual responses to time-dependent stimuli. I. J. Opt. Soc. Am., 51, 422. Papoulis, A. (1991). Probability, random variables and stochastic processes (3rd ed.). New York: McGraw-Hill. Shapley, R. (1997). Adapting to the changing scene. Current Biology, 7 (7), 421– 423. Shapley, R., & Enroth-Cugell, C. (1984). Visual adaptation and retinal gain controls. In N. N. Osborne & G. J. Chader (Eds.), Progress in retinal research (vol. 3, pp. 263–346). Oxford: Pergamon Press. Shapley, R., Enroth-Cugell, C., Bonds, A., & Kirby, A. (1972). Gain control in the retina and retinal dynamics. Nature, 236, 352–353. Shapley, R., & Victor, J. (1978). The effect of contrast on the transfer properties of cat retinal ganglion cells. J. Physiol., 285, 275–298. Shapley, R., & Victor, J. (1979). Nonlinear spatial summation and the contrast gain control of cat retinal ganglion cells. J. Physiol., 290, 141–161. Smirnakis, S., Berry, M., Warland, D., Bialek, W., & Meister, M. (1997). Retinal processing adapts to image contrast and spatial scale. Nature, 385, 69–73. Received January 29, 1997; accepted November 14, 1997.
LETTER
Communicated by Radford Neal
Computation with Infinite Neural Networks Christopher K. I. Williams Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.
For neural networks with a wide class of weight priors, it can be shown that in the limit of an infinite number of hidden units, the prior over functions tends to a gaussian process. In this article, analytic forms are derived for the covariance function of the gaussian processes corresponding to networks with sigmoidal and gaussian hidden units. This allows predictions to be made efficiently using networks with an infinite number of hidden units and shows, somewhat paradoxically, that it may be easier to carry out Bayesian prediction with infinite networks rather than finite ones. 1 Introduction To someone training a neural network by maximizing the likelihood of a finite amount of data, it makes no sense to use a network with an infinite number of hidden units; the network will “overfit” the data and so will be expected to generalize poorly. However, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian; a model should be chosen that reflects the understanding of the problem, and then application of Bayes’ theorem allows inference to be carried out (at least in theory) after the data are observed. In the Bayesian treatment of neural networks, a question immediately arises as to how many hidden units are believed to be appropriate for a task. Neal (1996) has argued compellingly that for real-world problems, there is no reason to believe that neural network models should be limited to nets containing only a “small” number of hidden units. He has shown that it is sensible to consider a limit where the number of hidden units in a net tends to infinity and that good predictions can be obtained from such models using the Bayesian machinery. He has also shown that for fixed hyperparameters, a large class of neural network models will converge to a gaussian process prior over functions in the limit of an infinite number of hidden units. Neal’s argument is an existence proof. It states that an infinite neural net will converge to a gaussian process but does not give the covariance function needed to specify the particular gaussian process. In this article (which is a revised and expanded version of Williams, 1997), I show that for c 1998 Massachusetts Institute of Technology Neural Computation 10, 1203–1216 (1998) °
1204
Christopher K. I. Williams
certain weight priors and transfer functions in the neural network model, the covariance function that describes the behavior of the corresponding gaussian process can be calculated analytically. This allows predictions to be made using neural networks with an infinite number of hidden units in time O(n3 ), where n is the number of training examples.1 The only alternative currently available is to use Markov chain Monte Carlo (MCMC) methods (e.g., Neal, 1996) for networks with a large (but finite) number of hidden units. However, this is likely to be computationally expensive, and there are possible concerns over the time needed for the Markov chain to reach equilibrium. The availability of an analytic form for the covariance function also facilitates the comparison of the properties of neural networks with an infinite number of hidden units as compared to other gaussian process priors that may be considered. The gaussian process analysis applies for fixed hyperparameters θ . If it were desired to make predictions based on a hyperprior P(θ ), then the necessary θ -space integration could be achieved by MCMC methods. (This is described in more detail in section 3.4.) In comparison, for networks with a large but finite number of hidden units, MCMC methods (e.g., Neal, 1996) require us to carry out integration over both the weight and hyperparameter spaces. Thus, the great advantage of integrating out the weights analytically for infinite networks is that it yields much lower-dimensional integrals to be computed by MCMC methods than with finite nets and should thus improve the computational efficiency of these techniques. 1.1 From Priors on Weights to Priors on Functions. Bayesian neural networks are usually specified in a hierarchical manner, so that the weights w are regarded as being drawn from a distribution P(w|θ ). For example, the weights might be drawn from a zero-mean gaussian distribution, where θ specifies the variance of groups of weights. A full description of the prior is given by specifyingRP(θ ) as well as P(w|θ ). The hyperprior can be integrated out to give P(w) = P(w|θ )P(θ ) dθ , but in our case it will be advantageous not to do this as it introduces weight correlations that prevent convergence to a gaussian process. In the Bayesian view of neural networks, predictions for the output value y∗ corresponding to a new input value x∗ are made by integrating over the posterior in weight space. Let D = ((x1 , t1 ), (x2 , t2 ), . . . , (xn , tn )) denote the n training data pairs where ti denotes the target corresponding to input xi , t = (t1 , . . . , tn )T , and let f∗ (w) denote the value output by the network for input x∗ given weights w. Each ti is assumed to be a noise-corrupted version of an underlying yi . P(w|t, θ ) is the weight posterior given the
1 For large n, various approximations to the exact solution that avoid the inversion of an n × n matrix are available; see, e.g., Gibbs and MacKay (1997) and the discussion in section 4.
Computation with Infinite Neural Networks
1205
training data.2 Then the predictive distribution for y∗ given the training data and hyperparameters θ is Z P(y∗ |t, θ ) =
δ(y∗ − f∗ (w))P(w|t, θ ) dw.
(1.1)
We will now show how this can also be viewed as making the prediction using priors over functions rather than weights. Let f (w) denote the vector of outputs corresponding to inputs (x1 , . . . , xn ) given weights w. Then, using Bayes’ theorem, we have P(w|t, θ ) = P(t|w)P(w|θ )/P(t|θ ), R and P(t|w) = P(t|y ) δ(y − f (w)) dy . Hence equation 1.1 can be rewritten as P(y∗ |t, θ ) =
1 (1.2) P(t|θ ) Z Z × P(t|y ) δ(y∗ − f∗ (w))δ(y − f (w)) P(w|θ ) dw dy .
However, the priorR over (y∗ , y1 , . . . , yn ) is given by P(y∗ , y |θ ) = P(y∗ |y , θ )P(y |θ ) = δ(y∗ − f∗ (w)) δ(y − f (w))P(w|θ ) dw and thus the predictive distribution can be written as Z 1 P(t|y )P(y∗ |y , θ )P(y |θ ) dy P(t|θ ) Z = P(y∗ |y , θ )P(y |t, θ ) dy .
P(y∗ |t, θ ) =
(1.3)
Hence, in a Bayesian view, it is the prior over function values P(y∗ , y |θ ) that is important; specifying this prior by using weight distributions is one valid way to achieve this goal. In general, we can use the weight space or function space view, whichever is more convenient, and for infinite neural networks the function space view is more useful. 2 Gaussian Processes A stochastic process is a collection of random variables {Y(x)|x ∈ X} indexed by a set X. In our case X will be Rd , where d is the number of inputs. The stochastic process is specified by giving the probability distribution for every finite subset of variables Y(x1 ), . . . , Y(xk ) in a consistent manner. A gaussian process is a stochastic process that can be fully specified by its mean function µ(x) = E[Y(x)] and its covariance function C(x, x0 ) = E[(Y(x) − µ(x))(Y(x0 ) − µ(x0 ))]; any finite set of Y variables 2
For notational convenience, we suppress the x-dependence of the posterior.
1206
Christopher K. I. Williams
will have a joint multivariate gaussian distribution. For a multidimensional input space, a gaussian process may also be called a gaussian random field. Below we consider gaussian processes that have µ(x) ≡ 0, as is the case for the neural network priors discussed in section 3. A nonzero µ(x) can be incorporated into the framework at the expense of a little extra complexity. A widely used class of covariance functions is the stationary covariance functions, whereby C(x, x0 ) = C(x − x0 ). These are related to the spectral density (or power spectrum) of the process by the Wiener-Khinchine theorem and are particularly amenable to Fourier analysis as the eigenfunctions of a stationary covariance kernel are exp ik.x. Many commonly used covariance functions are also isotropic, so that C(h) = C(h) where h = x − x0 and h = |h|. For example, C(h) = exp(−(h/σ )ν ) is a valid covariance function for all input dimensionalities d ≥ 1 and for 0 < ν ≤ 2 (Yaglom, 1987, p. 137). Note that in this case, σ sets the correlation length scale of the random field, although other covariance functions (e.g., those corresponding to power-law spectral densities) may have no preferred length scale. A simple example of a nonstationary covariance function is obtained by letting Y(x) = wT φ(x) for some set of basis functions {φi (x)}. Letting w ∼ N(0, 6), we obtain Cφ (x, x0 ) = φT (x)6 φ(x0 ). Gaussian random fields can have very different properties depending on the choice of covariance function. For example, in one dimension, the Ornstein-Uhlenbeck process (with covariance function e−h ) has sample paths that are not mean-square differentiable, implying that the sample paths are very “rough.” By contrast, choosing y = wT φ(x) with φ(x) = (1, x)T leads to straight-line sample paths of the form y = w0 + w1 x. 2.1 Prediction with Gaussian Processes. The model for the observed data was generated from the prior stochastic process, and independent gaussian noise (of variance σν2 ) was then added. Given a prior covariance function CP (xi , xj ), a noise process CN (xi , xj ) = σν2 δij (i.e., independent noise of variance σν2 at each data point), and the training data, the prediction for the distribution of y∗ corresponding to a test point x∗ is obtained simply by applying equation 1.3. Because both the prior and noise model are gaussian, the integral can be calculated analytically and P(y∗ |t, θ ) is gaussian with mean and variance ˆ x∗ ) = kTP (x∗ )(KP + KN )−1 t, y( σyˆ2 (x∗ )
= CP (x∗ , x∗ ) −
kTP (x∗ )(KP
(2.1) −1
+ KN ) kP (x∗ ),
(2.2)
where [KP ]ij = CP (xi , xj ), [KN ]ij = CN (xi , xj ), and kP (x∗ ) = (CP (x∗ , x1 ), . . . , CP (x∗ , xn ))T . ˆ x∗ ) (as given It is interesting to understand the form of the predictor y( by equation 2.1). For any given data set (KP + KN )−1 t is a fixed vector. If ˆ x∗ ) is a linear combination of radial the covariance function is isotropic, y(
Computation with Infinite Neural Networks
1207
basis functions, as observed by Poggio and Girosi (1990), as each element in kP (x∗ ) is a radial function. (Poggio and Girosi derived radial basis functions by regularization theory, which is equivalent to the gaussian process view; see Girosi, Jones, & Poggio, 1995, and Wahba, 1990, for more details.) Equations 2.1 and 2.2 are the analog for spatial processes of WienerKolmogorov prediction theory. They have appeared in a wide variety of contexts, including geostatistics, where the method is known as “kriging” (Journel & Huijbregts, 1978; Cressie, 1993), multidimensional spline smoothing (Wahba, 1990), and in the work of Whittle (1963). It is also possible to view regression with support vector machines (Vapnik, 1995) in terms of gaussian processes. The prior over functions is taken to be a gaussian process, but instead of using a squared error loss function (which corresponds to gaussian noise), a modified version of the l1 error metric |ti − yi | is used, called the ²-insensitive loss function. Finding the maximum a posteriori y-values for the training and test points can now be achieved using quadratic programming (see Vapnik, 1995, for details). 3 Covariance Functions for Neural Networks Consider a network that takes an input x, has one hidden layer with H units, and then linearly combines the outputs of the hidden units with a bias b to obtain f (x). The mapping can be written f (x) = b +
H X
vj h(x; uj ),
(3.1)
j=1
where h(x; u) is the hidden unit transfer function (which we shall assume is bounded), which depends on the input-to-hidden weights u. This architecture is important because it has been shown by Hornik (1993) that networks with one hidden layer are universal approximators as the number of hidden units tends to infinity, for a wide class of transfer functions (but excluding polynomials). Let b and the v’s have independent zero-mean distributions of variance σb2 and σv2 , respectively, and let the weights uj for each hidden unit be independently and identically distributed. Denoting all weights by w, we obtain (following Neal, 1996), Ew [ f (x)] = 0 0
Ew [ f (x) f (x )] =
σb2
+
X
(3.2) σv2 Eu [hj (x; u)hj (x0 ; u)]
(3.3)
j
= σb2 + Hσv2 Eu [h(x; u)h(x0 ; u)],
(3.4)
where equation 3.4 follows because all of the hidden units are identically distributed. The final term in equation 3.4 becomes ω2 Eu [h(x; u)h(x0 ; u)] by letting σv2 scale as ω2 /H.
1208
Christopher K. I. Williams
The sum in equation 3.3 is over H identically and independently distributed random variables. As the transfer function is bounded, all moments of the distribution will be bounded, and the central limit theorem can be applied. Hence, if we assume that b has a gaussian distribution, the stochastic process will converge to a gaussian process in the limit as H → ∞. From general central limit theorem results, one would expect a rate of convergence of H−1/2 toward a gaussian process prior. By evaluating Eu [h(x)h(x0 )] for all x and x0 in the training and testing sets, we can obtain the covariance function needed to describe the neural network as a gaussian process. These expectations are, of course, integrals over the relevant probability distributions of the biases and input weights. In the following sections, two specific choices for the transfer functions are considered: a sigmoidal function and a gaussian. Gaussian weight priors are used in both cases. This analysis cannot be taken a stage further to integrate out any hyperparameters as well. For example, the variance σv2 of the v weights might be drawn from an inverse gamma distribution. In this case, the distribution R P(v ) = P(v |σv2 )P(σv2 ) dσv2 is no longer the product of the marginal distributions for each v weight (in fact, it will be a multivariate t-distribution). A similar analysis can be applied to the u weights with a hyperprior. The effect is to make the hidden units nonindependent, so that the central limit theorem can no longer be applied. 3.1 Sigmoidal Transfer Function. A sigmoidal transfer function is a very common choice in neural networks research; nets with this architecture are usually called multilayer perceptrons. P Below we consider the transfer function h(x; u) = erf(u0 + di=1 uj xi ), √ Rz 2 where erf(z) = 2/ π 0 e−t dt is the error function, closely related to the cumulative distribution function for the gaussian distribution. Appropriately scaled, the graph of this function is very similar to the tanh function, which is more commonly used in the neural networks literature. def
In calculating V(x, x0 ) = Eu [h(x; u)h(x0 ; u)] we make the usual assumptions (MacKay, 1992) that u is drawn from a zero-mean gaussian distribution ˜ = (1, x1 , . . . , xd )T be with covariance matrix 6, that is, u ∼ N(0, 6). Let x an augmented input vector whose first entry corresponds to the bias. Then Verf (x, x0 ) can be written as Verf (x, x0 ) =
1
Z
˜ ) erf(uT x ˜ 0) erf(uT x |6|1/2 ¶ µ 1 × exp − uT 6 −1 u du. 2 (2π)
d+1 2
(3.5)
Let C be the square root of 6 −1 , so that CT = C and 6 −1 = C2 . Let u0 = Cu.
Computation with Infinite Neural Networks
1209
Then equation 3.5 becomes Verf (x, x0 ) =
1
(3.6) d+1 (2π) 2 µ ¶ Z 1 ˜ ) erf(u0T C−1 x ˜ 0 ) exp − u0T u0 du0 , × erf(u0T C−1 x 2
which can be evaluated analytically (see the appendix) to give Verf (x, x0 ) =
˜ T 6x ˜0 2x 2 sin−1 q . π ˜ T 6x ˜ )(1 + 2x ˜ 0T 6 x ˜ 0) (1 + 2x
(3.7)
We observe that this covariance function is not stationary, which makes sense because the distributions for the weights are centered about zero, and hence translational symmetry is not present. Note that the range of Verf (x, x0 ) is [−1, 1], and thus negative values of the covariance function are allowed.3 To examine this covariance function in a little more detail, we now consider a diagonal weight prior 6 = diag(σ02 , σI2 , . . . , σI2 ), so that the inputs i = 1, . . . , d have a different weight variance to the bias σ02 . Then for |x|2 and |x0 |2 À (1 + 2σ02 )/2σI2 , we find that Verf (x, x0 ) ' 1 − 2θ/π , where θ is the angle between x and x0 . Again this makes sense intuitively; if the model is made up of a large number of sigmoidal functions in random directions (in x space), then we would expect points that lie diametrically opposite (at x and −x) to be anticorrelated, because they will lie in the +1 and −1 regions of the sigmoid function for most directions. In the other extreme, σ02 À max(σI2 |x|2 , σI2 |x0 |2 , 1), the zero crossings of most of the sigmoids will be further away from the origin than x and x0 ; thus, x and x0 will lie in the same tail of each sigmoid and hence Verf (x, x0 ) ' 1, meaning that any prediction in this region is just the average of all of the t’s. It is also worth noting that for “projective” basis functions (those for ˜ ) for some function ψ(·)), the covariance structure which h(x; u) = ψ(uT x ˜ with u ∼ N(0, 6) obtained by passing the gaussian process Y(x) = uT x through the function ψ is the same as the covariance term Eu [h(x; u)h(x0 ; u)] as found in equation 3.4. In this section, we have derived the covariance function for an erf network with an infinite number of hidden units. In the reverse direction, Girosi et al. (1995) and Wahba (1996) have discussed how the prior covariance function ˆ x), which is a linear combination may be chosen so as to obtain a predictor y( of sigmoidal basis functions. 3 Although many commonly used covariance functions are nonnegative, there are examples such as the exponentially damped cosine C(x, x + τ ) = exp(−ατ ) cos ωτ , which obtain negative values; see Yaglom (1987, pp. 366–367).
1210
Christopher K. I. Williams
3.2 Gaussian Transfer Function. One other very common transfer function used in neural networks research is the gaussian, so that h(x; u) = exp[−(x − u)T (x − u)/2σg2 ], where σg2 is the width parameter of the gaussian. Gaussian basis functions are often used in radial basis function (RBF) networks (Poggio & Girosi, 1990; Broomhead & Lowe, 1988). For a gaussian prior over the distribution of u so that u ∼ N(0, σu2 I), VG (x, x0 ) =
1 (2πσu2 )d/2 × exp −
Z exp −
(x − u)T (x − u) 2σg2
(x0 − u)T (x0 − u) uT u exp − 2 du. 2 2σg 2σu
(3.8)
By completing the square and integrating out u, we obtain, VG (x, x0 ) =
µ
½ ¾ xT x exp − 2 2σm ¾ ½ ¾ ½ x0T x0 (x − x0 )T (x − x0 ) exp − , × exp − 2σs2 2σm2 σe σu
¶d
(3.9)
where 1/σe2 = 2/σg2 + 1/σu2 , σs2 = 2σg2 + σg4 /σu2 , and σm2 = 2σu2 + σg2 . This formula can be generalized by allowing covariance matrices 6b and 6u in place of σg2 I and σu2 I; rescaling each input variable xi independently is a simple example. Again this is a nonstationary covariance function, although it is interesting to note that if σu2 → ∞ (while scaling ω2 appropriately), we find that VG (x, x0 ) ∝ exp{−(x − x0 )T (x − x0 )/4σg2 }. For a finite value of σu2 , VG (x, x0 ) is a stationary covariance function “modulated” by the gaussian decay function exp(−xT x/2σm2 ) exp(−x0T x0 /2σm2 ). Clearly if σm2 is much larger than the largest distance in x-space, then the predictions made with VG and a gaussian process with only the stationary part of VG will be very similar. It is also possible to view a network with gaussian transfer functions as an example of a shot-noise process based on an inhomogeneous Poisson process. A realization of a shot-noise process is obtained by placing a kernel (or shape function) at each location defined by a Poisson process in Rd . It is also possible to allow the kernel to have some random component such as a magnitude. In this case, we can write, Y(x) =
X
Vn h(x, xn ),
(3.10)
n
where the xn ’s are the Poisson distributed, h(·, ·) is the kernel (which will often be a function of x − xi ), and the Vn ’s are i.i.d. random variables. It can be shown that the mean and covariance functions of this shot-noise process
Computation with Infinite Neural Networks
1211
are given by (see Parzen, 1962, sec. 4.5 for details), Z E[Y(x)] = E[V] cov[Y(x)Y(x0 )] = E[V 2 ]
Z
h(x, ξ )ν(ξ ) dξ h(x, ξ )h(x0 , ξ )ν(ξ ) dξ ,
(3.11) (3.12)
where ν(ξ ) is the rate of the underlying Poisson process. These results are known as Campbell’s theorem. By choosing the kernel function to be a gaussian and the rate function of the form ν(x) = ν0 exp(−xT x/2σu2 ), we obtain a gaussian process in the limit of ν0 → ∞ with E[V 2 ] scaled such that ν0 E[V 2 ] remains finite. This gives a geometrical interpretation of a realization of the network as comprising kernels centered at random points uj in x-space and with each kernel scaled by a random variable corresponding to the hidden-to-output weight vj . For the case that the Poisson process has constant rate, Mat´ern (1980, Table 1, p. 30) gives some examples of other kernel functions and their corresponding covariance functions. 3.3 Comparing Covariance Functions. The priors over functions specified by sigmoidal and gaussian neural networks differ from the stationary covariance functions that are usually employed when using gaussian processes for prediction. The complex exponential exp ik.x is an eigenfunction of a stationary and isotropic covariance function, and hence the spectral density (or power spectrum) S(k) (k = |k|) nicely characterizes the corresponding stochastic process. Roughly speaking, the spectral density describes the “power” at a given spatial frequency k; for example, splines have S(k) ∝ k−β . Nonstationary processes cannot be analyzed in exactly this fashion because the complex exponentials are not (in general) eigenfunctions of a nonstationary kernel. Instead, we must consider the eigenfunctions defined R by C(x, x0 )φ(x0 )dx0 = λφ(x). This difference in eigenfunctions makes direct comparison of stationary and nonstationary covariance functions difficult. The distinction between stationary and nonstationary covariance functions can be reduced to some extent if we consider gaussian process prediction in the presence of a nonnegative weight function w(x). In this case, it turns out that the covariance function that we need to consider is K(x, x0 ) = w1/2 (x)C(x, x0 )w1/2 (x0 ) (see Zhu, Williams, Rohwer, & Morciniec, 1997; Castro, Lawton, & Sylvestre, 1986), where C(x, x0 ) is the original covariance function that applies under uniform weighting. For example, there is a close connection between the eigenfunctions of the covariance function VG (x, x0 ) obtained in the gaussian transfer function case and K(x, x0 ) derived from a squared exponential covariance function C(x, x0 ) ∝ exp −(x − x0 )T (x −
1212
Christopher K. I. Williams
x0 )/2σg2 under a gaussian weight function. The eigenvalues and functions for this K(x, x0 ) are given in Zhu et al. (1997).4 3.4 Integration over the Hyperparameters. In the derivations above, the hyperparameters θ were fixed. However, in a real data analysis problem, it would be unlikely that appropriate values of these parameters would be known. Given a prior distribution P(θ ), predictions should be made by integrating over the posterior distribution P(θ |t) ∝ P(θ )P(t|θ ), where P(t|θ ) is the likelihood of the training data t under the model; P(t|θ ) is easily computed for a gaussian process. The prediction y(x) for test input x is then given by Z y(x) =
yˆ θ (x)P(θ |D) dθ ,
(3.13)
where yˆ θ (x) is the predicted mean (as given by equation 2.1) for a particular value of θ . This integration is not tractable analytically, but MCMC methods such as hybrid Monte Carlo can be used to approximate it. This strategy was used in Williams and Rasmussen (1996) and Rasmussen (1996), but for stationary covariance functions, not ones derived from neural networks. 4 Discussion The work described shows how to calculate the covariance function for sigmoidal and gaussian basis function networks. It is probable that similar techniques will allow covariance functions to be derived analytically for networks with other kinds of basis functions as well; these may turn out to be similar in form to covariance functions already used in the gaussian process literature. This article extends the work of Neal (1996) in examining what kinds of priors over functions one obtains by specifying particular weight priors. This is important, as it is necessary to understand prior assumptions when carrying out Bayesian modeling. The existence of analytic expressions for the covariance functions allows predictions to be made efficiently from neural networks that have an infinite number of hidden units. It is also interesting to consider the differences between finite and infinite neural network priors. One difference is that in functions generated from finite neural networks, the effects of individual basis functions can be seen. For example, with sigmoidal units, a steep step may be observed where one basis function (with large weights) comes into play. A judgment 4 For the one-dimensional problem, the eigenfunctions are an appropriately scaled gaussian multiplied by the Hermite polynomials. These are also the eigenfunctions of Schrodinger’s equation for the simple harmonic oscillator. For the d-dimensional case, the eigenfunctions are the tensor product of the one-dimensional eigenfunctions.
Computation with Infinite Neural Networks
1213
about whether this type of behavior is appropriate should depend on prior beliefs about the problem at hand. Of course, it is also possible to compare finite neural networks and gaussian process predictions empirically. This has been done by Rasmussen (1996), where gaussian process predictions (using MCMC for the hyperparameters) were compared to those from Neal’s MCMC Bayesian neural networks. Rasmussen’s results show that for a number of problems, the predictive performance of gaussian processes and Bayesian neural networks is similar. This similarity may be due to the fact that Rasmussen used networks with relatively large numbers of hidden units, so that results similar to a gaussian process predictor are not unexpected. Finite and infinite neural networks can also be compared on the basis of the computational requirements for prediction. With small data sets, as we argued above, it should be more effective to use gaussian processes, as this dramatically reduces the dimensionality of the space over which MCMC integrations are done. However, for large problems, the matrix inversion required in equation 2.1 can be expensive. Computational approximations can be used (e.g., Gibbs & MacKay, 1997), but it may also be interesting to investigate the use of finite neural networks as approximations to infinite ones. This would need to be contrasted with standard eigenfunction decompositions of the covariance, as described, for example, in Zhu et al. (1997). In this article, we have studied the covariance function derived from a given weight prior. It is also interesting to ask how one might try to specify a weight prior to approximate a given covariance function. There is some recent work by Moody and Rognvaldsson ¨ (1997) in this direction, where correlations between vj and uj in the prior are discussed. Appendix We wish to evaluate Verf (z , z 0 ) =
Z
1 (2π)
d+1 2
µ ¶ 1 erf(uT z˜ ) erf(uT z˜ 0 ) exp − uT u du, 2
(A.1)
where z˜ is the concatenation of 1 and z . The first step is to realize that the d + 1 dimensional integral in equation A.1 can be reduced to a two-dimensional integral by rotating the basis so that two basis vectors (e1 and e2 ) lie in the plane defined by z˜ and z˜ 0 and the rest are orthogonal to this plane, to give z˜ = a11 e1 + a12 e2 and z˜ 0 = a21 e1 + a22 e2 . Hence uT z˜ = a11 u1 + a12 u2 , and similarly for uT z˜ 0 . Thus, equation A.1 reduces to ! Ã Z Z u21 +u22 1 0 T T du1 du2 , (A.2) erf(a1 u) erf(a2 u) exp − Verf (z , z ) = 2π 2
1214
Christopher K. I. Williams
where a1 = (a11 , a12 )T and a2 = (a21 , a22 )T . Consider I(λ) =
1 2π
Z
erf(λaT1 u) erf(aT2 u)e−u
2
/2
du
(A.3)
where u2 denotes uT u. Clearly I(1) = Verf (z , z 0 ). Differentiating with respect to λ dI(λ) dλ Z 1 1 2 aT1 u erf(aT2 u) exp − uT (I + 2λ2 a1 aT1 )u du. (A.4) = √ 2π π 2
I0 (λ) =
˜ = B1/2 u, c1 = B−1/2 a1 , and c2 = B−1/2 a2 , we Putting B = I + 2λ2 a1 aT1 , u obtain Z 1 2 ˜ erf(cT2 u ˜ )e−u˜ /2 du ˜. (A.5) I0 (λ) = 3/2 1/2 cT1 u π |B| ˜ = c11 u˜ 1 + c12 u˜ 2 and integrating each of the two terms by parts Writing cT1 u yields Z 2cT1 c2 1 T ˜ (I + 2c2 cT2 )u ˜ du ˜ I (λ) = 2 1/2 exp − u π |B| 2 4 cT c2 1 = 1 π |B|1/2 |I + 2c2 cT2 |1/2 0
=
aT1 B−1 a2 4 . π |I + 2λ2 a1 aT1 + 2a2 aT2 |1/2
(A.6) (A.7) (A.8)
Using (I + gg T )−1 = I − gg T /(1 + g2 ) we find that
aT1 B−1 a2 =
aT1 a2 . 1 + 2λ2 a21
(A.9)
By explicit computation of the 2 × 2 determinant, it can be shown that 1 = |I + 2λ2 a1 aT1 + 2a2 aT2 | = 1 + 2λ2 a21 + 2a22 + 4λ2 [a21 a22 − (aT1 a2 )2 ].
(A.10) (A.11)
Thus, I0 (λ) =
aT1 a2 4 . π (1 + 2λ2 a21 )11/2
(A.12)
Computation with Infinite Neural Networks
1215
Using the substitution 2λaT1 a2 q , θ= q 1 + 2λ2 a21 1 + 2a22
(A.13)
we find I0 (λ) =
1 2 dθ . √ π 1 − θ 2 dλ
(A.14)
Thus I(λ) = 2/π sin−1 θ and Verf (z , z 0 ) = I(1) =
2aT1 a2 2 q sin−1 q . π 1 + 2a21 1 + 2a22
(A.15)
˜ , and z˜ 0 = C−1 x ˜0 Finally, using z˜ T z˜ = a21 , z˜ T z˜ 0 = aT1 a2 , z˜ 0T z˜ 0 = a22 , z˜ = C−1 x we obtain equation 3.7. Acknowledgments I thank David Saad and David Barber for help in obtaining the result in equation 3.7, and David Barber, Chris Bishop, Peter Dayan, Ian Nabney, Radford Neal, David Saad, and Huaiyu Zhu for comments on various drafts of the article. I also thank the anonymous referees for improvements to the article, as suggested by their comments. This work was partially supported by EPSRC grant GR/J75425, Novel Developments in Learning Theory for Neural Networks. References Broomhead, D., & Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2, 321–355. Castro, P. E., Lawton, W. H., & Sylvestre, E. A. (1986). Principal modes of variation for processes with continuous sample curves. Technometrics, 28(4), 329– 337. Cressie, N. A. C. (1993). Statistics for spatial data. New York: Wiley. Gibbs, M., & MacKay, D. J. C. (1997). Efficient implementation of gaussian processes. Available from http://wol.ra.phy.cam.ac.uk/mackay/ homepage.html. Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2), 219–269. Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6(8), 1069–1072. Journel, A. G., & Huijbregts, C. J. (1978). Mining geostatistics. San Diego: Academic Press.
1216
Christopher K. I. Williams
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3), 448–472. Mat´ern, B. (1980). Spatial variation (2nd ed.). Berlin: Springer-Verlag. Moody, J. E., & Rognvaldsson, ¨ T. S. (1997). Smoothing regularizers for projective basis functions. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Neal, R. M. (1996). Bayesian learning for neural networks. New York: SpringerVerlag. Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day. Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of IEEE, 78, 1481–1497. Rasmussen, C. E. (1996). Evaluation of gaussian processes and other methods for non-linear regression. Unpublished Ph.D. dissertation, Department of Computer Science, University of Toronto. Available from http:// www.cs.utoronto.ca/˜carl/. Vapnik, V. N. (1995). The nature of statistical learning theory. New York: SpringerVerlag. Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics. Wahba, G. (December, 1996). RBF’s, SBF’s, TreeBF’s, Smoothing Spline ANOVA: Representers and pseudo-representers for a dictionary of basis functions for penalized likelihood estimates. Slides presented at the NIPS workshop on Model Complexity. Available from http://www.stat.wisc.edu/ ˜wahba/wahba.html. Whittle, P. (1963). Prediction and regulation by linear least-square methods. London: English Universities Press. Williams, C. K. I. (1997). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 514–520). Cambridge, MA: MIT Press. Yaglom, A. M. (1987). Correlation theory of stationary and related random functions. Volume 1: Basic Results. Berlin: Springer-Verlag. Zhu, H., Williams, C. K. I., Rohwer, R. J., & Morciniec, M. (1997). Gaussian regression and optimal finite dimensional linear models (Tech. Rep. No. NCRG/97/011). Aston University, UK. Available from http://www.ncrg. aston.ac.uk/Papers/. Received March 17, 1997; accepted October 15, 1997.
LETTER
Communicated by Frederico Girosi
Bayesian Radial Basis Functions of Variable Dimension C. C. Holmes B. K. Mallick Department of Mathematics, Imperial College, London SW7 2BZ, U.K.
A Bayesian framework for the analysis of radial basis functions (RBF) is proposed that accommodates uncertainty in the dimension of the model. A distribution is defined over the space of all RBF models of a given basis function, and posterior densities are computed using reversible jump Markov chain Monte Carlo samplers (Green, 1995). This alleviates the need to select the architecture during the modeling process. The resulting networks are shown to adjust their size to the complexity of the data. 1 Introduction The regression of a target variable Y on an input set of covariates X given the data pairings D = {(y1 , x1 ), (y2 , x2 ), . . . , (yN , xN )} is an important topic in data analysis. The regression curve (surface) is assumed to be the conditional mean function m(x) = E[Y | X = x] and the recorded observations Y to be corrupted with gaussian noise ², so that yi = m(xi ) + ²i ,
(1.1)
where ²i are independent and identically distributed (i.i.d.) ∼ N(0, σ 2 ). In this article, we are concerned with the approximation of m(x) by an estimate ˆ m(x) of the form ˆ i) = m(x
K X
wj gj (xi ),
(1.2)
j=1
where g() represents a radial basis function (RBF) and w are output coefficients (weights). RBF networks prove ultimately flexible in the surfaces that they can fit (Powell, 1987). This can lead to severe overfitting of the data if the complexity of the model is not managed properly. This problem is commonly avoided by adjusting the number of basis functions or including a regularization term that penalizes oscillatory behavior in the models’ output. Here we take a Bayesian approach and consider the number of basis functions and values c 1998 Massachusetts Institute of Technology Neural Computation 10, 1217–1233 (1998) °
1218
C. C. Holmes and B. K. Mallick
of the coefficients to be unknown. We define a joint probability distribution over both model parameters and model dimension. Using suitably adjusted Markov chain Monte Carlo methods, we can then perform predictions and make inferences by integrating over both the model dimension and the parameter values within dimensions. This suppresses the need to select a single model or set of models for comparison or averaging, or both. When applying this approach to the analysis of RBF networks, we show that the resulting size of the most probable model is determined by the complexity of the data. Simpler problems lead to smaller networks. On this count, it can be compared with non-Bayesian approaches for choosing the network size such as the resource allocating network of Platt (1991) or the support vector methods of Vapnik, Golowich, & Smola (1997) (see also Roberts & Tarransenko, 1994, and Yingwei, Sundararajan, & Saratchandran, 1997). The ability of the network size to be driven by the complexity of the data has particular importance in the analysis of large data sets, where traditional approaches, which use a basis function located at every data point, carry computational overheads that are too large. In section 2 we provide an overview of Bayesian approaches to neural networks. Radial basis functions are briefly discussed in the beginning of section 3, where we also deal with the computational aspects of integrating in high-dimensional spaces of varying dimension. Section 4 lists the results of our approach on a number of benchmark test series, including the robot arm data set analyzed by MacKay (1992b) and the Wolf sunspot data given in Weigend, Huberman, & Rumelhart (1992). 2 Bayesian Neural Networks Bayesian statistics has been applied to the analysis of neural networks by, among others, MacKay (1992a, b) and Neal (1996). Bishop (1995) presents an excellent overview. The central process of the Bayesian framework is the calculation of a probability distribution on the unknown parameter (weight) vector θ. Prior knowledge that we might have, say for small weights, is updated in the light of experimental data. These posterior distributions are used in model predictions, with point forecasts given as expectations, Z E[Y | x, D] =
m(x, θ)p(θ | D)dθ,
(2.1)
where p(θ | D) represents the posterior probability of the parameters of the model m(·, θ) given the training data D. Ripley (1996) refers to this as the “predictive approach.” In this article, we also wish to take account of uncertainty between models M of different dimension (number of basis
Bayesian Radial Basis Functions of Variable Dimension
1219
functions), and we make this fact explicit by writing the expectation as E[Y | x, D] =
K Z X
m(x, θk , Mk )p(θk | D, Mk )p(Mk | D)dθk ,
(2.2)
k=0
where M = {M0 , . . . , MK } is the set of models entertained. We can compare equation 2.2 with the classical “plug-in” approach, ˆ k ), yˆ = m(x, θˆk , M
(2.3)
ˆ k set to some optiwhere θˆk represents the parameters of the “best” model M mum value. Equations 2.2 and 2.3 highlight the difference between classical and Bayesian methodologies for prediction. Bayesian analysis involves integrating out uncertainty in the parameter values; classical methods involve their optimization, as noted by Bishop (1995). The predictive approach in equation 2.2 can be considered a form of model averaging, whereby each model’s prediction is weighted by its posterior probability. In this respect, it is similar to such methods as stacked generalization (Wolpert, 1992). However, the weighting in a Bayesian framework is given as a combination of likelihood and prior, rather than by a value to be optimized during the modeling procedure. In addition, the method described here does not require the prior specification of the set of models M, which we construct dynamically during the model-fitting stage. For an overview of model averaging, see Jacobs (1995), Min and Zellner (1993), and the review paper by Genest and Zidek (1986), which includes an annotated bibliography of over 90 references. The Bayesian case for averaging over different models is presented by Draper (1995). In previous approaches to Bayesian neural networks, the model’s architecture has either been fixed prior to the data analysis or a small number of models are analyzed for averaging or selection.1 Clearly this will tend to produce suboptimal solutions if the “best” (in some sense) model happens not to be tested or the data are not well approximated by the architectures chosen. Recently Green (1995) described a Markov chain sampling method for Bayesian computation that can approximate the integrals in equation 2.2 when the set of models M is unknown. The technique accommodates uncertainty in both M and the parameter values θk . Green’s “reversible jumps” utilize Markov chains that can switch between dimensions while at the same time exploring the parameter space within a particular dimension (Richardson & Green, 1997). This method has many practical applications and is particularly well suited to nonparametric regression techniques of the form given in equation 1.2. In the analysis that follows, we consider 1 Numerous Bayesian model choice criteria exist, analogous to those in classical statistics (Key, 1996).
1220
C. C. Holmes and B. K. Mallick
RBFs, although the methods we adopt are generic and readily applicable to other types of networks, including multilayer perceptrons and multivariate adaptive regression splines (Denison, Mallick, & Smith, in press). 3 The Bayesian RBF Model 3.1 Radial Basis Functions. RBF networks have proved a popular method for approximating data (see Girosi, Jones, & Poggio, 1995, for a review). The models are feedforward networks of radial functions where each basis is parameterized by a knot or position vector µ located in the d-dimensional covariate space x. Conventionally there are as many basis functions as data points to be approximated with the position vectors set to the data values. The model output m(x) is given by a linear combination of the basis functions response and a low-order polynomial term, m(x) =
N X
wi φi (k x − µi k) +
i=1
p X
am qm (x),
(3.1)
m=0
where k · k denotes a distance metric, usually Euclidean or Mahalanobis, and qm (x) represents a polynomial of degree m. The coefficients w and a P are calculated by least squares where the constraint N i=1 wi qm (xi ) = 0, m = 0, . . . , p is imposed to ensure the uniqueness of the solution. The theory of RBFs specifies many permissible forms that the function φ() can take. Each of the acceptable basis types corresponds to a priori assumptions on the true regression surface being approximated (Girosi, 1994). Common choices include: • Cubic φ(z) = z3 . • Thin plate spline φ(z) = z2 log z. • Multiquadric φ(z) = (z2 + c2 )1/2 . • Linear φ(z) = z. • Inverse multiquadric φ(z) = (z2 + c2 )−1/2 . • Gaussian φ(z) = exp(−cz2 ).
Bayesian Radial Basis Functions of Variable Dimension
1221
There is some empirical evidence to suggest that nonlocal bases, where φ(z) → ∞ as z → ∞ , perform better than local basis functions and are, in addition, less sensitive to user set basis parameters c (Franke, 1982; Lowe, 1995). Following this, we choose to analyze the cubic, multiquadric, and thin plate spline functions. We will always include a linear term as the loworder polynomial in our model (see equation 2.3) and from now on take the vector w to incorporate the polynomial coefficients a. The RBF model specified by equation 2.3 interpolates the data. However, interpolation is rarely our goal when analyzing the majority of real-world data sets, which tend to be corrupted by measurement noise or fail to record salient independent variables. In these circumstances we need to reduce the flexibility of the model. This can be achieved by adding a small, positive regularization term, η, to the diagonal terms in the design matrix of the least-squares solution of w in equation 3.1 (Wahba, 1990). The larger the value of η, the smoother the output of the model. In the neural network community, it is more fashionable to control complexity by reducing the number of centers to less than the number of data points, a suggestion first advocated by Broomhead and Lowe (1988). The reduction in the dimension leads to simpler models with less variance but greater bias. In the extreme with no radial functions, we recover the standard linear model. These two approaches have the same aim: to reduce the complexity of the full model given in equation 2.3. In the next section, we describe a Bayesian framework for the RBF network and show how the complexity management is explicitly incorporated in a prior density that we define over the modelparameter space. 3.2 RBF Likelihood and Prior. The RBF networks are uniquely determined by the number of basis functions k, the position vectors µk , and the output coefficients wk . In a Bayesian framework, we are interested in the posterior probability densities of these parameters given the data set p(k, µ, w | D). This distribution is used for predictions and model inference. Following equation 2.2, we now define a “model” Mk to include the number and location of basis functions Mk = {k, µk } and the parameter θk to represent the coefficients wk within the model structure. We shall use the terms {Mk , θk } and {k, µk , wk } interchangeably. The posterior distribution p(Mk , θk | D) is given as a combination of likelihood and prior p(Mk , θk | D) =
l(D | θk , Mk )p(θk , Mk ) , p(D)
(3.2)
where l(D | θk , Mk ) is the likelihood function and p(θk , Mk ) is the prior density.
1222
C. C. Holmes and B. K. Mallick
Assuming a normally distributed noise term, we obtain the following log-likelihood of the RBF network, up to an additive constant, L(D | θk , Mk ) = −n log σ −
n 1 X [yi − m(xi , θk , Mk )]2 , 2σ 2 i=1
(3.3)
The prior term in equation 3.2 is used to express subjective beliefs about the nature of the posterior density of θk and Mk . It is the prior density that is used to incorporate preferences for simpler models or smoother model outputs. Fixing the number of basis functions and location vectors is equivalent to setting a point prior on single values of k and µ, that is, M = {Mk }. The flexibility of the model is then controlled through the use of a prior on w. Typically this will be a form of shrinkage prior, p(wk ) ∼ N(wk | 0, λ−1 I), that penalizes large values. The prior is controlled through the precision parameter λ. This has the same effect as the regularization term η described in the previous section for classical RBF networks. The analysis here accounts for the additional uncertainty present in the choice of Mk . We place a proper prior over the whole of RBF model space M of a given basis function type. Originally we looked at using a Poisson prior on k (and a uniform prior on µ). Setting the mean of this Poisson prior to a small value penalizes networks with a large number of basis functions. However, this fails to take account of the amount of smoothing that is already being achieved by the setting of λ in the prior on the output coefficients, and our beliefs really lie on the flexibility of the model rather than the dimension per se. Therefore, rather than placing a prior on the dimension of the model k, we choose to place one on the degrees of freedom (DF) which is a function of k, µ, and λ. The DF of the network is readily computable (see the appendix) and measures the amount of fitting that the model achieves.2 We place a gamma prior on DF so that our prior term in equation 3.2 becomes p(θk , Mk ) = p(wk , µk , k)
(3.4)
= p(wk )p(µk , k) = N(wk | 0, λ−1 I)gamma (DF | α, β), where α and β are prior parameters.3 An advantage of using DF above model dimension is that most users have a good understanding of the concept of DF, and hence subjective beliefs are more easily elicited. For instance, setting 2 In fact, the measure defined in the appendix is an (under)approximation of the true degrees of freedom of our networks. For computational convenience, we treat the location and number of basis functions as independent of the data. It does, however, serve as a useful approximation for the prior ratios we are interested in, as described in section 3.4. 3 The mean of a gamma distribution is α/β, and its variance is α/β 2 .
Bayesian Radial Basis Functions of Variable Dimension
1223
α/β = 1 indicates a preference for modeling the data by the average value of the target series. In d dimensions, setting the mean equal to d+1 indicates a preference for a linear fit. Ideally we should accommodate uncertainty in the values of the prior parameters {α, β, λ}. This is achieved in a hierarchical set-up by placing hyperpriors on the prior parameters. Uncertainty in these hyperpriors should be expressed as further priors. In reality, there is a fast diminishing return in adding further layers to the hierarchical model. We therefore keep the prior parameters fixed throughout. (See section 5 for a discussion of allowing λ to vary.) Finally, we place a vague gamma prior on the precision (inverse variance) of the noise term in equation 3.3, p(σ −2 ) ∼ Ga(10−3 , 10−3 ). To recall, our ultimate aim is to use the densities defined in equation 3.2 for the predictions given in equation 2.2 by integrating over the distributions. Unfortunately, these posterior densities are typically complex and of varying dimension. This makes the integration intractable to analytical methods, and we must turn to approximation methods. The methods we use are described in the next section. 3.3 Markov Chain Monte Carlo and Reversible Jump Samplers. Bayesian inference involves integration, and Markov chain Monte Carlo (MCMC) methods form an important tool for approximating the integrals we are interested in. In this section, we provide a brief overview of this method. (For further details, see Bernardo & Smith, 1994; Tierney, 1994; Smith & Roberts, 1993; and Besag, Green, Higdon, & Mengerson, 1996.) Suppose we wish to make model predictions using an integral of the form given in equation 2.1: Z I=
f (π)p(π | D)dπ.
(3.5)
MCMC methods proceed by drawing samples of π in direct proportion to their probability p(π | D) and then approximating equation 3.5 by I≈
N X 1 f (πt ), N − n0 t=n
(3.6)
0
where N is the total number of samples generated (chain length), and n0 is a “burn-in” period. The burn-in ensures that the Markov chain generating this sample has converged to the stationary distribution of interest p(π | D). The Metropolis-Hastings algorithm provides one example of this technique (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970). To begin, an initial sample point π1 is drawn; this could be from the prior p(π). A candidate for the next point in the chain, πˆ 2 , is then proposed
1224
C. C. Holmes and B. K. Mallick
based on the current point π1 and a randomly generated vector u drawn from a (possibly symmetric) proposal distribution S, giving πˆ 2 = π1 + u,
(3.7)
where u is drawn from S and the hat, πˆ , indicates that π is currently only a proposal for the next state. The proposal distribution S is commonly chosen to be N(0, 6). Generally we can write πˆ t+1 = πt + u. The proposed sample is now “accepted” with probability ¾ ½ p(πˆ t+1 )S(πˆ t+1 , πt ) , α(πt , πˆ t+1 ) = min 1, p(πt )S(πt , πˆ t+1 )
(3.8)
where p(x) is the probability of x, and S(x, x0 ) is the probability of proposing a move from x to x0 . Obviously the proposal ratio, S(x, x0 )/S(x0 , x), equals 1 if S() is symmetric. If the proposed state is accepted, then the next point in the chain, πt+1 , is set to πˆ t+1 ; otherwise it is set to the previous point (πt+1 = πt ). The new sample πt+1 now forms the starting point for the next proposed state, and the algorithm is iterated until a large number of samples have been drawn. Each updated state is just a function of the previous state, and they are hence referred to as Markov chains. Independent samples from the distribution p(π | D) (the distribution of interest) are obtained by first discarding an initial portion of the chain, to ensure the chain has converged, and then taking every nth sample to remove correlations (Tierney, 1994). Until recently MCMC methods were mainly restricted to densities of fixed dimension. However, Green (1995) developed an MCMC method that samples from p(π), where π is of unknown dimension. 3.4 Reversible Jump Radial Basis Functions. Green’s (1995) reversible jump MCMC method allows us to approximate the integral in equation 2.2 when the number of basis functions is unknown. The algorithm proceeds by augmenting the usual proposal step of a conventional Metropolis-Hastings sampler with a number of other possible move types surrounding a change in the dimension of the density. At each iteration, in addition to the possibility of attempting a move within a particular parameter subspace, the sampler can propose to “jump” dimension, either up or down, by adding or removing a basis function from the network. We refer to these jumps as birth and death steps. The probability of attempting a birth or death step when the current state has k basis functions is given by bk and dk , respectively. In this article, we set d1 = 0, bN = 0, and bk = dk = 0.1 for all other values of k, where N is the number of data points. It is common when jumping between dimensions to generate some random vector u that augments the current state π to form π 0 . For a birth step, this vector u is just a datum (position) vector drawn at random from those
Bayesian Radial Basis Functions of Variable Dimension
1225
points that do not already have a basis function located on them. The jump move is accepted with probability ¯ ¯¾ ½ p(π 0 | D)rm (π 0 ) ¯¯ ∂(π 0 ) ¯¯ , α(π, π 0 ) = min 1, p(π | D)rm (π )q(u) ¯ ∂(π, u) ¯
(3.9)
where rm (π) is the probability of choosing a jump of type m when the current state is π and q(u) is the density function of u. The final term, a Jacobian, arises from the change of variables from (π, u) to π 0 . For our radial basis model, we can rewrite equation 3.9 as α({k, µ, w}, {k0 , µ0 , w0 }) = min[1, (likelihood ratio) × (prior ratio) × (proposal ratio)].
(3.10)
The Jacobian is not required because we are drawing our new location vector µ independent of the current parameters. To illustrate this model, we will consider a birth step, from {k, µk , wk } to {k + 1, µk+1 , wk+1 } (death steps are just an inversion of the following ratios). The prior ratio for a birth is p(DFk+1,µk+1 )[(k + 1)!(N − k + 1)!/N!]p(wk+1 ) , p(DFk,µk )[k!(N − k)!/N!]p(wk )
(3.11)
where p(DF) and p(w) are taken from equation 3.4, and the term k!(N−k)!/N! represents the probability of choosing the k basis locations from the N data points. The birth proposal ratio (PR) is given by prob. of death move × prob. proposing wk × prob. of deleting basis prob. of birth move × prob. proposing wk+1 × prob. of creating basis =
dk+1 p(wk | D)/(k + 1) , bk p(wk+1 | D)/(N − k)
(3.12)
where we have used the full conditionals of p(w | D) as the proposal distribution for w (see section 3.5 for details). Finally, the likelihood ratio is taken from equation 3.3. Note that when multiplying the prior and proposal ratios in equation 3.10, the terms involving factorials in equation 3.11 and the (k + 1) and (N − k) terms in equation 3.12 cancel out, as will dk+1 and bk when the number of basis functions is between 1 and N − 1. 3.5 Reversible Jump Algorithm. The algorithm can be written in pseudocode as follows:
1226
C. C. Holmes and B. K. Mallick
Starting with one RBF: 1. Draw the noise variance, σ 2 , from its prior, gamma(σ −2 | 10−3 , 10−3 ). 2. Draw the output coefficients w in a Gibbs sampling step. This uses the full conditionals of w given the data, p(w | D) = N(w | (9 T 9 + λI)−1 9y, σ 2 9 T 9), where λ is the prior precision for w and 9 is the design matrix of outputs from the hidden layer of RBFs and polynomial terms. (See Smith & Roberts, 1993, for details of the Gibbs sampler.) 3. Iterate until convergence is assumed.4 a. Draw a uniform random variable u ∼ U(0, 1). b. Propose the next state of the chain as follows: i. If u is less than bk , then perform BIRTH step. ii. Else if u is less than bk + dk , perform DEATH step. iii. Else perform MOVE step. c. Redraw the coefficients w, as before. d. Draw a uniform random variable u ∼ U(0, 1). i. If u < α, where α is from equation 3.10, then accept the proposed state. ii. Else set the next state to be the current state. e. Draw the noise variance, σ 2 , in a Gibbs sampling step using the full conditionals σ −2 = gamma (σ −2 | 10−3 + N/2, 10−3 + χ 2 /2), where N is the number of data points and χ 2 is the sum of squared residuals for the current model. f. Repeat. The MOVE, BIRTH, and DEATH steps are simple. MOVE selects a basis function at random and resets its location vector to another datum drawn randomly from the data set. BIRTH adds another basis function at a randomly selected point in the data set that does not already contain one. DEATH selects just one basis at random and removes it. The output of the algorithm is a Markov chain that has p(k, w, µ | D) as its stationary distribution. An initial portion of the chain is discarded to ensure convergence, and then every mth sample is used to make predictions using equations 1.2, 2.2, and 3.6.
4 This usually involves inspection of statistics from the Markov chain such as mean values of sampled parameters; see Tierney (1994) for details and heuristics.
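To make the control flow of the birth/death/move loop concrete, here is a minimal Python sketch. It is not the authors' implementation: the marginal posterior terms of equations 3.3 and 3.4 are replaced by a simple penalized Gaussian log-likelihood with the noise variance held fixed rather than Gibbs-sampled, the exact prior and proposal ratios (whose factorial terms cancel, as noted above) are absorbed into a plain Metropolis ratio, and the polynomial terms of the design matrix are reduced to an intercept. All data and parameter choices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, 60))
y = np.sin(2.0 * np.pi * X) + 0.1 * rng.standard_normal(60)

def score(basis, lam=np.exp(-5.0), sigma2=0.01):
    """Penalized Gaussian log-likelihood standing in for the marginal
    posterior terms; one unit per basis stands in for the prior on k."""
    Psi = np.column_stack([np.ones(len(X)),
                           np.abs(X[:, None] - X[basis][None, :]) ** 3])
    w = np.linalg.solve(Psi.T @ Psi + lam * np.eye(Psi.shape[1]), Psi.T @ y)
    resid = y - Psi @ w
    return -0.5 * (resid @ resid) / sigma2 - len(basis)

basis, bk, dk = [10, 30, 50], 0.25, 0.25   # basis = indices of data points
for t in range(3000):
    free = [i for i in range(len(X)) if i not in basis]
    cand, u = list(basis), rng.random()
    if u < bk and free:                    # BIRTH at a random free datum
        cand.append(int(rng.choice(free)))
    elif u < bk + dk and len(basis) > 1:   # DEATH of a random basis
        cand.pop(rng.integers(len(cand)))
    elif free:                             # MOVE a basis to a random free datum
        cand[rng.integers(len(cand))] = int(rng.choice(free))
    # accept/reject, cf. the min(1, ...) rule of equation 3.10
    if np.log(rng.random()) < score(cand) - score(basis):
        basis = cand
print("number of basis functions after sampling:", len(basis))
```

Averaging predictions over the retained states of the chain (after burn-in and thinning) then corresponds to the model averaging described above.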
4 Performance on Some Test Sets

This section illustrates the accuracy of the Bayesian RBF networks on some standard test sets. All the simulations were run with a burn-in period of 5000 iterations of the reversible jump MCMC algorithm followed by 12,000 samples, of which every third is used in the calculations. The value of β in the gamma prior on the degrees of freedom is set to 3, and the α value is set to 10; this indicates a strong preference for simpler models. These values could be adjusted to each data set using an empirical Bayesian approach, which might well improve the accuracy of the networks. We chose not to do so in order to present a unified approach. The prior for the output coefficients is kept vague with a log precision value log λ = -5. Each run took between three and six hours on a DEC Alpha 500, depending on the size of the data set and the complexity of the problem.

4.1 Bivariate Test Functions. We tested the method on five nonlinear functions analyzed by Hwang, Lay, Maechler, Martin, and Schimert (1994) in a study of multilayer perceptrons and projection pursuit. The five functions are as follows:

1. Simple interaction function: f^{(1)}(x_1, x_2) = 10.391[(x_1 - 0.4)(x_2 - 0.6) + 0.36].

2. Radial function: f^{(2)}(x_1, x_2) = 24.234[r^2(0.75 - r^2)], where r^2 = (x_1 - 0.5)^2 + (x_2 - 0.5)^2.

3. Harmonic function: f^{(3)}(x_1, x_2) = 42.659[0.1 + \hat{x}_1(0.05 + \hat{x}_1^4 - 10\hat{x}_1^2\hat{x}_2^2 + 5\hat{x}_2^4)], where \hat{x}_1 = x_1 - 0.5 and \hat{x}_2 = x_2 - 0.5.

4. Additive function: f^{(4)}(x_1, x_2) = 1.3356\{1.5(1 - x_1) + e^{2x_1 - 1}\sin[3\pi(x_1 - 0.6)^2] + e^{3(x_2 - 0.5)}\sin[4\pi(x_2 - 0.9)^2]\}.

5. Complicated interaction function: f^{(5)}(x_1, x_2) = 1.9\{1.35 + e^{x_1}\sin[13(x_1 - 0.6)^2]\, e^{-x_2}\sin(7x_2)\}.

In accordance with Hwang et al.'s approach, 225 data points were generated on the unit square, and the response was calculated by y_i = f(x_{1i}, x_{2i}) + \epsilon_i, where f(\cdot) is the true test function and \epsilon_i is gaussian white noise drawn from a N(0, 0.25^2) distribution. The test set comes from generating 10,000 data points on a 100 by 100 grid over the unit square: [(1/200, 1/200), (3/200, 1/200), . . . , (199/200, 1/200), (1/200, 3/200), . . . , (199/200, 199/200)].

Table 1: Results for Hwang's Test Set.

Function    Classical Cubic   Bayesian Cubic   Classical Tps   Bayesian Tps
Simple          0.0067          0.0060 (11)       0.0094         0.0074 (13)
Radial          0.0049          0.0054 (15)       0.0053         0.0094 (17)
Harmonic        0.0320          0.0455 (41)       0.0932         0.0576 (49)
Additive        0.0210          0.0220 (33)       0.0240         0.0232 (32)
Complex         0.0369          0.0413 (33)       0.0380         0.0447 (41)

Note: The accuracy of the two methods differs by over 0.5% of the variance in only 3 of the 10 tests.

Table 1 lists the results in units of fraction of variance unexplained (FVU), which is given by
\text{FVU} = \frac{E[\hat{f}(x) - f(x)]^2}{E[f(x) - \bar{f}\,]^2},
(4.1)
where \hat{f}(x) is the model's prediction, f(x) is the true value of the function, and \bar{f} is the mean of the true function over the test set. We compare our method with a classical RBF network using 225 basis functions (one at every data point); the regularization term for the classical model was set using cross-validation. We tested both cubic and thin-plate splines (Tps) on the data. The mode of the number of basis functions in the MCMC chain is included in parentheses alongside the error value in Table 1. The mode of the chain gives some indication of the region of the marginalized p(k) to which the MCMC chain converged, and the modes in brackets indicate that the model adjusts its dimension to the complexity of the problem. Both methods perform well on these data sets, accounting for over 95% of the variance in all but one test. The results are very similar, differing by more than 1% of the variance unexplained only on the harmonic test set. However, the Bayesian RBF models use substantially fewer basis functions.

4.2 Robot Arm Data Set. This next task compares the Bayesian RBF model with other Bayesian neural network approaches. The problem is to model the mapping of a two-dimensional "joint angle" (x_1, x_2) to the end arm position (y_1, y_2). The true relationship is given by

y_1 = 2.0\cos(x_1) + 1.3\cos(x_1 + x_2) + \epsilon_1,
y_2 = 2.0\sin(x_1) + 1.3\sin(x_1 + x_2) + \epsilon_2,   (4.2)

where \epsilon_i \sim N(0, \sigma^2), \sigma = 0.05. The data set can be found at MacKay's web site.5 Neal (1996) compares his Bayesian multilayer perceptron, computed using a hybrid MCMC (Duane, Kennedy, Pendleton, & Roweth, 1987), with that of MacKay, who chooses to approximate the integral in the predictive expectations using gaussians fitted at local modes. The results are presented in Table 2; again the mode of the number of basis functions in the chain is given in parentheses next to the error value. The Bayesian RBF network appears slightly more accurate than Bayesian multilayer perceptrons on this data set.

Table 2: Results for Robot Arm Data Set.

Method                                                       Average Squared Error
Gaussian approximation method of MacKay:
  solution with highest evidence                                  0.00573
  solution with lowest test error                                 0.00557
Hybrid MCMC of Neal, 150 supertransitions (best of three runs)    0.00554
Neal's method, 30 supertransitions (best of three runs)           0.00557
Bayesian RBF, cubic                                               0.00379 (24)

4.3 Sunspots. The Wolf sunspot time series has served as a benchmark data set for a number of statistical models (Weigend et al., 1992; Tong & Lim, 1980). The data represent 280 yearly averages of sunspots (dark patches on the sun) recorded between the years 1700 and 1979. Weigend trained a multilayer perceptron (MLP) using the records from 1700 to 1920, and the model was evaluated on two test sets representing the years 1921-1955 and 1956-1979. The data sets are generated by lagging 12 years of values as inputs and using the next year's sunspot value as a target; that is, the series is modeled as an autoregressive AR(12) process. To compare our Bayesian models with that of Weigend, we tested multiquadrics (Mq) with the basis parameter c set to 0.5. The results are presented in Table 3, in terms of FVU, alongside those of a classical model using 220 basis functions.
5 http://wol.ra.phy.cam.ac.uk/mackay/.
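For reference, the two ingredients used repeatedly in this section, the FVU of equation 4.1 and the 12-lag embedding of the sunspot series, are easy to reproduce. The following Python sketch assumes a plain array of the 280 yearly means; the file name is hypothetical.

```python
import numpy as np

def fvu(y_pred, y_true):
    """Fraction of variance unexplained, equation 4.1."""
    return np.mean((y_pred - y_true) ** 2) / np.mean((y_true - np.mean(y_true)) ** 2)

def lag_embed(series, p=12):
    """AR(p) embedding: p lagged values as inputs, the next value as target."""
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    return X, series[p:]

sunspots = np.loadtxt("sunspots.txt")        # hypothetical file of yearly means, 1700-1979
X_train, y_train = lag_embed(sunspots[:221])  # records up to 1920
```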
Table 3: Results for Sunspot Data Set.

Method          Mode p(k)   Training Error,   First Test Set,   Second Test Set,
                            1713-1920         1921-1955         1956-1979
Weigend's MLP     N.A.        0.082             0.086             0.35
Bayesian Mq        54         0.078             0.089             0.249
Classical Mq      220         0.073             0.096             0.275
5 Summary

A Bayesian approach to RBFs has been presented in which the dimension (the number of basis functions) is itself treated as unknown. Using a specially constructed Markov chain based on reversible jumps, we are able to draw inferences on the dimension of the model. Predictions are made by averaging over many models of varying dimension and basis location. By examining the marginal distribution of the number of basis functions, we see that the networks automatically adjust their size to the complexity of the problem.

We have chosen to keep λ, the prior precision (regularization parameter) for the output coefficients, fixed at a small value and to mix over the dimension and position of the basis functions. Alternatively, it is straightforward to fix the dimension and mix over the regularization parameter by placing a hyperprior on λ, or to allow both dimension and regularization to vary. An advantage of the reported method is that clamping the prior precision to a small value yields smaller models, and so computational savings are made during the simulations.

Other amendments to the approach could include sampling from the input dimension during the jump steps. This would be appropriate if we had knowledge suggesting that some covariates might be irrelevant. The marginal densities, giving the posterior probability of each covariate, are readily obtainable from the final chain.

Appendix

Consider a model S that smooths target values y given input values x. This action is said to be linear if S(ay_1 + by_2 | x) = aS(y_1 | x) + bS(y_2 | x) for any constants a and b (Hastie & Tibshirani, 1990). As such, classical RBF networks are linear smoothers: the model's output \hat{f} at points x_1, . . . , x_n can be written in matrix notation as \hat{f} = Sy, where S is an n × n smoother matrix.
To see this, let Ψ denote the design matrix given by the output of the basis functions (hidden layer) at the data points x_1, . . . , x_n, and let w denote the final-layer output coefficients for the network. Then we can write the least-squares approximation as

w = (\Psi^\top \Psi + \lambda I)^{-1} \Psi^\top y,
(A.1)
where λ is the regularization parameter (or prior precision in a Bayesian setting). Point forecasts are made by \hat{f} = \Psi w, which from equation A.1 can be written as \hat{f} = \Psi(\Psi^\top \Psi + \lambda I)^{-1} \Psi^\top y, and hence S = \Psi(\Psi^\top \Psi + \lambda I)^{-1} \Psi^\top. In accordance with Hastie and Tibshirani (1990), we define the DF of our RBF networks as

\text{DF} = \text{tr}(S).
(A.2)
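As an illustration, the smoother matrix and its DF for a one-dimensional cubic RBF fit can be computed directly. This is a sketch with made-up data, and it omits the polynomial terms of the full design matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 50))
Psi = np.abs(x[:, None] - x[None, :]) ** 3    # cubic basis |x - mu|^3, one basis per datum
lam = np.exp(-5.0)                            # prior precision, log lambda = -5

# S = Psi (Psi' Psi + lam I)^{-1} Psi', as derived above
S = Psi @ np.linalg.solve(Psi.T @ Psi + lam * np.eye(len(x)), Psi.T)
df = np.trace(S)                              # DF = tr(S), equation A.2
print(f"degrees of freedom: {df:.2f} (with {len(x)} basis functions)")
```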
The trace is the sum of the eigenvalues of S and gives a measure of the amount of fitting that S, the expected smoothing matrix of our model, achieves.

Acknowledgments

We acknowledge the helpful comments of D. Denison and two anonymous referees regarding this work. C. C. H. was assisted by an EPSRC research award and sponsored by the Water Research Centre, Swindon, United Kingdom.

References

Bernardo, J., & Smith, A. (1994). Bayesian theory. New York: Wiley.
Besag, J., Green, P., Higdon, D., & Mengersen, K. (1996). Bayesian computation and stochastic systems. Stat. Sci., 10, 3-66.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Broomhead, D. S., & Lowe, D. (1988). Multivariate functional interpolation and adaptive networks. Complex Systems, 2, 321-355.
Denison, D., Mallick, B., & Smith, A. F. M. (in press). Bayesian M.A.R.S. Statistics and Computing.
Draper, D. (1995). Assessment and propagation of model uncertainty (with discussion). J. Royal Stat. Soc. B, 57, 45-98.
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195, 216-222.
Franke, R. (1982). Scattered data interpolation: Tests of some methods. Mathematics of Computation, 38, 181-200.
Genest, C., & Zidek, J. V. (1986). Combining probability distributions: A critique and annotated bibliography. Statistical Science, 1, 114-148.
Girosi, F. (1994). Regularization theory, radial basis functions and networks. In V. Cherkassky, F. Friedman, & H. Wechsler (Eds.), From statistics to neural networks. Berlin: Springer-Verlag.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks. Neural Computation, 7, 219-269.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalised additive models. London: Chapman & Hall.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Hwang, J., Lay, S., Maechler, R., Martin, D., & Schimert, J. (1994). Regression modeling in back-propagation and projection pursuit learning. IEEE Trans. Neural Networks, 5, 342-353.
Jacobs, R. A. (1995). Methods for combining experts' probability assessments. Neural Computation, 7, 867-888.
Key, J. (1996). Studies of a simulation approach to Bayesian model comparison. Unpublished Ph.D. dissertation, Department of Mathematics, Imperial College, London.
Lowe, D. (1995). On the use of nonlocal and nonpositive definite basis functions in radial basis function networks. In Proc. 4th Int. Conf. Artificial Neural Networks (pp. 206-210).
MacKay, D. (1992a). Bayesian interpolation. Neural Computation, 4, 415-447.
MacKay, D. (1992b). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448-472.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machines. J. Chem. Phys., 21, 1087-1091.
Min, C., & Zellner, A. (1993). Bayesian and non-Bayesian methods for combining forecasts with applications to forecasting international growth rates. J. Econometrics, 56, 89-118.
Neal, R. (1996). Bayesian learning for neural networks. Berlin: Springer-Verlag.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213-225.
Powell, M. J. D. (1987). Radial basis functions for multivariate interpolation: A review. In J. C. Mason & M. G. Cox (Eds.), Algorithms of approximation (pp. 143-167). Oxford: Clarendon Press.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59, 731-792.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Roberts, S., & Tarassenko, L. (1994). A probabilistic resource allocating network for novelty detection. Neural Computation, 6, 270-284.
Smith, A. F. M., & Roberts, G. O. (1993). Bayesian computation via Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Statist. Soc. B, 55, 2-24.
Tierney, L. (1994). Markov chains for exploring posterior distributions. Ann. Statist., 22, 1701-1762.
Tong, H., & Lim, K. S. (1980). Threshold autoregression, limit cycles and cyclical data. J. R. Statist. Soc. B, 42, 245-292.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation and signal processing. In M. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 281-287). Cambridge, MA: MIT Press.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Weigend, A., Huberman, B. A., & Rumelhart, D. E. (1992). Predicting sunspots and exchange rates with connectionist networks. In C. Casdagli & S. Eubank (Eds.), Nonlinear modeling and forecasting. Reading, MA: Addison-Wesley.
Wolpert, D. H. (1992). Stacked generalization. Neural Networks, 5, 241-259.
Yingwei, L., Sundararajan, N., & Saratchandran, P. (1997). A sequential learning scheme for function approximation using minimal radial basis function neural networks. Neural Computation, 9, 461-478.

Received March 26, 1997; accepted October 7, 1997.
LETTER
Communicated by Andreas Herz
Absence of Cycles in Symmetric Neural Networks

Xin Wang
Computer Science Department, UCLA, Los Angeles, CA 90024, U.S.A.

Arun Jagota
University of California, Santa Cruz, CA 94704, U.S.A.

Fernanda Botelho
Max Garzon
Department of Mathematical Sciences, University of Memphis, Memphis, TN 38152, U.S.A.
For a given recurrent neural network, a discrete-time model may have asymptotic dynamics different from the one of a related continuous-time model. In this article, we consider a discrete-time model that discretizes the continuous-time leaky integrator model and study its parallel, sequential, block-sequential, and distributed dynamics for symmetric networks. We provide sufficient (and in many cases necessary) conditions for the discretized model to have the same cycle-free dynamics of the corresponding continuous-time model in symmetric networks. 1 Introduction For an n-neuron recurrent network, a much-studied and widely used continuous-time (CT) model is the leaky integrator model (Hertz, Krogh, & Palmer, 1991; Hopfield, 1984), given by a system of nonlinear differential equations: n X dxi = −xi + σi wij xj + Ii , τi dt j=1
t ≥ 0, i = 1, . . . , n,
(1.1)
and a related discrete-time (DT) version is the sigmoidal model (Hopfield, 1982; Marcus & Westervelt, 1989), specified by a system of nonlinear
This article is an expanded version of a paper presented at the Neural Information Processing Systems 1995 conference.
difference equations:

x_i(t+1) = \sigma_i\left(\sum_{j=1}^{n} w_{ij} x_j(t) + I_i\right), \qquad t = 0, 1, \ldots, \quad i = 1, \ldots, n,   (1.2)
where x_i(t), taking values in a compact interval [a, b], represents the state of neuron i at time t, τ_i is the time constant, W = [w_ij] is the real-valued weight matrix, σ_i : ℝ → [a, b] is the activation function, which often takes a sigmoidal form (in which case it has a gain µ_i > 0), and I_i is the constant external input to neuron i. When the network is symmetric (W is symmetric), the dynamics of both models are well understood: the CT model (see equation 1.1) is always convergent, in that every initial state approaches a fixed point asymptotically (Cohen & Grossberg, 1983; Hirsch, 1989; Hertz et al., 1991; Hopfield, 1984), and the DT model (see equation 1.2) is either convergent or approaches a periodic orbit of period 2, that is, a two-cycle (Goles, Fogelman-Soulie, & Pellegrin, 1985; Marcus & Westervelt, 1989; Koiran, 1994; see also Peretto, 1984). (For results and analyses of fixed points and cycles in networks that are not necessarily symmetric, see Brown, 1992; Bruck, 1990; Goles, 1986; Xu & Kwong, 1995.) However, for a given symmetric network (n, W, σ_i, I_i), the existence of possible two-cycles in its discrete-time operation is sometimes troublesome and undesirable, especially in associative memory and neural optimization applications where only fixed points are used to represent memory patterns (Hopfield, 1982) or encode feasible solutions (Hertz et al., 1991).

A great deal of work on the asymptotic behavior of equation 1.2 has focused on constraining the dynamics or parameters of the model so that it exhibits only convergent dynamics. Originally (Hopfield, 1982), a type of sequential dynamics (in which only one randomly chosen neuron updates its state at any time) had to be employed in order to ensure the convergent dynamics of equation 1.2. For the parallel dynamics case, it was shown in Goles et al. (1985) that for σ_i equal to the ±1 signum function, if W is positive definite on the set {-1, 0, 1}^n, then the model in equation 1.2 converges only to fixed points. (See also Peretto, 1984.) In Marcus and Westervelt (1989), a related parallel dynamics result was obtained: when the σ_i are differentiable, sigmoidal functions, if W + M^{-1} is positive definite, then the model in equation 1.2 converges only to fixed points (see also Marcus, Waugh, & Westervelt, 1990; Waugh & Westervelt, 1993). Here M is a diagonal matrix of the neuron gains µ_i.

All these previous results impose somewhat stringent restrictions on the network. The first requires the use of sequential dynamics. The second requires W to be positive definite, which is often not the case. The third requires either W to be positive definite or the neuron gains µ_i to be sufficiently small. This can be a problem in some applications where one has little control over W (other than its symmetry) or over the µ_i. For example,
many mappings of optimization problems result in W's that are symmetric but not necessarily positive definite. Furthermore, if one is to use the mean-field annealing dynamical algorithm on the resulting network (an algorithm that often works well), one expects to use a wide range of values for the µ_i. More generally, changing µ_i in general affects the fixed points of the network, and this can be a problem.

This article studies another DT model that achieves convergence to fixed points using milder conditions. The new DT model generalizes equation 1.2 with some new parameters (these help to relax the restrictions). For symmetric networks, this model has the same types of parallel and sequential dynamics as equation 1.2, but now convergence to fixed points may be achieved by imposing certain conditions on only the new parameters rather than on the µ_i. The (admissible) settings of these new parameters, furthermore, do not affect the fixed points of the network. Thus, not only are there more ways to achieve convergence (either W is positive definite, or the neuron gains µ_i are sufficiently small, or the new parameters have certain values), but the new parameters are also easier to constrain than the others (W, µ_i). Moreover, with these new parameters interpreted as bifurcation parameters, the existence of possible two-cycles can be understood in this model as resulting from a possible period-doubling bifurcation when the parameters are varied. Finally, it is this model, rather than equation 1.2, that is used more often in practice as a discrete-time approximation of equation 1.1. Based on all of the above, the DT model studied here is a more appropriate discrete-time model of neural networks for purposes of theoretical investigation, numerical simulation, and practical application.

2 A Discrete-Time Model

The DT model studied in this article is

x_i(t+1) = \left(1 - \frac{\alpha_i}{\tau_i}\right) x_i(t) + \frac{\alpha_i}{\tau_i}\, \sigma_i\left(\sum_{j=1}^{n} w_{ij} x_j(t) + I_i\right), \qquad t = 0, 1, \ldots, \quad i = 1, \ldots, n,
(2.1)
where the α_i are newly introduced parameters, taking values in (0, τ_i]. We shall assume henceforth that all τ_i = 1, by the transformation α_i ← α_i/τ_i; thus, α_i ∈ (0, 1]. This DT model is based on the Euler discretization of the CT model in equation 1.1, approximating dx_i(t)/dt by the difference (x_i(t + α_i) - x_i(t))/α_i, and it takes the model in equation 1.2 as its special case of all α_i = 1. The new neuron state x_i(t+1) is now a linear combination of the activation function value σ_i(Σ_{j=1}^{n} w_{ij} x_j(t) + I_i) and the old state x_i(t). Because α_i ∈ (0, 1], the model in equation 2.1 is well defined, in that the iterative
maps resulting from the model,

F_i(x) = (1 - \alpha_i)\, x_i + \alpha_i\, \sigma_i\left(\sum_{j=1}^{n} w_{ij} x_j + I_i\right),   (2.2)
preserve neuron states in the compact interval [a, b]. For the purposes of this article, neuron activation functions σ_i are assumed to satisfy the following constraints:

1. σ_i have continuous first-order derivatives σ_i'(y) for all y ∈ ℝ.
2. σ_i are monotone increasing, with σ_i'(y) > 0.
3. σ_i'(y) → 0 as y → ±∞.
4. σ_i'(y) take maximal values µ_i, possibly infinite, which are usually referred to as neuron gains.

Such functions are fairly general, including the often-used [-1, 1]- and [0, 1]-sigmoids, such as tanh(µ_i y), (2/π) tan^{-1}(πµ_i y/2), and 1/(1 + e^{-µ_i y}). The constraints on the σ_i are sufficient for the functions defined by

G_i(x_i) = \int_0^{x_i} \sigma_i^{-1}(y)\, dy   (2.3)
to have the following properties, which will be used subsequently in the proofs of several propositions of this article:

1. G_i'(y) = σ_i^{-1}(y); in particular, G_i'(Δx_i(t)/α_i + x_i(t)) = Σ_j w_{ij} x_j(t) + I_i.
2. G_i(y) - G_i(z) ≤ G_i'(y)(y - z) - (1/(2µ_i))(y - z)^2 ≤ G_i'(y)(y - z), with equality holding only when y = z.
3. G_i''(y) = 1/σ_i'(σ_i^{-1}(y)).
4. G_i'(y_0 + y_1) - G_i'(y_0) ≥ (min_z G_i''(z))\, y_1 = y_1/µ_i.

Here and in the rest of the article, Δx_i(t) = x_i(t + 1) - x_i(t).

3 Parallel and Sequential Dynamics

In parallel dynamics, also called synchronous dynamics, all neurons update their states in each time step. In sequential dynamics, a single neuron updates its state in each time step, in such a way that each neuron updates its state infinitely many times over all time steps t. The most widely studied special case of sequential dynamics is called asynchronous dynamics (Hopfield, 1982), in which the neuron whose state is updated is chosen at random. This models the asynchronous evolution of a neural network circuit composed of autonomous neurons.
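As an aside, equation 2.1 is straightforward to iterate numerically under either dynamics. The Python sketch below assumes σ_i(y) = tanh(µ_i y) (so [a, b] = [-1, 1]) and a random symmetric W chosen only for illustration; the convergence guarantees of the propositions that follow apply only when their conditions on the w_ii or on W hold.

```python
import numpy as np

def dt_step(x, W, I, alpha, mu, active):
    """One step of equation 2.1 (tau_i = 1) for the neurons in `active`;
    parallel dynamics updates all of them, sequential dynamics just one."""
    u = W @ x + I                           # net inputs from the current state
    x = x.copy()
    x[active] = (1.0 - alpha[active]) * x[active] \
                + alpha[active] * np.tanh(mu[active] * u[active])
    return x

rng = np.random.default_rng(1)
n = 8
W = rng.standard_normal((n, n)); W = 0.5 * (W + W.T)   # symmetric weights
x = rng.uniform(-1.0, 1.0, n)
alpha, mu, I = np.full(n, 0.5), np.ones(n), np.zeros(n)
for t in range(5000):                       # sequential (asynchronous) dynamics:
    x = dt_step(x, W, I, alpha, mu,         # one randomly chosen neuron per step
                active=np.array([rng.integers(n)]))
```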
It is easy to see that the discretized DT model in equation 2.1 shares the same set of fixed points with the CT model in equation 1.1; that is, a point x^* is a fixed point of equation 2.1 (i.e., x_i^* = F_i(x^*), with F_i given in equation 2.2) if and only if it is a fixed point of equation 1.1 (i.e., -x_i^* + σ_i(Σ_j w_{ij} x_j^* + I_i) = 0). However, as a result of discretization, fixed points may have different asymptotic stability (Wang & Blum, 1992), and periodic points that are not fixed points may occur (Blum & Wang, 1992; Marcus & Westervelt, 1989) in the DT model, especially when all α_i = 1. Nevertheless, the discretized DT model retains the same type of global sequential dynamics as equation 1.2, as stated in the following proposition. This result extends the result for all α_i = 1 in Marcus and Westervelt (1989) to α_i ∈ (0, 1].

Proposition 1. If W is symmetric with all w_ii > -(2 - α_i)/(α_i µ_i), the DT model (see equation 2.1) has sequential dynamics that is convergent to fixed points for any α_i ∈ (0, 1].

Proof. Consider the energy, or Lyapunov, function used in Hopfield (1984) and Marcus and Westervelt (1989):

L(t) = -\frac{1}{2}\sum_{i,j} w_{ij}\, x_i(t)\, x_j(t) - \sum_i I_i\, x_i(t) + \sum_i G_i(x_i(t)).
(3.1)
For the DT model in equation 2.1, the one-step change ΔL(t) = L(t + 1) - L(t) is

\Delta L(t) = -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i \Big(\sum_j w_{ij} x_j(t) + I_i\Big)\Delta x_i(t) + \sum_i \big[G_i(x_i(t+1)) - G_i(x_i(t))\big]
 = -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i \sigma_i^{-1}\big(\Delta x_i(t)/\alpha_i + x_i(t)\big)\,\Delta x_i(t) + \sum_i \big[G_i(x_i(t+1)) - G_i(x_i(t))\big].

Following Marcus and Westervelt (1989), by Taylor's approximation theorem, and using G_i''(\xi) \ge 1/\mu_i for any \xi,

G_i(x_i(t+1)) - G_i(x_i(t)) \le G_i'(x_i(t+1))\,\Delta x_i(t) - \frac{1}{2\mu_i}[\Delta x_i(t)]^2.

Then

\Delta L(t) \le -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i G_i'\big(\Delta x_i(t)/\alpha_i + x_i(t)\big)\,\Delta x_i(t) + \sum_i \Big[G_i'(x_i(t+1))\,\Delta x_i(t) - \frac{1}{2\mu_i}[\Delta x_i(t)]^2\Big]
 = -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i \big[G_i'(\Delta x_i(t)/\alpha_i + x_i(t)) - G_i'(x_i(t+1))\big]\,\Delta x_i(t) - \frac{1}{2}\sum_i \frac{1}{\mu_i}[\Delta x_i(t)]^2
 \le -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i \frac{1-\alpha_i}{\alpha_i\mu_i}[\Delta x_i(t)]^2 - \frac{1}{2}\sum_i \frac{1}{\mu_i}[\Delta x_i(t)]^2
 = -\frac{1}{2}\sum_{i,j} w_{ij}\,\Delta x_i(t)\,\Delta x_j(t) - \sum_i \frac{2-\alpha_i}{2\alpha_i\mu_i}[\Delta x_i(t)]^2.
(3.2)
If at time t only neuron i is chosen to update its state and all the others remain unchanged, then

\Delta L(t) \le -\frac{1}{2}\, w_{ii}\, [\Delta x_i(t)]^2 - \frac{2-\alpha_i}{2\alpha_i\mu_i}[\Delta x_i(t)]^2 \le 0,
(3.3)
which implies that L(t) is decreasing if Δx_i(t) ≠ 0. Hence, any sequential trajectory tends to some fixed point.

4 Global Convergence

Call a model of a neural network cycle free if it is globally convergent to fixed points only. The following proposition provides a condition that eliminates the possible "spurious" periodic dynamic behaviors of the discretized DT model in equation 2.1.

Proposition 2. If W is symmetric, a sufficient condition for equation 2.1 to be cycle free in parallel dynamics is that the matrix

W + (2I - A)A^{-1}M^{-1} \text{ is positive definite,}
(4.1)
where A = diag(α_i) and M = diag(µ_i) are the diagonal matrices formed by the parameters α_i and the neuron gains µ_i.

Proof. Use the energy, or Lyapunov, function L(t) from the proof of proposition 1. From equation 3.2, the first-order difference ΔL(t) of L(t)
along any trajectory x(t) has the upper bound

\Delta L(t) \le -\frac{1}{2}\,\Delta x(t)^{\top}\big(W + (2I - A)A^{-1}M^{-1}\big)\,\Delta x(t).
(4.2)
The condition in equation 4.1 implies that the upper bound is negative, and hence the parallel dynamics is globally convergent.

In the simple case where all gains µ_i = 1 (e.g., σ_i(z) = tanh(z)) and all α_i = α, this proposition says that the model is cycle free if the matrix W + [(2 - α)/α]I is positive definite. The sufficient condition in equation 4.1 generalizes many existing conditions for cycle-free dynamics in the literature. When all α_i = 1, it reduces to the condition that the matrix W + M^{-1} is positive definite, which is the one presented in Marcus and Westervelt (1989) (with all R_i = 1 in its model) for the DT model (in equation 1.2) to be cycle free. Moreover, when µ → ∞, the sigmoidal functions tend to the signum function. If in this case α_i ≥ ε for some fixed positive ε, the condition in equation 4.1 reduces to the condition that the weight matrix W be positive definite, which is the one in Goles et al. (1985), except that in the latter case, W need be positive definite only on the set {-1, 0, 1}^n.

From the point of view of applications, this has the following consequences. Consider first the case of signum neurons. For cycle-free parallel operation, the matrix W must be positive definite. In several applications, for instance, in many optimization problems, this is not possible to achieve. One may turn to using sigmoidal neurons instead, with α_i = 1. This has the advantage that even if W is not positive definite, cycle-free operation may be ensured by making the neuron gains µ_i sufficiently small. Unfortunately, in some applications, such control over neuron gains is not possible or desirable. For example, in the mean-field annealing algorithm, widely used for optimization, one usually employs a wide range of neuron gains. At the other extreme, α_i → 0, with sigmoidal neurons, convergence to fixed points is assured for arbitrary symmetric W and for arbitrary nonnegative µ_i. In practical computation, however, the α_i need to be strictly positive. Our results reveal, given arbitrary symmetric W and arbitrary nonnegative neuron gains µ_i, how large we can make the α_i while ensuring cycle-free parallel operation. When the α_i are sufficiently small, the matrix in equation 4.1 will be dominated by its positive diagonal entries and become positive definite. In fact:

Corollary 1. Let λ_min be the minimum eigenvalue of the symmetric weight matrix W. If
(i) λmin > 0 (i.e., W is positive definite itself) and αi are arbitrary in (0, 1], or
(ii) λ_min ≤ 0 and the α_i satisfy 2 - α_i > -λ_min α_i µ_i,

then the model in equation 2.1 is cycle free.

Proof. Let W = P^\top \Lambda P be an orthogonal decomposition of W; that is, \Lambda is a diagonal matrix formed by the eigenvalues of W, and P is an orthogonal matrix with transpose P^\top = P^{-1}. The condition in equation 4.1 is equivalent to the requirement that the diagonal matrix \Lambda + (2I - A)A^{-1}M^{-1} be positive definite. The latter condition is fulfilled under either condition i or condition ii. The conclusion then follows from proposition 2.

This corollary implies that if the weight matrix W is formed according to the Hebb rule as constructed in Hopfield (1982), then the model is cycle free. This is because W is an outer product, W = VV^\top - mI, of a collection of "memory" vectors V = [v_1, . . . , v_m], and it is positive definite.

5 Local Asymptotic Stability

When all α_i = α, the condition of equation 4.1 in proposition 2, that the matrix W + [(2 - α)/α] M^{-1} is positive definite, is also the one given in Wang and Blum (1992) to ensure consistency of the DT model (in equation 2.1) with the CT model (in equation 1.1) on local asymptotic dynamics around fixed points for symmetric networks. Consistency here means that any fixed point has exactly the same asymptotic stability in both equations 1.1 and 2.1: if the two models are consistent in this regard, a fixed point is an attractor (saddle point or repellor, respectively) of equation 1.1 if and only if it is an attractor (saddle point or repellor) of equation 2.1 (see Wang & Blum, 1992, for details). This answers the issue raised in Marcus and Westervelt (1989) of why a stable fixed point of equation 1.1 is also stable in equation 1.2, when a specific version of the condition in equation 4.1 is met. For symmetric networks, the consistency condition on the local asymptotic dynamics between the CT and DT models thus turns out to be a consistency condition between them on the global convergent dynamics as well. It is certainly interesting to ask whether this type of relationship between the local and global consistencies can be extended to general (nonsymmetric) networks.
6 Period-Doubling Bifurcation

In many cases, the condition of equation 4.1 in proposition 2 is also necessary for the network to be cycle free. This can be addressed from a bifurcation point of view by treating the parameters α_i as bifurcation parameters. Essentially, the condition leaves no room for the period-doubling bifurcation that is the source of possible two-cycles.

Proposition 3. Let the activation functions σ_i be symmetric, that is, σ_i : ℝ → [-a, a], and satisfy

σ_i(0) = 0, \qquad σ_i'(0) = µ_i.

Let the external bias vector I = 0. Then the condition in equation 4.1 is also a necessary condition for the network to be cycle free.

Proof.
Define
C = \{(\alpha_1, \ldots, \alpha_n) \mid W + (2I - A)A^{-1}M^{-1} \text{ is positive definite}\}.

Let C_i denote the projection of the ith components of the n-tuples in C. Because α < α' ∈ C_i implies α ∈ C_i, each C_i is either the entire interval (0, 1] or an open interval (0, c_i^0) for some 0 < c_i^0 < 1. Notice that 0 is a fixed point of the network. The Jacobian of the iterative maps in equation 2.2 at the fixed point 0 is

(I - A) + AMW.
(6.1)
Notice that the condition in equation 4.1 is equivalent to the eigenvalues of (2I - A) + AMW all being positive, which is in turn equivalent to the Jacobian in equation 6.1 having all eigenvalues λ > -1. If C = (0, 1]^n, the model has no cycles, according to proposition 2, for any (α_1, . . . , α_n) ∈ (0, 1]^n. However, if C_i = (0, c_i^0) for some i with c_i^0 < 1, some eigenvalue of the Jacobian (see equation 6.1) becomes less than -1 when α_i exceeds the "threshold" c_i^0. In the course of changing α_i, the network generically undergoes a period-doubling bifurcation (Ruelle, 1989), resulting in the emergence of two-cycles. Thus, in this case, the condition of equation 4.1 in proposition 2 is also necessary to prevent this type of period-doubling bifurcation from happening around fixed points and hence to eliminate the possibility of generating two-cycles.
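The threshold c_i^0 can be seen numerically: with I = 0 and odd sigmoids, x* = 0 is a fixed point, and a two-cycle becomes possible once an eigenvalue of the Jacobian in equation 6.1 crosses -1. A sketch with a random symmetric W and equal unit gains (both made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
W = rng.standard_normal((n, n)); W = 0.5 * (W + W.T)
mu = 1.0                                   # equal gains, sigma_i = tanh
for alpha in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    J = (1.0 - alpha) * np.eye(n) + alpha * mu * W   # equation 6.1 with A = alpha I
    lam = np.linalg.eigvalsh(J).min()      # J is symmetric here since M = mu I
    print(f"alpha = {alpha:.1f}: min eigenvalue {lam:+.3f}"
          + ("  -> period doubling possible" if lam < -1.0 else ""))
```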
Examples of σ_i satisfying the hypotheses of proposition 3 are tanh_{µ_i} : ℝ → [-1, 1] with tanh_{µ_i}(z) = tanh(µ_i z).

7 Effect of Neuron Gains in Neural Computations

Considerable research has been conducted on using equation 1.1 in neural computations, such as solving optimization problems approximately (see Hertz et al., 1991, chap. 4, for an overview). Often the neuron gains µ_i are also modified while the network is evolving. A popular algorithm of this kind uses mean-field annealing (MFA) (Peterson & Anderson, 1988) to solve optimization problems, in which small neuron gains are used initially and increased gradually. Similar situations also occur in some learning algorithms. In practice, a discretized model such as equation 2.1 is used instead. Proposition 2 gives some criteria on how to choose the "discretization step-sizes" α_i as functions of the µ_i. If efficiency, for example, were the paramount consideration, one might want to choose the α_i as large as possible while ensuring that the sufficient condition of proposition 2 is met.

The effect of changing µ on the largest sufficing α can be examined as follows. For simplicity, consider the case where all neuron gains µ_i equal µ and all α_i equal α. Let c_1 be the supremum of α such that W + [(2 - α)/(αd_1)]I is positive definite, and c_2 the supremum of α such that W + [(2 - α)/(αd_2)]I is positive definite. (Here d_1 and d_2 play the role of two values of the neuron gain µ.) Then c_1 and c_2 satisfy (2 - c_1)/(c_1 d_1) = (2 - c_2)/(c_2 d_2). Letting β = d_2/d_1, the previous equation gives

c_2 = 2c_1/(c_1 + β(2 - c_1)).
(7.1)
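In concrete terms, for equal gains µ the largest admissible α follows directly from λ_min: the bound of corollary 1(ii) gives α < 2/(1 - λ_min µ) when λ_min ≤ 0. A minimal sketch of the resulting annealing-style schedule (the example weights are made up):

```python
import numpy as np

def alpha_sup(W, mu):
    """Supremum of a uniform alpha in (0, 1] allowed by corollary 1
    for equal gains mu (a sketch; the bound is strict)."""
    lam_min = np.linalg.eigvalsh(W).min()
    if lam_min > 0.0:                       # W positive definite: any alpha works
        return 1.0
    return min(1.0, 2.0 / (1.0 - lam_min * mu))

W = np.array([[0.0, 1.0], [1.0, 0.0]])      # symmetric, lambda_min = -1
for mu in (0.2, 1.0, 5.0, 25.0):            # gains raised as in mean-field annealing
    print(f"mu = {mu:5.1f}: alpha must stay below {alpha_sup(W, mu):.3f}")
```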
Clearly, c_2 decreases as the ratio β grows (roughly as 1/β for large β). Thus, when µ is small, α can be taken larger than when µ is large. This may be used to evolve the network efficiently in the beginning and slow it down later, while ensuring that two-cycles are never encountered.

8 Block-Sequential Dynamics

In block-sequential dynamics, the neurons are partitioned into blocks, fixed in advance, that are updated sequentially, with neurons within a block updated in parallel. This allows control over the amount of parallelism that can be introduced in the evolution of the network without admitting cycles. One extreme corresponds to the sequential dynamics; the other corresponds to the parallel dynamics. Block-sequential dynamics was studied by Goles et al. (1985; Goles, 1982) for the evolution of equation 1.2 with σ_i as the signum function. Their results were generalized to the sigmoidal case in Herz and Marcus (1993).
Here we study this dynamics for the operation of equation 2.1 using arbitrary σ_i satisfying constraints 1-4 of section 2. We obtain a single result for the cycle-free operation of equation 2.1 under block-sequential dynamics, which gives propositions 1 and 2 as corollaries at the extremes of the block sizes.

In block-sequential operation, the state vector x is partitioned into m blocks, x^\top = [X_1, . . . , X_m]^\top. The components of X_k are updated in parallel according to equation 2.1 at time t + k, while using the most recent values of X_1, . . . , X_{k-1} computed at times t + 1, . . . , t + k - 1, respectively. That is, at time t + 1, the neurons in X_1 update their states in parallel according to equation 2.1, using x(t) as the previous state vector and giving x(t + 1) as the new state vector, with the updated values of X_1 replacing their old values and the other components remaining unchanged. At time t + 2, the neurons in X_2 update their states in parallel using x(t + 1) as the previous state vector. And so on. Let

D = W + (2I - A)A^{-1}M^{-1}
(8.1)
denote the matrix in proposition 2, and let D_i denote the submatrix of D whose row and column indices are restricted to X_i.

Proposition 4. If W is symmetric, a sufficient condition for equation 2.1 to be cycle free in block-sequential dynamics is that all the matrices D_i are positive definite.

Proof. Use the energy function L(t) from the proof of proposition 1. If at time t + k only the neurons in block X_k update their states simultaneously, then, from equation 4.2,

\Delta L(t) \le -\frac{1}{2}\,\Delta X_k(t)^{\top} D_k\, \Delta X_k(t).

Since all the D_i are positive definite, ΔL(t) is negative, and hence the block-sequential dynamics is globally convergent.

Proposition 4 connects the sequential and parallel special cases very well. For the sequential dynamics, there are n blocks X_i = x_i; hence D_i = [d_ii], and D_i being positive definite means that

w_{ii} + \frac{2 - \alpha_i}{\alpha_i \mu_i} > 0,

which gives proposition 1. For the parallel dynamics, there is a single block X_1 = x; hence D_1 = D. The result is exactly proposition 2.
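The condition of proposition 4 is also easy to check numerically. In the sketch below (example weights, gains, and blocks are made up), the same two-neuron network passes the sequential test but fails the one-block parallel test, reflecting the contrast between propositions 1 and 2.

```python
import numpy as np

def blocks_cycle_free(W, alpha, mu, blocks):
    """Sufficient condition of proposition 4: every submatrix D_i of
    D = W + (2I - A) A^{-1} M^{-1} restricted to a block is positive definite."""
    D = W + np.diag((2.0 - alpha) / (alpha * mu))
    return all(np.linalg.eigvalsh(D[np.ix_(b, b)]).min() > 0.0 for b in blocks)

W = np.array([[0.0, -1.5], [-1.5, 0.0]])
alpha, mu = np.full(2, 0.8), np.full(2, 2.0)
print(blocks_cycle_free(W, alpha, mu, [[0], [1]]))   # sequential blocks: True
print(blocks_cycle_free(W, alpha, mu, [[0, 1]]))     # one parallel block: False
```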
The usefulness of proposition 4 goes beyond these two cases, however. As the number of blocks is decreased in going from the sequential to the parallel case, the condition becomes more restrictive, whereas the parallelism in the network increases. This might help in choosing an appropriate block size for evolving a network according to equation 2.1 in associative memory or optimization applications so as to maximize parallelism without admitting cycles.

Finally, the sufficient condition of proposition 4 generalizes a result of Goles et al. (1985) for the cycle-free evolution of equation 1.2 under block-sequential dynamics, with σ_i as the signum function. Let W_i denote the submatrix of W whose indices are restricted to X_i. When µ → ∞, the sigmoidal functions tend to the signum functions. For α_i ≥ ε for some fixed positive number ε, the condition of proposition 4 reduces to the condition that all W_i be positive definite, which is the one in Goles et al. (1985), except that in the latter case, the W_i have to be positive definite only on the set {-1, 0, 1}^{n_i}, where n_i denotes the size of the block X_i.

Corollary 2. If W is symmetric, a sufficient condition for equation 2.1 to be cycle free in block-sequential dynamics is that the matrix D given by equation 8.1 is positive definite.

Proof. We use the following fact from linear algebra (see, e.g., Satake, 1975): if the matrix D in equation 8.1 is positive definite, then for any distinct i_1, i_2, . . . , i_m, the submatrix of D formed by the entries at the intersections of the rows i_1, . . . , i_m and the columns i_1, . . . , i_m is also positive definite.

9 Distributed Dynamics

In distributed dynamics, in each time step, a nonempty subset of neurons is chosen to update its states in parallel, in such a way that each neuron updates its state infinitely many times over all time steps t. Distributed dynamics incorporates, as special cases, the sequential and parallel dynamics (by restricting the subsets to have cardinality 1 or n) and the block-sequential dynamics (by imposing additional constraints). Distributed dynamics, together with possible communication delays, constitutes totally asynchronous dynamics, which provides a general model for parallel and distributed computation. For details of this and its generalization to continuous-time systems, see Bertsekas and Tsitsiklis (1989) and Wang, Li, and Blum (1994) and references therein.

Fractional dynamics is another special case of distributed dynamics, in which exactly r neurons update their states in every time step, with r fixed in advance between 1 and n. Fractional dynamics was proposed by Ballard, Gardner, and Srinivas (1987) to approximately solve optimization problems encoded into the Hopfield model.
In distributed dynamics, a nonempty subset X_t = {x_{i_1}, x_{i_2}, . . . , x_{i_m}} (m ≥ 1 and 1 ≤ i_j ≤ n) of the n neurons is chosen at time t. Neurons in X_t update their states in parallel according to the DT update equations at time t + 1; neurons not in X_t remain unchanged at time t + 1. Each x_i occurs in infinitely many subsets X_t, over all time steps t. Distributed dynamics was studied by Herz and Marcus (1993) for the evolution of equation 1.2, with σ_i as a sigmoidal function. Here we study this dynamics for the evolution of equation 2.1. Let D_{X_t} denote the square submatrix of the matrix D of equation 8.1 whose row and column indices are restricted to the set X_t of active neurons at time t.

Proposition 5. If W is symmetric, a sufficient condition for equation 2.1 to be cycle free in distributed dynamics is that the matrices D_{X_t} are positive definite at all times t.

Proof.
The proof is similar to that of proposition 4.
Corollary 3. If W is symmetric, a sufficient condition for equation 2.1 to be cycle free in distributed dynamics is that the matrix D given by equation 8.1 is positive definite. Proof.
The proof follows from corollary 2 and proposition 5.
Corollary 3, together with our earlier results, indicates that the cycle-free behavior of equation 2.1 ensured by the condition in equation 4.1 is robust under a variety of dynamics.

10 Conclusions

The continuous-time leaky integrator model is well known to be globally convergent to fixed points. A particular discrete-time (iterated-map) variant of this model has been well studied by others in the past. Previous results ensuring global convergence of this DT model have imposed somewhat stringent conditions on the network dynamics and parameters. Our article has studied a different DT model, which spans the CT model and the iterated-map DT model at its extremes. Results ensuring global convergence have been obtained for this DT model, under milder conditions than for the previous results on the iterated-map model. Our results have generalized several previous results (Hopfield, 1982, 1984; Goles et al., 1985; Marcus & Westervelt, 1989; Herz & Marcus, 1993) and span parallel, sequential, block-sequential, and distributed dynamics. Finally, the new DT model and our results are potentially of more relevance to applications (e.g., optimization) than the iterated-map model and previous results on it.
Acknowledgments

We thank anonymous referees for their helpful comments and reference to Herz and Marcus (1993). We also thank Eric Mjolsness and Anand Rangarajan for pointers to some recent related work by Waugh, Marcus, and Westervelt.
References

Ballard, D. H., Gardner, P. C., & Srinivas, M. A. (1987). Graph problems and connectionist architectures (Tech. Rep.). Rochester, NY: Department of Computer Science, University of Rochester.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation: Numerical methods. Englewood Cliffs, NJ: Prentice Hall.
Blum, E. K., & Wang, X. (1992). Stability of fixed points and periodic orbits and bifurcations in analog neural networks. Neural Networks, 5, 577-587.
Brown, D. P. (1992). Matrix tests for period 1 and 2 limit cycles in discrete threshold networks. IEEE Transactions on Systems, Man, and Cybernetics, 22(3), 552-554.
Bruck, J. (1990). On the convergence properties of the Hopfield model. Proc. IEEE, 78, 1579-1585.
Cohen, M., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, SMC-13, 815-826.
Goles, E. (1982). Fixed point behaviour of threshold functions on a finite set. SIAM Journal on Algebraic and Discrete Methods, 3, 529-531.
Goles, E. (1986). Antisymmetrical neural networks. Discrete Applied Mathematics, 13, 97-100.
Goles, E., Fogelman-Soulie, F., & Pellegrin, D. (1985). Decreasing energy functions as a tool for studying threshold networks. Discrete Applied Mathematics, 12, 261-277.
Herz, A. V. M., & Marcus, C. M. (1993). Distributed dynamics in neural networks. Physical Review E, 47(3), 2155-2161.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hirsch, M. W. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331-349.
Hopfield, J. J. (1982). Neural networks and physical systems with collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554-2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81, 3088-3092.
Koiran, P. (1994). Dynamics of discrete time, continuous state Hopfield networks. Neural Computation, 6(3), 459-468.
Marcus, C. M., Waugh, F. R., & Westervelt, R. M. (1990). Associative memory in an analog iterated-map neural network. Physical Review A, 41, 3355-3364.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40(1), 501-504.
Peretto, P. (1984). Collective properties of neural networks: A statistical physics approach. Biological Cybernetics, 50, 51-62.
Peterson, C., & Anderson, J. R. (1988). Neural networks and NP-complete optimization problems: A performance study on the graph bisection problem. Complex Systems, 2(1), 59-89.
Ruelle, D. (1989). Elements of differentiable dynamics and bifurcation theory. San Diego: Academic Press.
Satake, I. (1975). Linear algebra. New York: Marcel Dekker.
Wang, X., & Blum, E. K. (1992). Discrete-time versus continuous-time neural networks. Journal of Computer and System Sciences, 49(1), 1-17.
Wang, X., Li, Q., & Blum, E. K. (1994). Asynchronous dynamics of continuous-time neural networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 493-500). San Mateo, CA: Morgan Kaufmann.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition. I. Dynamics and stability. Physical Review E, 47, 4524-4536.
Xu, Z., & Kwong, C. P. (1995). Global convergence and asymptotic stability of asymmetric Hopfield networks. Journal of Mathematical Analysis and Applications, 191, 405-427.

Received August 28, 1996; accepted October 2, 1997.
LETTER
Communicated by Bard Ermentrout
Pattern Generation by Two Coupled Time-Discrete Neural Networks with Synaptic Depression

W. Senn
Th. Wannier
J. Kleinle
H.-R. Lüscher
L. Müller
J. Streit
K. Wyler
Universität Bern, Bühlplatz 5, CH-3012 Bern, Switzerland
Numerous animal behaviors, such as locomotion in vertebrates, are produced by rhythmic contractions that alternate between two muscle groups. The neuronal networks generating such alternate rhythmic activity are generally thought to rely on pacemaker cells or well-designed circuits consisting of inhibitory and excitatory neurons. However, experiments in organotypic cultures of embryonic rat spinal cord have shown that neuronal networks with purely excitatory and random connections may oscillate due to their synaptic depression, even without pacemaker cells. In this theoretical study, we investigate what happens if two such networks are symmetrically coupled by a small number of excitatory connections. We discuss a time-discrete mean-field model describing the average activity and the average synaptic depression of the two networks. Depending on the parameter values of the depression, the oscillations will be in phase, antiphase, quasiperiodic, or phase trapped. We put forward the hypothesis that pattern generators may rely on activity-dependent tuning of synaptic depression.

1 Introduction

Spinal pattern generators rely on intrinsic spinal circuits activated by descending signals from the brain stem that autonomously coordinate the alternate rhythmic activity inducing walking or swimming. The basis of the alternate oscillations is thought to consist of reciprocal inhibitory synaptic interactions between two groups of neurons (Getting, 1989). We suggest that this same functionality can be realized by depressing synapses with purely excitatory connections between and within the two neural groups. The proposed mechanism for pattern generation seems to be particularly attractive since in the early stage of prenatal development, the inhibitory connections are not yet developed but rhythmic activity can already be
observed (Gao & Ziskind-Conhaim, 1995). Moreover, the transition between different activity patterns does not need to be tuned by descending input or additional circuits. Rather, these transitions may be induced by changing the parameters controlling the dynamics of the synaptic depression without affecting the connection strength between the networks. The synaptic parameters themselves can be thought of as being activity dependent, and this will allow the spinal cord to recall different rhythmic patterns in a self-organizing way. An activity-dependent regulation of synaptic depression has indeed been found experimentally in other regions of the nervous system (Markram & Tsodyks, 1996; Abbott, Varela, Sen, & Nelson, 1997), and current network simulations suggest that this synaptic self-regulation may indeed be necessary to generate spinal cord oscillations in a narrower physiological parameter regime (Streit & Senn, 1997).

In embryonic rat spinal cord cultures, rhythmical activities are induced by pharmacological blockade of inhibitory synaptic transmission (Streit, 1993). A time-discrete mathematical model has been developed to investigate the observed phenomena, and its analysis has shown that rhythmical activities characterized by the synchronous firing of numerous neurons can be obtained from randomly connected excitatory neurons with synaptic depression (Senn et al., 1996). However, it is not clear whether the alternating activity typical for spinal pattern generators can also be based on such mechanisms. To analyze this question, we investigated the activity generated by a symmetrical coupling between two identical networks (called isotropic coupling), as it may occur between the two sides of the spinal cord. We describe the state of a network by two time-dependent variables: the average activity and the average synaptic depression. Guided by observations in organotypic cultures of embryonic rat spinal cord, the strength of synaptic transmission between two neurons is taken to be depressed after the occurrence of an action potential in the presynaptic cell and to return to its normal value with an exponential time course. We found that the time constant of the recovery from synaptic depression is an adequate parameter to control the dynamics of the coupled system. In the spinal slice cultures, the time constant of synaptic depression could be pharmacologically modulated by atropine, leading to the speculation of a physiological modulation by presynaptic receptors. Depending on this time constant, the activities of the networks will either converge to a fixed value or show various types of rhythmic behavior. We observe the following oscillations in the two networks: inphase (0 degree phase lag), antiphase (180 degree phase lag), out-of-phase (constant phase lag other than 0 or 180 degrees), quasiperiodic (oscillating amplitudes), phase trapped (oscillating phase lag), or phase walkthrough (constant phase drift). In addition, we find bistability, where both inphase and antiphase oscillations are stable at the same parameter values.

Coupled neural oscillators have been studied theoretically in different contexts. They have been found to occur in the locomotion of the lamprey
(Kopell, 1988), in circadian rhythms (Kawato & Suzuki, 1980), in paired cell oscillations (Rinzel & Ermentrout, 1989; Sherman & Rinzel, 1992), in short-term memory (Horn & Usher, 1991), and more recently in elementary pattern recognition, where they serve for dynamic feature binding and pattern segmentation (Gerstner, Ritz, & van Hemmen, 1993; Wang & Terman, 1997). In the light of the recent discoveries of synaptic depression in neocortex (Markram & Tsodyks, 1996; Abbott et al., 1997), one can ask whether the different oscillation patterns that could emerge from synaptic depression also play a role in higher cognitive functions.

From a mathematical point of view, coupled oscillators are classified according to the strength of coupling, and different techniques have been developed for different coupling strengths. If the coupling is weak, the effect of the coupling depends on only the phase difference between the oscillators. This approach of reducing each oscillator to a single phase variable is based on the average phase difference (APD) theory (Ermentrout & Kopell, 1991), which until now has been used only in the context of time-continuous systems. By applying these ideas to the time-discrete case, we can explain phase-locking phenomena such as inphase and antiphase oscillations (section 3). Unlike the time-continuous case, where only phase-locked solutions exist for isotropically coupled oscillators of the same frequency, phase trapping is also possible in the time-discrete case. If the coupling becomes stronger, the amplitudes of the oscillations are disturbed as well, and quasiperiodicity can occur in both cases. The technique for investigating this stronger coupling is bifurcation analysis, which we apply to the time-discrete model (section 4).

2 The Model

A neuron is considered to be a threshold element producing an action potential when the sum of incoming excitatory postsynaptic potentials (EPSPs) exceeds some defined threshold value (see Figure 1a). The number of (excitatory) synaptic connections projecting onto each of a population of cells is assumed to be Poisson distributed with mean µ. Thus, the probability that there are m connections onto a specific cell is (µ^m/m!) e^{-µ}. This probability is assumed to be the same for each cell in the network, a reasonable assumption as long as one considers cultures of embryonic spinal cord. Although there are on average µ anatomical connections ending up on a single cell, only a fraction of these synaptic connections is functionally active and may induce an EPSP in the postsynaptic cell. Obviously the presynaptic cell must be active, but at the same time, the synaptic transmission must function reliably. On average, the number of active connections is reduced to µ a_t s_t, where a_t ∈ [0, 1] is the average activity of cells and s_t ∈ [0, 1] is the average synaptic reliability at time t. Thus, the probability that there are m active connections projecting onto a cell at time t is ((µ a_t s_t)^m / m!) e^{-µ a_t s_t}.

Figure 1: (a) Each neuron is modeled by a threshold element with time-dependent synaptic reliability. The number of connections onto cell j is Poisson distributed with mean µ. A possible action potential from cell i to cell j is transmitted only with reliability s_ji ∈ [0, 1]. (b) Two weakly coupled networks with random excitatory connections.

If a cell needs m°
incoming EPSPs to exceed a set threshold, the probability of triggering an action potential at time t + 1 becomes

\sum_{m \ge m_\circ} \frac{(\mu a_t s_t)^m}{m!}\, e^{-\mu a_t s_t} = \frac{1}{\Gamma(m_\circ)} \int_0^{\mu a_t s_t} x^{m_\circ - 1} e^{-x}\, dx.

The equation is obtained by integrating the right-hand side m° - 1 times by parts. If the threshold is normalized to 1 and if K represents the height of an average EPSP, the cell needs at least m° = 1/K EPSPs to reach threshold. The average activity a_{t+1} of the whole network at time t + 1 is equal to the probability that any single cell is activated. Thus,

a_{t+1} = F_K(\mu a_t s_t), \qquad \text{where } F_K(y) = \frac{1}{\Gamma(1/K)} \int_0^{y} x^{1/K - 1} e^{-x}\, dx.   (2.1)
In the embryonic spinal cord cultures of the rat, an EPSP is found to be very large and is roughly 0.8 times threshold (which itself is ∼ 13 mV above the resting potential; Streit, 1993). We therefore fix the height of an undepressed EPSP at K = 0.8. In the same cultures, the average number of connections is estimated to lie between 3 and 20. The average conduction delay between two cells is ∼ 14 ms, and this time delay is assumed to elapse between time step t and t + 1 in our model (Streit, Spenger, & Lüscher, 1991). The long conduction delay is explained by the immaturity of the cells and the fact that axons are not myelinated. The average synaptic reliability $s_{t+1}$ is calculated from its value $s_t$ and the average activity $a_t$ at the previous time step. Due to effects of synaptic depression, $s_{t+1}$ becomes small if the previous activity $a_t$ was large. The synaptic reliability will therefore be depressed by a factor $d_\tau(a_t)$, where $d_\tau(y) = 1 - y\, e^{-1/\tau}$ is a linearly decreasing function in y. The constant τ is
interpreted as the time constant of recovery from synaptic depression and is referred to as the synaptic depression time (constant). It has been found to be ∼ 220 ms (Streit et al., 1991), which corresponds to 15 time steps in our model. In addition, due to synaptic recovery effects, the synaptic reliability $s_{t+1}$ should be small if it was small during the previous time step. This is expressed by a further depression factor $d_\tau(1 - s_t)$, a linearly increasing function of $s_t$. Including both effects, the synaptic reliability is given by
$$s_{t+1} = d_\tau(a_t) \cdot d_\tau(1 - s_t), \qquad d_\tau(y) = 1 - y\, e^{-1/\tau} . \tag{2.2}$$
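As a concrete illustration, here is a minimal Python sketch of the single-network mean-field update defined by equations 2.1 and 2.2. It assumes that the regularized lower incomplete gamma function equals $F_K$ (section A.1 notes that MATLAB's gammainc was used for exactly this purpose; scipy.special.gammainc is the analogous routine). The initial conditions are arbitrary assumptions.

```python
import numpy as np
from scipy.special import gammainc  # regularized lower incomplete gamma P(a, x)

K = 0.8      # height of an undepressed EPSP relative to threshold
mu = 9.0     # mean number of intra-network connections (example of section 3)
tau = 9.5    # synaptic depression time constant, in time steps of ~14 ms

def F(y):
    # F_K(y) = (1/Gamma(1/K)) * integral_0^y x^(1/K - 1) e^(-x) dx  (equation 2.1)
    return gammainc(1.0 / K, y)

def d(y):
    # depression factor d_tau(y) = 1 - y * exp(-1/tau)  (equation 2.2)
    return 1.0 - y * np.exp(-1.0 / tau)

a, s = 0.2, 1.0                      # assumed initial activity and reliability
trajectory = []
for t in range(500):
    a, s = F(mu * a * s), d(a) * d(1.0 - s)   # both updates use the old (a, s)
    trajectory.append((a, s))
```

For these parameter values the iterates should, after a transient, trace out the invariant closed curve in the (a, s)-plane on which the phase variable of section 3 is defined.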
The formula in equation 2.2 can also be derived by looking at the history of a single synapse and a subsequent averaging over the ensemble. Let $s_{ji}(t)$ be the individual synaptic reliability of the connection from cell i to cell j at time t. As a function of the activity of the presynaptic cell i during the past two time steps t − 1 and t − 2, we set
$$s_{ji}(t) = \begin{cases} 1, & \text{if cell } i \text{ was active at neither time } t-2 \text{ nor } t-1 \\ 1 - e^{-2/\tau}, & \text{if cell } i \text{ was active at time } t-2 \text{ but not at } t-1 \\ 1 - e^{-1/\tau}, & \text{if cell } i \text{ was active at time } t-1 \text{ but not at } t-2 \\ (1 - e^{-1/\tau})(1 - e^{-2/\tau}), & \text{if cell } i \text{ was active at time } t-1 \text{ and } t-2 . \end{cases} \tag{2.3}$$
The idea is to quantify the influence of a synaptic transmission that occurred $\Delta t$ time steps before by depressing $s_{ji}(t)$ with a factor $(1 - e^{-\Delta t/\tau})$. Taking into account only the two preceding time steps and averaging over i and j yields, together with the first-order approximation $s_{t-1} \approx 1 - a_{t-2}\, e^{-1/\tau}$, the final formula, equation 2.2 (cf. Senn et al., 1996).

We now consider two networks with identical parameter values that are weakly symmetrically coupled in both directions (see Figure 1b). The average number of (excitatory) connections between cells of networks i and j is denoted by $\mu_{ji} = \mu_{ij}$ ($i, j \in \{1, 2\}$, $i \neq j$). The number of connections between the networks is again assumed to be Poisson distributed with mean $\mu_{ij}$. Thus, the average number of connections active at time t and terminating on a cell of population i consists of active connections of its own population and of the other population and sums to $\mu_{ii} a^i_t s^i_t + \mu_{ij} a^j_t s^j_t$. Here, $a^i_t$ and $s^i_t$ (i = 1, 2) denote the average activity and the average synaptic reliability, respectively, of population i. From equations 2.1 and 2.2, we obtain the time-discrete evolution of the coupled system:
$$a^1_{t+1} = F_K(\mu a^1_t s^1_t + \mu_{12} a^2_t s^2_t), \qquad a^2_{t+1} = F_K(\mu a^2_t s^2_t + \mu_{21} a^1_t s^1_t),$$
$$s^1_{t+1} = d_\tau(a^1_t) \cdot d_\tau(1 - s^1_t), \qquad s^2_{t+1} = d_\tau(a^2_t) \cdot d_\tau(1 - s^2_t). \tag{2.4}$$
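The coupled map of equation 2.4 is an equally small extension. The following sketch reuses F and d from the previous fragment and assumes symmetric coupling $\mu_{12} = \mu_{21} = \mu_{ij}$; the starting values are arbitrary choices that break the symmetry between the two networks.

```python
mu_ij = 0.02     # weak symmetric coupling (one of the values used in section 3)

def step(a1, s1, a2, s2):
    # one iteration of the coupled system (equation 2.4)
    return (F(mu * a1 * s1 + mu_ij * a2 * s2),
            d(a1) * d(1.0 - s1),
            F(mu * a2 * s2 + mu_ij * a1 * s1),
            d(a2) * d(1.0 - s2))

a1, s1, a2, s2 = 0.2, 1.0, 0.25, 0.9     # assumed asymmetric initial state
for t in range(5000):
    a1, s1, a2, s2 = step(a1, s1, a2, s2)
```

Plotting $a^1_t$ and $a^2_t$ against t for $\mu_{ij} = 0.0001$ versus $\mu_{ij} = 0.02$ should reproduce the antiphase and in-phase locking discussed in section 3.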
In the following discussion, we qualitatively distinguish between weak coupling, which essentially affects only the phases and frequencies of the individual oscillators, and medium-strength coupling, which also influences their amplitudes. In the scenario of weak coupling, we set the average number $\mu_{ij}$ of connections between cells of different networks to ∼ 0.001 (compare this to the typical number µ ∼ 10 of intrinsic connections per cell), and for medium-strength coupling we set it to 0.1. It is also possible for pairs of networks to be strongly coupled, where $\mu_{ij}$ is larger than ∼ 1. Such oscillators produce only in-phase oscillations.

3 Weak Coupling: Average Phase Difference Theory

3.1 Average Phase Difference Theory for Discrete Time. In APD theory, the description of an oscillator is reduced to a single phase variable. This phase corresponds to the angle that fixes the position of the oscillator on the closed curve representing the dynamics in phase space (cf. Figure 10b). The interaction of two oscillators is then studied by its influence on the phase difference between the oscillators. In the following, we apply this theory, which originally was developed for continuous time (Kopell, 1988), to the present case of discrete time. We assume that the mapping $(a_t, s_t) \to (a_{t+1}, s_{t+1})$ of the uncoupled system is smooth and that there is a stable invariant closed curve describing the uncoupled oscillator. We also assume that the uncoupled mapping is transformed in such a way that on the closed curve, it shows a constant increment of the phase $\theta_t$, $\theta_{t+1} - \theta_t = \omega$. This is always possible if the rotation number ω/2π of the invariant closed curve is irrational (Guckenheimer & Holmes, 1990; Denjoy's theorem). The constant ω is interpreted as the radial frequency of the phase variable θ. (Note that the radial frequency ω can be negative, while the natural frequency f = |ω|/2π of the oscillator is always positive.) Now we consider two weakly symmetrically coupled identical oscillators with smooth coupling function $h_\varepsilon(\theta^i_t, \theta^j_t)$ with
$$0 < |h_\varepsilon| \le \varepsilon \ll 1 . \tag{3.1}$$
The coupling is supposed to change the (instantaneous) radial frequency ω of oscillator i by some quantity $h_\varepsilon$, depending on the actual phases $\theta^i_t$ and $\theta^j_t$ of oscillators i and j, respectively. The function $h = h_\varepsilon(\theta^i_t, \theta^j_t)$ defined this way describes the acceleration of the frequency of oscillator i during one time step caused by the coupling. Thus, the equations $\Delta\theta^i_t \equiv \theta^i_{t+1} - \theta^i_t = \omega$ (i = 1, 2) of the uncoupled system become, in the presence of symmetric coupling,
$$\Delta\theta^1_t = \omega + h_\varepsilon(\theta^1_t, \theta^2_t), \qquad \Delta\theta^2_t = \omega + h_\varepsilon(\theta^2_t, \theta^1_t). \tag{3.2}$$
If ε is small, the phase difference between the two oscillators does not change too much during n (≫ 1) iterations as long as n ≪ 1/ε. Moreover, if ω/2π is
irrational, the iterated points $\theta^i_0 = 0$, $\theta^i_1 \approx \omega$, $\theta^i_2 \approx 2\omega, \ldots, \theta^i_n \approx n\omega$ start to sample the invariant curve (ε-)densely and (ε-)uniformly, and this allows averaging the coupling effect over the closed curve. Hence, dropping the index ε, the average influence of the coupling at phase lag φ is approximately
$$H(\phi) = \frac{1}{2\pi} \int_0^{2\pi} h(\theta, \theta + \phi)\, d\theta, \tag{3.3}$$
and this represents the average acceleration per time step of one oscillator if it lags the other by φ. For a numerical evaluation of the function H, we refer to section A.1 (an algorithm in the time-continuous case is presented in Williams & Bowtell, 1997). The average phase increments of the two isotropically coupled oscillators may now be written as
$$\Delta\theta^1_t = \omega + H(\phi_t), \qquad \Delta\theta^2_t = \omega + H(-\phi_t), \tag{3.4}$$
where $\phi_t = \theta^2_t - \theta^1_t$. The condition that the two oscillators are phase locked is $\phi_{t+1} - \phi_t = 0$. From equation 3.4, one calculates
$$\phi_{t+1} - \phi_t = \Delta\theta^2_t - \Delta\theta^1_t = H(-\phi_t) - H(\phi_t) = -2 H_{odd}(\phi_t) = 0, \tag{3.5}$$
where $H_{odd}(\phi) := \frac{1}{2}\big(H(\phi) - H(-\phi)\big)$ denotes the antisymmetric part of H. Thus, two solutions $\theta^1_t$ and $\theta^2_t$ are phase locked with constant phase lag $\phi_\circ$ if and only if $H_{odd}(\phi_\circ) = 0$. The solutions are stably phase locked if, for any small deviation $\phi_t$ from $\phi_\circ$, the successive phase lags $\phi_{t+1}, \phi_{t+2}, \ldots$ converge back to $\phi_\circ$. It is a simple result from the iteration theory of real functions that a function $\phi_{t+1}(\phi_t)$ is stable at $\phi_\circ$ if it crosses the diagonal $\phi_{t+1} = \phi_t$ at this point with an angle less than 45 degrees, that is, if $|d\phi_{t+1}/d\phi_t| < 1$ there (consider Figure 1b in Senn et al., 1996, with a replaced by φ, or consider the standard book of Collet & Eckmann, 1986). According to equation 3.5, one has $\phi_{t+1} = \phi_t - 2H_{odd}(\phi_t)$, and the stability condition turns out to be equivalent to
$$0 < \left. \frac{dH_{odd}(\phi)}{d\phi} \right|_{\phi = \phi_\circ} < 1 . \tag{3.6}$$
Thus, equation 3.6 is the condition that the two oscillators, locked at lag $\phi_\circ$, are actually stably locked. Notice that the upper bound is due to the time discretization and does not occur for continuous time (Kopell, 1988, formula 3.7). To make a rigorous statement, we must restrict ourselves to rotation numbers ω/2π that are badly approximated by rational numbers, that is, which satisfy for some c, ν > 0 the relation
$$\left| \frac{\omega}{2\pi} - \frac{p}{q} \right| \ge \frac{c}{q^{2+\nu}} \quad \text{for all } p \in \mathbb{Z} \text{ and } q \in \mathbb{N} . \tag{3.7}$$
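In practice, once H(φ) has been tabulated on a uniform grid (by the averaging of equation 3.3; the numerical recipe is given in section A.1), the phase-locked lags and their stability can be read off directly. The following Python sketch is one way to do this; the sampled array H is assumed to be given.

```python
import numpy as np

def locked_lags(H):
    """H: samples of the averaged coupling H(phi) at phi_k = 2*pi*k/N.
    Returns (phase lag, stable?) for each zero of H_odd (equation 3.5);
    stability is tested with criterion (3.6), 0 < H_odd'(phi) < 1."""
    N = len(H)
    phi = 2.0 * np.pi * np.arange(N) / N
    H_minus = np.roll(H[::-1], 1)            # H(-phi_k) = H(2*pi - phi_k)
    H_odd = 0.5 * (H - H_minus)
    slope = np.gradient(H_odd, phi)          # numerical derivative of H_odd
    results = []
    for k in range(N):
        k2 = (k + 1) % N
        if H_odd[k] == 0.0 or H_odd[k] * H_odd[k2] < 0.0:   # sign change
            results.append((phi[k], 0.0 < 0.5 * (slope[k] + slope[k2]) < 1.0))
    return results
```

By the theorem stated below, the zeros come in pairs, and at most every second one can be reported as stable.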
Condition 3.7 is necessary to exclude a self-locking of the coupled oscillators on a rational frequency, which could make the approximation in equation 3.3 bad. Our investigations lead to the following theorem:

Theorem 1. Suppose the mapping describing the uncoupled system exhibits a stable invariant closed curve with rotation number $\omega/2\pi \in \mathbb{R} \setminus \mathbb{Q}$ being badly approximated by rationals. We assume a coupling of the form in equation 3.2 with coupling function $h_\varepsilon$ satisfying equation 3.1. Then, for small ε, the dynamics of the coupled system in equation 3.2 is approximated by the averaged equations (see equation 3.4). The number of phase-locked solutions of equation 3.4 is even and, enumerating them according to increasing phase lag, at most every second of these solutions is stably phase locked. If there are exactly two phase-locked solutions, the corresponding phase lags must be $\phi_\circ = 0$ and $\phi_\circ = \pi$, and at least one of them is unstable. A solution locked at lag $\phi_\circ$ is stable if and only if equation 3.6 is satisfied. The change of the radial frequency ω from the uncoupled oscillation to the phase-locked oscillation is approximately given by $H(\phi_\circ)$.

Remark. The restriction of ω/2π to numbers that are badly approximated by rational numbers is rather severe since, although these numbers have full measure, the parameter values leading to an invariant curve with rational rotation number generically form open-dense sets (the so-called Arnold tongues; cf. Arnold, 1965) in the parameter space. On the other hand, the parameter values with irrational rotation numbers generically have positive measure as well (Herman, 1983; Guckenheimer & Holmes, 1990).

Proof of Theorem 1. Since the uncoupled mapping exhibits an invariant curve, it is possible to define the coupling function $h_\varepsilon(\theta^i_t, \theta^j_t)$ together with its average H(φ) (cf. section A.1). If the rotation number ω/2π of the uncoupled mapping is irrational, the angles $\theta_t = t\omega + \theta_\circ$ (t = 1, 2, . . .) of the uncoupled system are uniformly distributed (see, e.g., Hlawka, 1979), and for small ε, the temporal average of the interactions converges to the spatial average H(φ) in the same way as is the case for multiple-pulse-coupled oscillators (Ermentrout & Kopell, 1991). In the time-discrete case, we are faced with the additional problem that the single oscillator might be self-locked onto a fixed frequency and a fixed phase and that this self-locking force could dominate the influence of the coupling between the two networks. It remains to be proved in a mathematically rigorous way that if ω/2π is badly approximated by rationals, the self-locking can be excluded, since for small perturbations, the locking at the frequency ω/2π is of lower order than the interaction $h_\varepsilon$ (cf. Arnold, 1965). For small ε, the qualitative dynamics of the coupled system in equation 3.2 will then be the same as for the averaged system in equation 3.4. To discuss the approximation in equation 3.4, we note that according to equation 3.5, every zero $\phi_\circ$ of $H_{odd}$ leads to a phase-locked solution of
equation 3.4, and $\phi_\circ$ represents the constant phase lag from the first to the second oscillator. Now, the origin $\phi_\circ = 0$ is always a zero of $H_{odd}$ since $H_{odd}$ is antisymmetrical, that is, $H_{odd}(-\phi) = -H_{odd}(\phi)$. Similarly, $\phi_\circ = \pi$ is always a zero since $H_{odd}$ is antisymmetrical with respect to π, that is, $H_{odd}(\pi - \phi) = -H_{odd}(\pi + \phi)$. This last equation follows from the antisymmetry (with respect to 0) and the 2π periodicity of $H_{odd}$. But the antisymmetry with respect to π implies that every zero must occur symmetrically to π. Since these zeros are generically nondegenerate (i.e., $H'_{odd}(\phi_\circ) \neq 0$; see above), every new zero appears pairwise in the interval (0, 2π) (cf. Figure 2b). To prove the stability statements, one asserts that generically at every second zero, the function $H_{odd}$ crosses the zero line with positive slope. According to the criterion in equation 3.6, an upward crossing corresponds to a stably phase-locked solution if, in addition, the slope is < 1. Thus, at most every second zero can be stable. In the case of exactly two zeros (which must be 0 and π), either the in-phase solution, the antiphase solution, or neither is stable. The statement about the frequency follows directly from equations 3.4, from which the frequency of an oscillation locked at lag $\phi_\circ$ is found to be $\omega + H(\phi_\circ)$.

Apart from the restriction imposed onto the rotation number, there is another important difference between APD theory for discrete and continuous time. For continuous time, the phase-locked solutions for isotropic coupling generically arise in pairs symmetrically to the zero lag. Since for continuous time the stability criterion does not include the upper bound in equation 3.6, there is generically at least one attractive phase lag, and any oscillation will eventually stably phase-lock. For discrete time, however, the phase-locked solutions may all be unstable since there is no guarantee that every second zero of H will also satisfy the upper bound in equation 3.6. In such a case, the phase either waxes and wanes (phase trapping) or shows a constant drift (phase walkthrough). Thus, two weakly isotropically coupled time-discrete oscillators will, after some transient, be either phase locked, phase trapped, or in phase walkthrough. In the case of phase trapping, the phase lag can show any dynamical behavior known from the iteration theory of one-dimensional maps, such as $2^k$ periodicity, almost periodicity, and chaos (Collet & Eckmann, 1986). For continuous time, phase-trapped solutions were first investigated by Wever (1972) and Kronauer, Czeisler, Pilato, Moore-Ede, & Weitzman (1982) by means of an additional amplitude consideration.

3.2 Application of APD Theory to the Time-Discrete Model. In order to apply the theorem to the system (see equation 2.4), we have to ensure that the rotation number of the uncoupled oscillation ($\mu_{ij} = 0$) is irrational. This is the case if the limit set of an iterated point of the uncoupled system forms a closed curve in the (a, s)-phase plane. It turns out that for the mapping, there is indeed a high chance of finding parameter values (µ, τ, K) with such an invariant closed curve. According to the remark above, there is always a nonvanishing chance to get an irrational rotation number by randomly
Figure 2: A zero $\phi_\circ$ of the function $H_{odd}$ represents stable oscillations with phase lag $\phi_\circ$. (a, b) For very weak coupling strength $\mu_{ij} = 0.0001$, the phase lag $\phi_\circ = \pi$ satisfies the stability criterion (see equation 3.6), and the antiphase oscillation is therefore stable. (c, d) For weak coupling with $\mu_{ij} = 0.02$, the slope of $H_{odd}$ is positive at the zero $\phi_\circ = 0$, and the in-phase oscillation is now stable.
choosing the parameter values. The chance to pick out "bad" parameter values from an Arnold tongue, which then lead to a discrete orbit with a rational rotation number, is smaller the weaker the coupling is. As an example, we consider the uncoupled system for the parameter values µ = 9.0, τ = 9.5, and K = 0.8 and calculate the function H(φ) for the two coupling strengths $\mu_{ij} = 0.0001$ and $\mu_{ij} = 0.02$ (see Figures 2b and 2d and section A.1 for more details). Looking at the zeros of the antisymmetric part $H_{odd}$ shows that in both cases, there are phase-locked solutions at $\phi_\circ = 0$ and $\phi_\circ = \pi$. However, applying the criterion in equation 3.6, only the antiphase solution at coupling strength $\mu_{ij} = 0.0001$ (see Figure 2b, $\phi_\circ = \pi$) and the in-phase solution at $\mu_{ij} = 0.02$ (see Figure 2d, $\phi_\circ = 0$), for which the
Figure 3: (a) One network may "stumble" and oscillate twice in a silent period of the other (µ = 10, τ = 12, $\mu_{ij}$ = 0.002). (b) The corresponding phase diagram $(a^1, a^2, s^2)$ shows that the trajectory cuts off the blob of the $a^1$-oscillation during the time from 3.3–3.6 s (marked with ∗∗∗ in a and b), while for 3.9–4.2 s, the trajectory follows the invariant curve (◦◦◦).
slope of $H_{odd}$ at the zero is positive, are stably phase locked. Starting at the same initial condition, the system will converge after some transients to a stable antiphase oscillation (see Figure 2a) or to a stable in-phase oscillation (see Figure 2c). For some intermediate values of the coupling strength, the function $H_{odd}$ will cross the φ-axis at phase lags different from 0 and π, and out-of-phase oscillations with stable lags $\phi_\circ \neq 0, \pi$ can be observed. In general, increasing the coupling strength tends to stabilize the in-phase solution. For some parameter values, an interesting anomaly is discovered, which we call a stumbling solution. For these parameter values, one may have two oscillations of one network against one oscillation of the other network. After a transient period, the network eventually will phase-lock (see Figure 3).
4 Medium-Strength Coupling: A Bifurcation Analysis

If the coupling strength is increased by two orders of magnitude, we find new phenomena such as bistability (stable in-phase and antiphase oscillations) and quasiperiodicity (oscillating amplitude). To explain these phenomena, we fix the coupling strength between the networks at $\mu_{ij} = 0.1$ and discuss the bifurcations induced by changing the synaptic depression time τ and the coupling strength µ within the networks. First we transform
the system of coupled oscillators into appropriate coordinates to study the in-phase and antiphase oscillations. This canonical transform is given by the linear mapping $(a^1_t, s^1_t, a^2_t, s^2_t) \stackrel{S}{\longmapsto} (a^+_t, s^+_t, a^-_t, s^-_t)$, where
$$a^+_t := \frac{a^1_t + a^2_t}{2}, \quad s^+_t := \frac{s^1_t + s^2_t}{2}, \quad a^-_t := \frac{a^1_t - a^2_t}{2}, \quad s^-_t := \frac{s^1_t - s^2_t}{2}. \tag{4.1}$$
Analyzing the steady-state equations of system 2.4, we can show that there is at most one attracting fixed point within the open cube $(0, 1)^4$. This fixed point has the coordinates $(P_{fix}, P_{fix})$, where $P_{fix} = (a_\circ, s_\circ)$ is the "second" intersection point of the nullclines $C_{K,\mu+\mu_{ij}} = \{(a, s) \mid a = F_K((\mu + \mu_{ij})as)\}$ and $C_\tau = \{(a, s) \mid s = d_\tau(a)\, d_\tau(1 - s)\}$ (see Senn et al., 1996). For in-phase solutions, the coordinates $a^-$ and $s^-$ vanish, while for ("pure") antiphase solutions, one has $a^+_t = a_\circ$ and $s^+_t = s_\circ$ for all t. The fixed point within these new coordinates is $S((P_{fix}, P_{fix})) = (P_{fix}, 0)$. Let $\Phi: (a^+_t, s^+_t, a^-_t, s^-_t) \mapsto (a^+_{t+1}, s^+_{t+1}, a^-_{t+1}, s^-_{t+1})$ be the mapping (see equation 2.4) in the new coordinates. The crucial point is that the linearization of Φ at the fixed point diagonalizes to
$$d\Phi|_{(P_{fix}, 0)} = \begin{pmatrix} d\Phi^+ & 0 \\ 0 & d\Phi^- \end{pmatrix}, \tag{4.2}$$
where $d\Phi^\pm$ are 2 × 2 matrices (cf. section A.2). The tangent space decomposes into a direct sum $E^+ \oplus E^- \simeq \mathbb{R}^2 \times \mathbb{R}^2$ on which $d\Phi^+$ and $d\Phi^-$ act, respectively. $E^+$ is the space of the in-phase solutions and is spanned by the coordinates $(a^+, s^+)$, while $E^-$ is the space of the antiphase solutions, is spanned by $(a^-, s^-)$, and is translated to the fixed point. The stability of the fixed point within the subspaces $E^+$ and $E^-$ is determined by the stability of the corresponding restricted tangent mappings $d\Phi^\pm$. These linear mappings are stable if their eigenvalues have modulus smaller than 1 and unstable if the modulus is larger than 1. As shown in section A.2, the eigenvalues of $d\Phi^+$ and $d\Phi^-$, denoted by $\lambda^+, \bar\lambda^+$ and $\lambda^-, \bar\lambda^-$, respectively, are complex conjugates, and their modulus is calculated as
$$|\lambda^\pm| = \sqrt{(\mu \pm \mu_{ij})\, \eta\, \varepsilon_\tau\, (s_\circ + (1 - \varepsilon_\tau) a_\circ)}, \tag{4.3}$$
where we abbreviated
$$\eta := f_K((\mu + \mu_{ij})\, a_\circ s_\circ), \quad f_K(y) = \frac{1}{\Gamma(1/K)}\, y^{1/K - 1} e^{-y}, \quad \text{and} \quad \varepsilon_\tau := e^{-1/\tau}. \tag{4.4}$$
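The moduli in equation 4.3 are easy to evaluate numerically once the fixed point is known. The sketch below solves the nullcline equations for $P_{fix}$ with a generic root finder and then applies equations 4.3 and 4.4; the initial guess for the root finder is an assumption and may need adjusting to land on the "second" intersection point.

```python
import numpy as np
from scipy.optimize import fsolve
from scipy.special import gammainc, gamma

K, mu, tau, mu_ij = 0.8, 10.0, 15.0, 0.1     # one parameter set from section 4.2
eps = np.exp(-1.0 / tau)                      # epsilon_tau (equation 4.4)

F = lambda y: gammainc(1.0 / K, y)                                  # F_K
d = lambda y: 1.0 - y * eps                                         # d_tau
fK = lambda y: y ** (1.0 / K - 1.0) * np.exp(-y) / gamma(1.0 / K)   # F_K'

def nullclines(z):
    a, s = z
    return [a - F((mu + mu_ij) * a * s),      # C_{K, mu + mu_ij}
            s - d(a) * d(1.0 - s)]            # C_tau

a0, s0 = fsolve(nullclines, [0.5, 0.5])       # assumed guess for P_fix
eta = fK((mu + mu_ij) * a0 * s0)              # eta (equation 4.4)

mod = {pm: float(np.sqrt((mu + pm * mu_ij) * eta * eps
                         * (s0 + (1.0 - eps) * a0)))   # equation 4.3
       for pm in (+1, -1)}
print(mod)   # the text reports |lambda^-| = 1.011, |lambda^+| = 1.021 here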
Figure 4: (a) Bifurcation diagram for µ = 16. (b) Bifurcation diagram for µ = 10. At $\tau^+$ and $\tau^-$, a Naimark-Sacker bifurcation occurs in the planes $E^+$ and $E^-$, respectively. At $\tau^Q$, the invariant circle bifurcates into an invariant torus, which again merges into a circle at $\tau^q$. The following abbreviations are used: s = stable, u = unstable, F = fixed point, I = in-phase oscillation, A = antiphase oscillation, Q = quasiperiodic oscillation.
4.1 Bifurcations with Respect to the Depression Time Constant τ. By tuning the synaptic depression time τ, we find that $|\lambda^-| < 1$ and $|\lambda^+| < 1$ for τ smaller than some critical value $\tau^+$ (see Figures 4a and 4b). This shows that the fixed point is stable for $\tau < \tau^+$. At $\tau = \tau^+$, the eigenvalue $\lambda^+$ crosses the unit circle and generates a (forward, supercritical) Naimark-Sacker bifurcation (the discrete analog of a Hopf bifurcation for flows) in the subspace $E^+$. From this bifurcation, a stable invariant curve $C^+$ close to the subspace $E^+$ emerges, and this curve is composed of all in-phase solutions with different starting values. There is a second Naimark-Sacker bifurcation if the fixed point loses its stability in the directions of $E^-$ as well. This happens at $\tau = \tau^-$ when the eigenvalue $\lambda^-$ crosses the unit circle (see Figures 4a and 4b). At this point, an invariant curve $C^-$ emerges within $E^-$, which comprises the antiphase solutions for different starting values. At its birth, when $C^-$ is still near the fixed point, $C^-$ is unstable since the fixed point is unstable in the directions $E^+$ orthogonal to $C^-$. In the case µ = 16 (see Figure 4a), $C^-$ and thus the antiphase solutions become stable at some value $\tau^Q > \tau^-$ (e.g., through a backward torus bifurcation, which was not clearly identified numerically). Independent of this stability change of the antiphase solutions, the in-phase solution remains stable for all values $\tau > \tau^+$. As the bifurcation diagram (see Figure 4a) shows, we have bistability, in other words, the stability of the in-phase and antiphase oscillations at the same parameter values. Figure 5 shows that for τ = 9, both solutions may stabilize after a few oscillations. If the synaptic
depression time τ is increased further, first the in-phase and then the antiphase cycle disappears, and this leads to the extinction of any activity.

Fixing the connectivity parameter at µ = 10 (see Figures 4b and 6), we further observe a bifurcation of the stable invariant curve $C^+$ into a stable invariant 2-torus. This occurs at parameter value $\tau = \tau^Q$, where $C^+$ itself loses its stability. The limit set of a trajectory then consists of either the 2-torus itself (as for τ = 10 in Figure 6), a finite number of closed curves (see Figure 7a), or a finite number of points. These different limit sets arise if the components of the corresponding rotation vector $(\omega_1/2\pi, \omega_2/2\pi)$ are both irrational (2-torus), if one is irrational and the other rational (collection of closed curves), or if both are rational (collection of isolated points). Here, $\omega_1$ and $\omega_2$ are the radial frequencies of the activity oscillations and the slower amplitude oscillations, respectively. The phenomenon of having two oscillations with different time scales is sometimes referred to as quasiperiodicity. Between τ = 11 and τ = 12, the 2-torus splits again and forms a 3-torus (not shown in the bifurcation diagram) since we find three frequencies corresponding to a rotation vector $(\omega_1/2\pi, \omega_2/2\pi, \omega_3/2\pi)$ (see Figure 7b). The three different frequencies represent the fast frequency of the original oscillation ($\sim 6.3$ Hz), the slower frequency of the amplitude oscillations ($\sim 3 \cdot 10^{-2}$ Hz), and the slowest frequency of the amplitude-bound oscillations ($\sim 3 \cdot 10^{-4}$ Hz). In the phase space, the corresponding trajectory moves on a 3-torus isomorphic to the direct product of three circles. If only the first two oscillations are present, the trajectory lies on a 2-torus embedded in $[0, 1]^4$ (see Figure 7a). Examining the oscillations in Figure 7b more closely, one sees that the two solutions are phase trapped; the phase difference itself may oscillate (see Figure 8a). At $\tau = \tau^q$, the attracting 2-torus collapses to a stable cycle, which now represents a stable alternating oscillation (as for τ = 15 in Figure 6). Figure 8b shows that the different oscillation types found for static values of τ are also present if the parameter τ is increased dynamically. For mathematical correctness, we mention that in general, the limit sets in the phase space with irrational rotation numbers can also be homeomorphic to Cantor sets (1-, 2-, or 3-Cantori), and this is even the "typical" case in some measure-theoretic sense (Herman, 1983).

4.2 Bifurcations Within the Parameter Plane (µ, τ). We next investigate the activity patterns as a function of the connectivity number µ and the depression time constant τ (see Figure 9). From a phenomenological point of view, the possible oscillations we find for a pair of parameter values (µ, τ) are either in-phase or, more frequently, antiphase. This behavior coincides well with the prediction of the local fixed-point analysis, namely that for $(|\lambda^-| < 1, |\lambda^+| > 1)$ an in-phase and for $(|\lambda^-| > 1, |\lambda^+| > 1)$ an antiphase solution emerges. By the same local analysis, the activity converges to the fixed point $(P_{fix}, P_{fix})$ if both eigenvalues have modulus smaller than 1, and the activity dies out if this fixed point does not exist at all. Thus, the classification of the fixed point yields the following criterion (cf. Figure 9):
Figure 5: For particular parameter values (µ = 16, τ = 9, $\mu_{ij}$ = 0.1, K = 0.8), both the in-phase and the antiphase solutions are stable. This situation is possible only if the (pairwise complex conjugated) eigenvalues of the linearized mapping (see equation 2.4) at the fixed point have modulus greater than 1 ($|\lambda^-| = 1.012$, $|\lambda^+| = 1.019$). (a) With starting values $(a^1_0, s^1_0) = (0.2, 0.1)$ and $(a^2_0, s^2_0) = (0.2, 1)$, the activities synchronize. (b) With different starting values $(a^1_0, s^1_0) = (0, 1)$ and $(a^2_0, s^2_0) = (0.2, 1)$, the activities eventually lock antiphase.
Phenomenological criterion. Let $\lambda^\pm$ be the eigenvalues (see equation 4.3) of the tangent mappings $d\Phi^\pm$ restricted to the subspaces $E^-$ and $E^+$. Then, the long-term behavior of the two network activities may be classified by:

1. No fixed point ≠ 0: extinction of both activities.
2. $|\lambda^-| < 1$, $|\lambda^+| < 1$: convergence of both to a constant activity $a_\circ \neq 0$.
3. $|\lambda^-| < 1$, $|\lambda^+| > 1$: stable in-phase oscillation.
4. $|\lambda^-| > 1$, $|\lambda^+| > 1$: typically a stable antiphase oscillation (if $|\lambda^-|$ and $|\lambda^+|$ are not too large, a stable in-phase solution may coexist).

The criterion explains the dynamical behavior in Figure 8b, where a change of the synaptic depression time τ from 4 to 10 and from 10 to 15 evoked the transition from convergence to in-phase oscillation and eventually to antiphase oscillation. For τ = 4 we calculated $|\lambda^-| = 0.939 < 1$ and $|\lambda^+| = 0.950 < 1$ (convergence); for τ = 10 we calculated $|\lambda^-| = 0.995 < 1$ and $|\lambda^+| = 1.005 > 1$ (in-phase oscillation); and for τ = 15 we obtained $|\lambda^-| = 1.011 > 1$ and $|\lambda^+| = 1.021 > 1$ (antiphase oscillation). The regions of the two-dimensional bifurcation diagram for the coupled networks almost coincide with those for the isolated network (Senn et al., 1996). Whenever the activity of the isolated network dies out, converges, or oscillates, so do the activities of the coupled network. The in-phase oscillations occur only in a narrow strip of parameter values ($|\lambda^-| < 1$ and $|\lambda^+| > 1$)
Figure 6: Orbits in the $(a^1, a^2)$-plane for different values of the synaptic depression time. The connectivity parameters are µ = 10 and $\mu_{ij}$ = 0.1. A time plot of the corresponding activities is shown in Figure 8b, and in Figure 4b, the full bifurcation diagram is depicted.
at the borderline between converging activity ($|\lambda^-| < 1$ and $|\lambda^+| < 1$) and antiphase oscillations ($|\lambda^-| > 1$ and $|\lambda^+| > 1$) (cf. Figure 9).

4.3 Estimating the Frequencies. In a first-order approximation, the frequencies f, $f^+$, and $f^-$ of the uncoupled, the in-phase, and the antiphase oscillations can be identified by those of the linearization at the fixed point, that is, by the rotation angles of the mappings $d\Phi^\circ$, $d\Phi^+$, and $d\Phi^-$, respectively. Here, $d\Phi^\circ$ represents the linearization of the uncoupled system with $\mu_{ij} = 0$ at the corresponding fixed point $\bar P_{fix} = (\bar a_\circ, \bar s_\circ)$. Using the abbreviations $\bar\eta = f_K(\mu \bar a_\circ \bar s_\circ)$ and $\eta = f_K((\mu + \mu_{ij}) a_\circ s_\circ)$, we obtain the following relation for the linearized frequencies (cf. section A.3):

Lemma 1.
$$\cos(2\pi f) = \frac{\mu \bar\eta\, \bar s_\circ + \varepsilon_\tau (1 - \varepsilon_\tau \bar a_\circ)}{2\sqrt{\mu \bar\eta\, \varepsilon_\tau (\bar s_\circ + (1 - \varepsilon_\tau) \bar a_\circ)}}$$
and
$$\cos(2\pi f^\pm) = \frac{(\mu \pm \mu_{ij})\, \eta\, s_\circ + \varepsilon_\tau (1 - \varepsilon_\tau a_\circ)}{2\sqrt{(\mu \pm \mu_{ij})\, \eta\, \varepsilon_\tau (s_\circ + (1 - \varepsilon_\tau) a_\circ)}} .$$
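A small helper turns Lemma 1 into numbers. The factor 2 in the denominators follows from $\cos(2\pi f^\pm) = \operatorname{trace}(d\Phi^\pm)/(2|\lambda^\pm|)$ as derived in section A.3; the fixed-point quantities can be taken from the sketch after equation 4.4, and the conversion to hertz via the ∼ 14 ms time step of section 2 is an assumption carried over from the text.

```python
import numpy as np

def linearized_freq(m, eta, eps, a0, s0, step_ms=14.0):
    """Frequency per Lemma 1. Pass m = mu (uncoupled, with the barred
    fixed-point quantities) or m = mu +/- mu_ij (in-phase/antiphase)."""
    c = (m * eta * s0 + eps * (1.0 - eps * a0)) / (
        2.0 * np.sqrt(m * eta * eps * (s0 + (1.0 - eps) * a0)))
    f = np.arccos(c) / (2.0 * np.pi)     # cycles per model time step
    return f * 1000.0 / step_ms          # cycles per second (Hz)
```

Evaluating it for m = µ + µ_ij and m = µ − µ_ij should reproduce the ordering f < f⁺ < f⁻ reported below.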
Figure 7: (a) A quasiperiodic orbit with two frequencies lies on a 2-torus, generated by 5000 iterations of equation 2.4. Notice that the trajectory jumps clockwise from one circle to the next. The small circles correspond to the slow amplitude oscillations. (µ = 10, τ = 11, $\mu_{ij}$ = 0.1.) (b) In a quasiperiodic solution, two or more frequencies are involved at different time scales. The activity $a^1$ is shown during intervals of 20, 600, and 1480 iterations. (µ = 10, τ = 12, $\mu_{ij}$ = 0.1.)
Figure 8: (a) Plotting the oscillations from the first network shown in Figure 7b (solid line) together with the oscillations of the second network (dashed line) reveals a quasiperiodic solution that is phase trapped. (b) While increasing the synaptic depression time τ , the dynamics change from convergence to a constant activity (τ = 4, t = 0, . . . , 0.5 s), to stable in-phase oscillation (τ = 10, t = 0.5, . . . , 0.8 s), and eventually to stable antiphase oscillation (τ = 15, t = 1, . . . , 3 s). The numbers in the figure give the period lengths of the oscillations in seconds. (Other parameter values: µ = 10 and µij = 0.1.)
Figure 9: The four regions in the parameter space (µ, τ) for which the phenomenological criterion in section 4.2 predicts the following behavior: (1) extinction of any activity, (2) convergence to the constant activity $a_\circ$, (3) stable in-phase oscillations, and (4) typically stable alternating oscillations. Superimposed are the two lines corresponding to the running parameter in the bifurcation diagrams of Figure 4, and the circles refer to the parameter values chosen in Figures 5–8. The four different gray levels are determined by evaluating formula 4.3. (Other parameters: K = 0.8, $\mu_{ij}$ = 0.1.)
Evaluating these formulas for $\mu_{ij} = 0.1$ and for the parameters (µ, τ) with values lying in the domain (1 ≤ µ ≤ 25, 1 ≤ τ ≤ 18) of Figure 9, one finds the relation $f < f^+ < f^-$. This confirms the frequency differences we measured in the case that both the in-phase and the antiphase solutions are stable (see Figure 10a): coupling the two oscillators with medium strength increases the frequencies of the oscillators, and the increase is larger for antiphase oscillations than for in-phase oscillations. That the antiphase oscillation is faster than the in-phase oscillation can also be seen by laying the in-phase and antiphase planes $E^+$ and $E^-$ above each other and tracing two points on the invariant in-phase and antiphase curves $C^+$ and $C^-$, respectively (see Figure 10b). After 10 iterations, the point on the antiphase curve is moved by an angle Δθ farther to the right than the corresponding point on the in-phase curve. The validity of the inequality $f^+ < f^-$ is restricted
Figure 10: (a) One (in-phase) solution from Figure 5a (solid curve, frequency $f^+$), one (antiphase) solution from Figure 5b (dashed curve, frequency $f^-$), and the uncoupled solution (dotted curve, frequency f) are plotted for t = 0, . . . , 2.2 s. As predicted by the local fixed-point analysis (section 4), we find $f < f^+ < f^-$. (b) Corresponding plots in the phase plane (a, s). The in-phase oscillations with larger amplitudes correspond to the outer invariant circle $C^+$, while the inner invariant circle $C^-$ corresponds to the antiphase oscillations. After 10 iterations (≈ 0.14 s), the "in-phase" and "antiphase" maps have turned the square and the circle on the horizontal line onto the upper square and circle, respectively. A point on the inner antiphase curve is therefore turned by an angle Δθ "faster" to the right than a point on the outer in-phase curve. This again shows that $f^+ < f^-$.
by the following condition (which, however, is satisfied for the considered parameter values):

Theorem 2. Let $f^+$ and $f^-$ be the frequencies of the linearized in-phase and antiphase solutions. Then for small $\mu_{ij} > 0$, one has $f^+ < f^-$ if and only if $\varepsilon_\tau \frac{1 - a_\circ \varepsilon_\tau}{s_\circ} < \mu\eta$.

We emphasize that the theorem makes a statement about the frequencies of the linearized mapping at the fixed point. The interesting point is that these linearized frequencies fit the relation for the true frequencies in the case of medium-strength coupling (but not necessarily in the case of weak coupling). If we ask how these frequencies change with the parameters µ and τ, we find that the frequencies do not change as long as the ratio µ/τ is constant. Increasing the average number of connections µ within each network accelerates the oscillations by the same factor as does decreasing the synaptic depression time. That this statement holds fairly well can be seen from Figure 4 in Senn et al. (1996), where both frequencies were plotted as functions of the parameters µ and τ. The decrease in frequency when the
synaptic depression time becomes longer can be seen in Figure 8b. The frequencies corresponding to the values τ = 4, 10, 15 are 10 Hz, 7.1 Hz, and 5.6 Hz, respectively. Notice that for τ = 15, the in-phase frequency during the short transition time is 5 Hz, which is indeed smaller than the antiphase frequency of 5.6 (≈ 1/0.18) Hz after the transient. This is exactly what the theorem predicts.

5 Summary and Discussion

We have investigated the oscillation types of two weakly connected time-discrete networks induced by synaptic depression. The parameters that determine the oscillation pattern are not only the coupling strength between the networks but also the degree of connectivity within an individual network and the dynamics of the synaptic depression. By developing a discrete version of the APD theory, we showed that for weak coupling ($\mu_{ij} \sim 10^{-4}\mu$), an increase in the coupling strength can change a stable antiphase oscillation into a stable in-phase oscillation. For medium-strength coupling ($\mu_{ij} \sim 10^{-2}\mu$), we further encountered quasiperiodic oscillations for which the amplitude itself oscillates. In addition, stable in-phase and antiphase solutions may coexist for the same parameter values. A bifurcation analysis with respect to increasing synaptic depression time showed that from a rest point, an in-phase solution emerges through a Naimark-Sacker bifurcation. This in-phase solution evolves to a quasiperiodic solution through a torus bifurcation. The phase lag changes from in-phase to out of phase, and eventually a stable antiphase oscillation establishes itself. From a local fixed-point analysis, we deduced that coupling two oscillating networks generally increases the frequency; in case of bistability, the antiphase frequency is again higher than the in-phase frequency.

The transition from in-phase oscillations to antiphase oscillations is in our case induced either by increasing the synaptic depression time constant or by decreasing the coupling strength between the networks. Roughly, increasing the depression time constant has the same effect on the dynamical behavior as decreasing the coupling strength. This is also true if one considers the change of frequency as a function of either the synaptic depression time or the coupling strength: increasing the synaptic depression time and decreasing the coupling strength between the networks lower the frequency of the linearization at the fixed point (cf. Senn et al., 1996, Figure 4). This supports the suggestion of Kopell (1988) that physiological oscillators always speed up with excitatory input. In our situation, an increase of the depression time constant destabilizes the in-phase oscillations. Destabilization of synchrony is also known to occur for increasing axonal delays (Crook, Ermentrout, Vanier, & Bower, 1997) or when increasing the rise time of the excitatory synaptic connections (van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995). On the other hand, it is well known that for increasing coupling strength
between two oscillators, an antiphase oscillation will turn into an in-phase oscillation (Schillen & König, 1991; Sherman & Rinzel, 1992). This feature is used to train networks with a Hebbian learning rule to perform memory tasks (Horn & Usher, 1991) or to solve the binding problem (Wang & Terman, 1997). Our investigation suggests that these same tasks can be realized by learning the synaptic depression time constant, where instead of a Hebbian increase of the coupling strength, the depression time constant of a synapse showing coincident presynaptic and postsynaptic activity is decreased. Such a network of oscillators based on depressing synapses would have the advantage of allowing for a richer dynamical behavior than a classical oscillator network based on pairs of excitatory and inhibitory cells with static synapses. (Note that one excitatory cell with N depressing synapses projecting onto it would correspond to one excitatory cell with N static synapses together with N inhibitory cells, which would regulate the response of these synapses.) It remains to be explored whether this additional dynamic complexity may improve the capacity of an oscillatory-based associative memory in storing spatiotemporal patterns.

For locomotion, traditional pattern generators are based on the idea that different activity patterns are induced by either tuning the coupling strength between the networks or changing the network configuration through central input (Grillner, 1981; Kopell, 1988). This study, however, suggests that spinal pattern generators may as well consist of unstructured excitatory neural networks characterized solely by the dynamics of their synaptic depression. The fact that different oscillatory behavior can emerge by changing the parameters of the synaptic depression opens the possibility of an activity-dependent self-regulation of the pattern generator. In order to test this hypothesis experimentally, we have started to perform multisite recordings of weakly coupled dissociated spinal networks and hope to be able to modulate the dynamics of synaptic depression at the different sites pharmacologically.
Appendix

A.1 Calculation of the Function H(φ). We assume that the uncoupled system has an invariant closed curve $C_\circ$ with irrational rotation number and first want to transform $C_\circ$ to a circle with constant rotation angle ω (in Figure 10, this curve coincides approximately with $C^+$). According to Denjoy theory, such a transformation is possible if the mapping on the invariant curve is at least twice continuously differentiable (Guckenheimer & Holmes, 1990); it may not be possible if the mapping is less smooth (Herman, 1983). Let $\alpha_t$ be the cumulated angle of an iterated point measured with respect to some zero direction and center $P_{fix}$. Since the mapping on the curve $C_\circ$ is injective and continuous (and thus monotonic), the average rotation angle $\omega = \lim_{t\to\infty} \alpha_t/t$ is well defined, and ω/2π represents the rotation number.
The transform $\alpha_t \stackrel{g}{\mapsto} \theta_t$ of α to the new angle coordinate θ with constant rotation is now defined by
$$\theta_t = g(\alpha_t) = t\omega \bmod 2\pi \qquad (t = 0, 1, 2, \ldots) . \tag{A.1}$$
Numerically this is done by iterating the uncoupled mapping (see equation 2.4) to obtain the list of angles $(\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_N) \bmod 2\pi$ and by assigning this list to the equidistant new angles $(0, \omega, 2\omega, \ldots, N\omega) \bmod 2\pi$. Continuous continuation leads to a coordinate transform $\alpha \,(\bmod\, 2\pi) \leftrightarrow \theta$ of the circle onto itself with the property in equation A.1, which is again continuous and monotonic. All calculations were performed in MATLAB,¹ and to implement the function $F_K$ we used the built-in function gammainc.

Next we consider the two weakly coupled oscillators: the first at phase $\theta^1$ and the second at phase $\theta^2$ in the new angle coordinates. We define the function $h(\theta^1, \theta^2)$ as the deviation from the constant rotation angle ω of the first oscillator for one iteration step, which is caused by the coupling. Thus, setting $\theta^i_t := \theta^i$ for i = 1, 2, we calculate $\theta^1_{t+1}$ in the presence of the coupling via $(\theta^1_t, \theta^2_t) \mapsto (\alpha^1_t, \alpha^2_t) \mapsto (\alpha^1_{t+1}, \alpha^2_{t+1}) \mapsto (\theta^1_{t+1}, \theta^2_{t+1})$ and put $h(\theta^1, \theta^2) := \theta^1_{t+1} - (\theta^1_t + \omega)$ (while taking care of the 2π periodicity). To obtain H(φ), we calculate a list $h(\theta, \theta + \phi)$, $\theta \in [0, 2\pi]$, and average $h(\theta, \theta + \phi)$ over one cycle of θ (cf. equation 3.3).

A.2 The Local Fixed-Point Analysis. We consider the mapping (see equation 2.4) within the diagonalized coordinates (4.1), $\Phi: (a^+_t, s^+_t, a^-_t, s^-_t) \mapsto (a^+_{t+1}, s^+_{t+1}, a^-_{t+1}, s^-_{t+1})$, and calculate its linearization at the fixed point $(P_{fix}, 0)$, $P_{fix} = (a_\circ, s_\circ)$. We make the ansatz
$$a^1_t = a_\circ + \alpha_t, \quad a^2_t = a_\circ \pm \alpha_t, \quad s^1_t = s_\circ + \sigma_t, \quad s^2_t = s_\circ \pm \sigma_t,$$
and consider the mapping $(\alpha_t, \sigma_t) \mapsto (\alpha_{t+1}, \sigma_{t+1})$ in the two cases ±. Writing the derivative of Φ at the fixed point according to equation 4.2, one obtains
$$d\Phi^\pm = \begin{pmatrix} \frac{\partial a^\pm_{t+1}}{\partial a^\pm_t} & \frac{\partial a^\pm_{t+1}}{\partial s^\pm_t} \\[4pt] \frac{\partial s^\pm_{t+1}}{\partial a^\pm_t} & \frac{\partial s^\pm_{t+1}}{\partial s^\pm_t} \end{pmatrix} = \begin{pmatrix} \frac{\partial \alpha_{t+1}}{\partial \alpha_t} & \frac{\partial \alpha_{t+1}}{\partial \sigma_t} \\[4pt] \frac{\partial \sigma_{t+1}}{\partial \alpha_t} & \frac{\partial \sigma_{t+1}}{\partial \sigma_t} \end{pmatrix} = \begin{pmatrix} (\mu \pm \mu_{ij})\,\eta s_\circ & (\mu \pm \mu_{ij})\,\eta a_\circ \\ -\varepsilon_\tau\, d_\tau(1 - s_\circ) & \varepsilon_\tau\, d_\tau(a_\circ) \end{pmatrix}, \tag{A.2}$$
where we used the abbreviations from equations 2.2 and 4.4. Now one makes use of the fact that the eigenvalues of $d\Phi^+$ and $d\Phi^-$ are complex conjugated.

¹ MATLAB, The Language of Technical Computing, MathWorks Inc., Version 2 (1996), http://www.mathworks.com.
This follows from the fact that $d\Phi^\pm$ is composed of a genuine rotation (cf. proof of lemma 2 in Senn et al., 1996). Let us denote the eigenvalues of $d\Phi^+$ and $d\Phi^-$ by $\lambda^+, \bar\lambda^+$ and $\lambda^-, \bar\lambda^-$, respectively. Their modulus can be calculated directly by means of the determinant, without knowing $\lambda^\pm$ explicitly:
$$|\lambda^\pm|^2 = \det\!\left(d\Phi^\pm\right) = (\mu \pm \mu_{ij})\, \eta\, \varepsilon_\tau\, (s_\circ + (1 - \varepsilon_\tau) a_\circ) . \tag{A.3}$$
A.3 Proof of Lemma 1 and Theorem 2. Identifying the frequencies f, $f^+$, and $f^-$ of the uncoupled, in-phase, and antiphase oscillations with those of the corresponding linearization, we have
$$\lambda = |\lambda|\, e^{i 2\pi f}, \qquad \lambda^\pm = |\lambda^\pm|\, e^{i 2\pi f^\pm}, \tag{A.4}$$
where λ, $\lambda^+$, and $\lambda^-$ are eigenvalues of $d\Phi^\circ$, $d\Phi^+$, and $d\Phi^-$, respectively. (We assume that $f, f^\pm > 0$; otherwise we take the complex conjugate of the eigenvalues.) The formula in the lemma for $\cos(2\pi f^\pm)$ is obtained according to
$$\cos(2\pi f^\pm) = \frac{\operatorname{Re}(\lambda^\pm)}{|\lambda^\pm|} = \frac{\operatorname{trace}(d\Phi^\pm)}{2|\lambda^\pm|} = \frac{(\mu \pm \mu_{ij})\,\eta s_\circ + \varepsilon_\tau\, d_\tau(a_\circ)}{2\sqrt{(\mu \pm \mu_{ij})\,\eta\, \varepsilon_\tau\, (s_\circ + (1 - \varepsilon_\tau) a_\circ)}} .$$
The first equality follows from equation A.4. The fact that the real part of $\lambda^\pm$ is half the trace of $d\Phi^\pm$ follows from the fact that $d\Phi^\pm$ is composed of a genuine rotation (Senn et al., 1996) and thus has the complex conjugated eigenvalues $\lambda^\pm$ and $\bar\lambda^\pm$ in its Jordan normal form. The trace of $d\Phi^\pm$ is read from equation A.2, and with equation A.3, we obtain the third equality. The formula for cos(2πf) is obtained in the same way.

To prove theorem 2, we must show the equivalence of $f^+ < f^-$ and $\varepsilon_\tau \frac{1 - a_\circ \varepsilon_\tau}{s_\circ} < \mu\eta$ for small $\mu_{ij}$. According to Lemma 3 in Senn et al. (1996), the period of the uncoupled system is ≥ 4, and therefore the rotation angle ω is less than π/2. Thus, the relation $f^+ < f^-$ is equivalent to $\cos(2\pi f^+) > \cos(2\pi f^-)$, and by the lemma we get
$$\frac{(\mu - \mu_{ij})\,\eta s_\circ + \varepsilon_\tau\, d_\tau(a_\circ)}{\sqrt{1 - \frac{\mu_{ij}}{\mu}}} < \frac{(\mu + \mu_{ij})\,\eta s_\circ + \varepsilon_\tau\, d_\tau(a_\circ)}{\sqrt{1 + \frac{\mu_{ij}}{\mu}}} .$$
If we apply the binomial expansion
$$\frac{1}{\sqrt{1 \pm \frac{\mu_{ij}}{\mu}}} = 1 \mp \frac{\mu_{ij}}{2\mu} + O\!\left(\left(\frac{\mu_{ij}}{\mu}\right)^2\right),$$
the zeroth-order terms cancel and, neglecting higher-order terms, the relation transforms after some algebraic manipulations to $\varepsilon_\tau\, d_\tau(a_\circ) < \mu\eta s_\circ$.
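The bookkeeping in this last step can be checked symbolically. The following sympy sketch uses shorthand of my own choosing, $A = \eta s_\circ$ and $B = \varepsilon_\tau d_\tau(a_\circ)$ (both positive), and drops the positive factor common to both sides; it expands the difference $\cos(2\pi f^+) - \cos(2\pi f^-)$ to first order in $\mu_{ij}$.

```python
import sympy as sp

mu, x, A, B = sp.symbols('mu mu_ij A B', positive=True)

# cos(2*pi*f_pm) is proportional to ((mu +/- mu_ij)*A + B)/sqrt(mu +/- mu_ij),
# with A = eta*s0 > 0 and B = eps_tau*d_tau(a0) > 0 (common factor dropped).
cos_p = ((mu + x) * A + B) / sp.sqrt(mu + x)
cos_m = ((mu - x) * A + B) / sp.sqrt(mu - x)

lead = sp.simplify(sp.series(cos_p - cos_m, x, 0, 2).removeO())
print(lead)   # expected leading term: mu_ij*(A*mu - B)/mu**(3/2)
```

The leading term is positive exactly when $\mu A > B$, that is, when $\mu\eta s_\circ > \varepsilon_\tau d_\tau(a_\circ)$, which is the condition of theorem 2.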
Acknowledgments

W. S. and K. W. were supported by Swiss National Science Foundation grant 5002-03793 (Schwerpunktprogramm Biotechnologie), and J. K. was supported by grant 31-42055.94. We thank H.-P. Clamann for helpful comments and for correcting the English.

References

Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Arnold, V. I. (1965). Small denominators, I: Mappings of the circumference onto itself. Am. Math. Soc. (AMS) Transl. Ser. 2, 46, 213–284.
Collet, P., & Eckmann, J.-P. (1986). Iterated maps on the interval as dynamical systems (2nd ed.). Basel: Birkhäuser Verlag.
Crook, S., Ermentrout, G., Vanier, M., & Bower, J. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. J. Comput. Neuroscience, 4, 161–172.
Ermentrout, G., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Math. Biol., 29, 195–217.
Gao, B., & Ziskind-Conhaim, L. (1995). Development of glycine- and GABA-gated currents in rat spinal motoneurons. J. Neurophysiol., 74, 113–121.
Gerstner, W., Ritz, R., & van Hemmen, J. (1993). A biologically motivated and analytically soluble model of collective oscillations in the cortex. I. Theory of weak locking. Biol. Cybernetics, 68, 363–374.
Getting, P. (1989). Emerging principles governing the operation of neural networks. Ann. Rev. Neurosci., 12, 184–204.
Grillner, S. (1981). Control of locomotion in bipeds, tetrapods, and fish. In V. B. Brooks (Ed.), Handbook of physiology, section 1: The nervous system (Vol. 2, pp. 1179–1236). Bethesda, MD: American Physiological Society.
Guckenheimer, J., & Holmes, P. (1990). Nonlinear oscillations, dynamical systems, and bifurcations of vector fields (3rd ed.). New York: Springer-Verlag.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Herman, M. (1983). Sur les courbes invariantes par les difféomorphismes de l'anneau. Astérisque, 103/104.
Hlawka, E. (1979). Theorie der Gleichverteilung. Mannheim: Bibliographisches Institut.
Horn, D., & Usher, M. (1991). Parallel activation of memories in an oscillatory neural network. Neural Computation, 3, 31–43.
Kawato, M., & Suzuki, R. (1980). Two coupled neural oscillators as a model of the circadian pacemaker. J. Theor. Biol., 86, 547–575.
Kopell, N. (1988). Toward a theory of modelling central pattern generators. In A. H. Cohen, S. Rossignol, & S. Grillner (Eds.), Neural control of rhythmic movements in vertebrates (pp. 369–413). New York: Wiley.
Kronauer, R., Czeisler, C., Pilato, S., Moore-Ede, M., & Weitzman, E. (1982). Mathematical model of the human circadian system with two interacting oscillators. Am. J. Physiol., 242, R3–R17.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Rinzel, J., & Ermentrout, B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks. Cambridge, MA: MIT Press.
Schillen, T., & König, P. (1991). Stimulus-dependent assembly formation of oscillatory responses II: Desynchronization. Neural Computation, 3, 167–178.
Senn, W., Wyler, K., Streit, J., Larkum, M., Lüscher, H.-R., Mey, H., Müller, L., Steinhauser, D., Vogt, K., & Wannier, T. (1996). Dynamics of a random neural network with synaptic depression. Neural Networks, 9(4), 575–588.
Sherman, A., & Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. USA, 89, 2471–2474.
Streit, J. (1993). Regular oscillations of synaptic activity in spinal networks in vitro. J. Neurophysiol., 70, 871–878.
Streit, J., & Senn, W. (1997). The generation of oscillatory activity in simulated random excitatory neuronal networks with synaptic depression. Manuscript submitted for publication.
Streit, J., Spenger, C., & Lüscher, H.-R. (1991). An organotypic spinal cord–dorsal root ganglia–skeletal muscle coculture of embryonic rat. II. Functional evidence for the formation of spinal reflex arcs in vitro. Europ. J. Neurosci., 3, 1054–1068.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neuroscience, 1, 313–322.
Wang, D., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Computation, 9, 805–836.
Wever, R. (1972). Virtual synchronization towards the limits of the range of entrainment. J. Theor. Biol., 36, 119–132.
Williams, T., & Bowtell, G. (1997). The calculation of frequency-shift functions for chains of coupled oscillators, with application to a network of the lamprey locomotor pattern generator. J. Comput. Neuroscience, 4, 47–55.

Received August 29, 1996; accepted November 3, 1997.
LETTER

Communicated by Eytan Ruppin

Computational Studies of Lateralization of Phoneme Sequence Generation

James A. Reggia
Sharon Goodall
Yuri Shkuro
Departments of Computer Science and Neurology, Institute of Advanced Computer Studies, University of Maryland, College Park, MD 20742, U.S.A.
The mechanisms underlying cerebral lateralization of language are poorly understood. Asymmetries in the size of hemispheric regions and other factors have been suggested as possible underlying causal factors, and the corpus callosum (interhemispheric connections) has also been postulated to play a role. To examine these issues, we created a neural model consisting of paired cerebral hemispheric regions interacting via the corpus callosum. The model was trained to generate the correct sequence of phonemes for 50 monosyllabic words (simulated reading aloud) under a variety of assumptions about hemispheric asymmetries and callosal effects. After training, the ability of the full model and each hemisphere acting alone to perform this task was measured. Lateralization occurred readily toward the side having larger size, higher excitability, or higher learning-rate parameter. Lateralization appeared most readily and intensely with strongly inhibitory callosal connections, supporting past arguments that the effective functionality of the corpus callosum is inhibitory. Many of the results are interpretable as the outcome of a "race to learn" between the model's two hemispheric regions, leading to the concept that asymmetric hemispheric plasticity is a critical common causative factor in lateralization. To our knowledge, this is the first computational model to demonstrate spontaneous lateralization of function, and it suggests that such models can be useful for understanding the mechanisms of cerebral lateralization.

Neural Computation 10, 1277–1297 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

A great number of functional cerebral asymmetries are currently recognized to exist in humans (Hellige, 1993; Kinsbourne, 1978; Springer & Deutsch, 1993). These cognitive and behavioral lateralizations include language, handedness, visuospatial processing, emotion and its facial expression, olfaction, and attention. For example, over 90% of the population has dominance of the left cerebral hemisphere for language. Much of the evidence for known qualitative and quantitative asymmetries comes from studies of deficits
in stroke and trauma patients, from special procedures applied to normal subjects and "split-brain" patients following commissurotomy (Gazzaniga, 1987), and from functional imaging studies of higher cortical functions and perceptual tasks (Mazziotta & Phelps, 1984; Peterson, Fox, Posner, Mintun, & Raichle, 1988; Reivich et al., 1984). There is also substantial plasticity of the brain with respect to functional asymmetries. For example, left hemispherectomy in infants or even children as old as 9 years of age can result in the right hemisphere's becoming fairly skilled in language functions (Dennis & Whitaker, 1976; Vargha-Khadem et al., 1997). Lateralization of various functional brain asymmetries (motor preference, vocalizations, perception, etc.) is not uniquely human; it has been found repeatedly in numerous studies with rodents, birds, and primates (Hauser, 1993; Hellige, 1993; Ward & Hopkins, 1993).

Although such hemispheric specializations are well documented, the underlying causes of hemispheric function lateralization are not well understood and have been the subject of intense scientific investigation for over a century. Perhaps the most widely accepted theory is that hemispheric anatomical asymmetries are a critical factor in function lateralization (Galaburda & Habib, 1987; Geschwind & Galaburda, 1987; Hellige, 1993). Anatomic and cytoarchitectonic asymmetries include, for example, a larger left temporal plane in 65% of subjects, an asymmetry that is present in newborns (Galaburda, Sanides, & Geschwind, 1978; Geschwind & Levitsky, 1968; Loftus et al., 1993; Witelson & Pallie, 1973). However, other significant asymmetries exist in the hemispheres. For example, higher-order dendritic branching is greater in speech areas of the left hemisphere (Scheibel, 1985), and evidence exists that there is more gray matter relative to white matter in the left hemisphere than in the right (Gur et al., 1980). Important neurotransmitters such as dopamine and norepinephrine are asymmetrically distributed between the hemispheres (Tucker & Williamson, 1984), while the threshold for motor-evoked potentials is lower for the left hemisphere (Macdonell et al., 1991). Substantial arguments have been advanced that it is unlikely that a single underlying hemispheric asymmetry will be able to account for language and other hemispheric specializations (Hellige, 1993).

Besides intrinsic hemispheric differences, another potential factor in function lateralization is hemispheric interactions via pathways such as the corpus callosum (Zaidel, 1983). Corpus callosum fibers are mostly homotopic; each hemisphere projects to the other in a topographic fashion so that roughly mirror-symmetric points are connected to each other (Innocenti, 1986; Pandya & Seltzer, 1986). Most neurons sending axons through the corpus callosum are pyramidal cells, and these synapse mainly on contralateral pyramidal cells over a cortical region of diameter up to 2 mm (Hartenstein & Innocenti, 1981; Innocenti, 1986). Such cellular components, as well as transcallosal diaschisis and split-brain experiments, suggest that transcallosal interhemispheric interactions are mainly excitatory in nature (Berlucchi, 1983), but this hypothesis is quite controversial (Denenberg, 1983).
Transcallosal monosynaptic excitatory postsynaptic potentials are subthreshold and of low amplitude, and are followed by stronger, more prolonged inhibition (Toyama, Tokashiki, & Matsunami, 1969), suggesting to some that transcallosal inhibitory interactions are much more important (Cook, 1986; Kinsbourne, 1978). The case for transcallosal inhibition has been strengthened recently by transcranial magnetic stimulation studies indicating that activation of one primary motor cortex inhibits the contralateral one (Ferbert et al., 1992; Meyer, Röricht, von Einseidel, Kruggel, & Weindl, 1995).

In spite of this wealth of information about hemispheric asymmetries and interactions, it is not yet clear what physiological mechanisms are responsible for individual or population lateralization of function (Reggia, 1995). While it is often argued that anatomical asymmetries underlie functional asymmetries such as language lateralization to the (usually) left hemisphere (Geschwind & Galaburda, 1987), the mechanisms by which this might occur are not known. Further, it is difficult to see how such anatomical asymmetries, favoring the left in roughly 65% of the population, could account for left-hemisphere dominance for language in over 90% of the population.

It seems likely that computational modeling could be useful for refining, extending, and examining the plausibility of various hypotheses concerning the mechanisms of hemispheric specialization. Although a few neural models of interacting hemispheric regions have been studied previously (Anninos, Argyrakis, & Skouras, 1984; Cook & Beech, 1990; Ringo et al., 1994), none of these past studies has directly looked at how lateralization of functions can arise spontaneously. In this article, our goal is to use computational modeling to investigate possible causes for lateralization of a single task. We do this by studying a recurrently connected neural model consisting of two simulated hemispheric regions interacting via a corpus callosum. In a series of simulations, the model was repeatedly trained to generate a temporal sequence of outputs when given a fixed input pattern (a simple caricature of single-word reading aloud). Each time the model was trained, we varied its assumptions about hemispheric asymmetry and callosal excitatory and inhibitory strength, determining which conditions resulted in function lateralization. Our initial hypothesis was that each of the hemispheric asymmetries introduced into the model would produce functional lateralization when and only when inhibitory interhemispheric interactions were present. The actual situation turned out to be more interesting. Although functional lateralization usually occurred with underlying hemispheric asymmetries if the callosal connections were inhibitory, under certain conditions it would also develop with excitatory callosal connections. Further, lateralization secondary to hemispheric size asymmetry was found to have multiple causes. Many of the results we observed can be interpreted as a "race to learn" between the model's two hemispheric regions, supporting the concept that asymmetric hemispheric plasticity is a common causative factor in lateralization.
2 Methods

The neural model is trained to take three-letter words (CAD, MOP, SIT, etc.) as input and to produce the correct temporal sequence of phonemes for the pronunciation of each word as output. For example, given the single fixed input pattern MAT, the trained model’s output goes through a sequence of three states representing the phonemes for M, A, and T, one at a time. The 50 words used as inputs are listed in the appendix.

2.1 The Model. The architecture of the model is summarized schematically in Figure 1. While this model captures the direct transformation of input to output suggested by functional imaging studies of single-word reading (Petersen et al., 1988), the intent is not to create a veridical model of underlying neocortical structures. Rather, our goal is to represent the functionality of two interacting, recurrently connected pathways (left and right) as sequential output is generated. The model’s input elements (I) are divided into three groups, each corresponding to the possible input characters at one of the three input character positions. Input elements are fully connected to two sets of neural elements representing corresponding regions of the left (LH) and right (RH) hemisphere cortex. These regions are fully connected to each other via a simulated corpus callosum (CC), and they are also fully connected to a set of output neural elements (O) representing individual phonemes. A set of state elements (S) provides delayed feedback to the hemispheric regions via recurrent connections, similar to those used in Jordan (1986). These recurrent feedback connections are motivated in part by the recurrent or “backward” neuroanatomic connections between cortical regions (Felleman & Van Essen, 1991). The two hemispheric regions are taken to represent roughly mirror-image cortical regions, consistent with the fact that such regions are generally specialized for the same or a similar function (Heilman & Valenstein, 1979; Kupfermann, 1991). During simulations, activation patterns representing written words are clamped on the input elements and held fixed while the network generates a sequence of outputs, as follows. With the onset of input, activation propagates forward to the two hemispheric regions, and then to the output elements, where ideally only the single element representing the correct first phoneme of the input word is activated. This output pattern activates the set of state elements (S), and then the network recomputes its output, generating the second phoneme in the word’s pronunciation sequence. In doing this, the activation levels of the elements in the hemispheric layers are now determined not only by the input activation pattern but also by the activation pattern of the state elements. The state elements provide one time unit of delay between output activity and its feedback via recurrent connections to the hemispheric regions. This process repeats, and a third output phoneme is produced.
Figure 1: Model architecture. Individual neural elements are indicated by small circles; sets of related elements are indicated by boxes. I = inputs, O = outputs, LH/RH = left/right hemisphere regions, CC = corpus callosum, S = state elements.
The activation level a_i of each neural element i is set to zero at the beginning of presenting each input. The activation level of hemispheric element i is governed by

$$\frac{da_i}{dt} = -a_i + M\sigma(h_i), \qquad (2.1)$$

where M is a constant maximum activation level, h_i represents the linear weighted sum of input activation to element i from inputs I, state elements S, and the opposite hemispheric region, and σ is the logistic function σ(x) = (1 + e^{−x})^{−1}. At equilibrium a_i = Mσ(h_i), so each hemispheric element effectively computes a sigmoid function of its input. The reason for iteratively computing a_i (25 iterations, 0.1 time step, Euler method), rather than just assigning it a value Mσ(h_i), is to allow the hemispheric regions time to influence each other during a simulation via the corpus callosum. Output element activation levels are simply computed directly as a_i = σ(h_i). State element i is assigned the value of the corresponding output element plus a fraction of its previous value, that is,

$$a_i^s = \mu a_i^s + a_i^o, \qquad (2.2)$$

where superscript s designates a state element and o designates an output element.
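The dynamics just described are compact enough to state in code. The following Python/NumPy sketch is our illustration of equations 2.1 and 2.2 (the authors’ own software, per section 2.2, was written in C); all names are ours, and details the text handles separately, such as bias units and the self-connection used when c > 0, are omitted.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def settle_and_output(inp, state, W_in, W_state, W_cc, W_out,
                      M=1.0, n_iter=25, dt=0.1):
    """Produce one output phoneme given a clamped input pattern.

    W_in, W_state, W_out are dicts keyed by hemisphere ('L', 'R');
    W_cc['L'] holds callosal weights from RH into LH, and vice versa.
    """
    a = {s: np.zeros(W_in[s].shape[0]) for s in ('L', 'R')}
    # Euler integration of da_i/dt = -a_i + M*sigma(h_i) (equation 2.1),
    # so the two regions can influence each other via the corpus callosum.
    for _ in range(n_iter):
        h = {s: W_in[s] @ inp + W_state[s] @ state + W_cc[s] @ a[o]
             for s, o in (('L', 'R'), ('R', 'L'))}
        for s in ('L', 'R'):
            a[s] += dt * (-a[s] + M * logistic(h[s]))
    # Output elements are computed directly as a_i = sigma(h_i).
    out = logistic(W_out['L'] @ a['L'] + W_out['R'] @ a['R'])
    return a, out

def update_state(state, out, mu=0.5):
    # Equation 2.2: a_i^s = mu * a_i^s + a_i^o (one time unit of delay).
    return mu * state + out
```

A word’s pronunciation would then be generated by calling settle_and_output three times, passing each output through update_state between steps.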
The supervised learning rule used in the model is a variant of recurrent error backpropagation specifically designed for networks transforming a single, fixed input pattern into a sequence of outputs (Jordan, 1986). The equations governing learning are briefly stated in the appendix. Learning occurs on all connections pictured in Figure 1 except those forming the corpus callosum (CC), and also on connections from a single bias unit to hemispheric and output elements (not shown in Figure 1). Learning was incremental, with weight updates occurring after each individual output. Weights modified by learning are initialized with uniform random real values between −1 and +1. Callosal weights are uniform, usually all having a constant value c during any given simulation. When c > 0, positive feedback via the excitatory callosal connections can excessively activate hemispheric elements (i.e., all or most elements nearly maximally activated, greatly slowing learning). To avoid this, hemispheric elements have a “self-connection” that contributes −2c(1 + 2.5a_i) to h_i when c > 0.

2.2 Experimental Methods. Using the model described, we undertook a series of simulations in which the assumptions about hemispheric asymmetries and the excitatory and inhibitory effects of the corpus callosum were systematically altered with each simulation. Five hemispheric asymmetries were examined: relative size, maximum activation level, sensitivity to input stimuli, learning-rate parameter, and amount of feedback. Asymmetries were examined one at a time in isolation. In addition, a model with symmetric hemispheric regions was used as a control. In most simulations, each hemispheric region was sufficiently large (six or more elements) that it could independently learn the input-output mapping. This was motivated by experimental evidence that either human hemisphere alone can acquire language (Dennis & Whitaker, 1976; Vargha-Khadem et al., 1997), and by the fact that function lateralization would trivially occur if one hemispheric region were too small to learn the mapping. For each hemispheric asymmetry examined and the control model, the uniform value c of callosal weights was varied over 17 values between −3.0 and +3.0. In this manner, over a thousand simulations were done, each involving training the model and then testing its performance. Training data consisted of 50 three-letter words with their associated pronunciation sequence. These words were designed so that any single output phoneme alone could never unambiguously predict the subsequent phoneme (e.g., an L in the first position could be followed by an A, I, or O). Thus, the feedback from state elements to hemispheric regions alone could not predict uniquely the subsequent correct output state. The baseline parameters, used in all simulations described below unless explicitly noted otherwise, are M = 1.0, μ = 0.5, a learning rate of 0.05, and 10 elements per hemispheric region. All software is implemented in C, and most simulations were run on DEC Alphas. In each simulation, model performance was measured as root mean square error E. Training consisted of repeated passes through the training data until either E was reduced to 0.05 or 10,000 passes through the
data occurred. Error was measured prior to training and under three conditions after training: with both hemispheric regions connected to outputs (E), and with each of the left and right hemisphere regions alone connected to outputs (E_L and E_R, respectively). When only one hemispheric region was connected to the outputs, the other still had an indirect, albeit limited, influence on output via the corpus callosum. Lateralization was measured using an asymmetry coefficient ρ_c (Lezak, 1995), specifically:

$$\rho_c = \frac{E_L - E_R}{1 - \frac{1}{2}(E_L + E_R)}, \qquad (2.3)$$

where c indicates corpus callosum strength. Negative values of ρ_c indicate left lateralization of function, positive values indicate right lateralization, and for the specific simulations done here, ρ_c = ±0.6 roughly corresponds to maximal or complete lateralization. We use ρ̄ (ρ̄+, ρ̄−) to designate the mean value of ρ_c for all c (for c ≥ 1, c ≤ −1, respectively).
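As a concrete illustration, the asymmetry coefficient is a one-line computation; the helper below is our own Python sketch (not the authors’ code) and reproduces the example values reported for Figure 2a.

```python
def asymmetry_coefficient(e_left, e_right):
    """Equation 2.3: rho_c from single-hemisphere errors E_L and E_R.

    Negative values indicate left lateralization, positive values right;
    rho_c of roughly +/-0.6 corresponds to complete lateralization here.
    """
    return (e_left - e_right) / (1.0 - 0.5 * (e_left + e_right))

# Example values like those reported for Figure 2a at c = -1.5:
print(asymmetry_coefficient(0.16, 0.32))  # approximately -0.21
```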
3 Results

3.1 Symmetrical Hemispheric Regions. Simulations were done with a symmetrical, control version of the model in which both hemispheric regions were identical except for random differences in initial connection weights. Figure 2a shows a representative example of the results using the baseline parameters (see above) but different random initial weights on each run. Lateralization is typically more pronounced for negative callosal strengths (mean |ρ| of 0.21 for c ≤ −1) than for positive callosal strengths (mean |ρ| of 0.06 for c ≥ 1). However, the direction of lateralization is arbitrary (ρ̄ = 0.0001, ρ̄− = −0.01, ρ̄+ = 0.01). Similar results occur with other size hemispheric regions and other random weight sets. Lateralization in this case is due to initial random asymmetries in weight values. These weight asymmetries, when accentuated by the iterative process by which hemispheric activation values are determined, can lead to significant pretraining mean activation asymmetries, especially with inhibitory callosal connections, as shown in Figure 2b. Comparing Figures 2b and 2a shows that these initial activation-level asymmetries are predictive of the direction of lateralization with inhibitory callosal connections. With backpropagation learning, higher mean hemispheric activation levels closer to 0.5 on one side cause faster weight changes on that side for both the incoming and outgoing connections of hemispheric elements. Thus, the more highly activated side of the model learns the task more quickly. Once one hemispheric region has largely learned to generate the correct outputs, the output error values drop to near zero, and weight changes throughout the entire model effectively end. The opposite hemispheric region is therefore never driven to learn an adequate set of weights.
Figure 2: Symmetric control model. (a) Error versus corpus callosum weights. The top dashed line indicates pretraining error; the thick line at the bottom indicates posttraining error, with both RH and LH jointly determining model output. The middle lines between .15 and .35 are the error when the left (dashed line) or right (solid line) hemispheric region alone sends activation to output elements. For example, when all callosal connections had an inhibitory weight of −1.5, the left hemispheric region (EL ≈ .16) was more accurate than the right hemispheric region (ER ≈ .32) in generating correct outputs, giving a substantial asymmetry coefficient value ρ−1.5 = −0.21. (b) Corresponding mean pretraining activation levels for left (dashed line) and right (solid line) hemispheric regions.
It is possible to find random seeds for which, by chance, pretraining mean activation levels are essentially equal in the two hemispheric regions for all callosal strengths. If one uses such a seed in all simulations (unlike in Figure 2a, where a different random seed was used in each simulation), significant lateralization does not occur with the symmetrical model for any callosal strengths. However, even with arbitrary seeds, the lateralization
occurring in the symmetric model is limited in magnitude relative to that seen when various asymmetries are introduced, as described below. 3.2 Asymmetric Hemisphere Sizes. Larger left hemisphere language regions are often suggested as a causative factor for left language lateralization (Geschwind & Galaburda, 1987; Hellige, 1993). To examine this, we studied simulations where hemispheric sizes |LH| > |RH|, that is, models with more left hemispheric elements. Figure 3a shows a representative set of results with |LH| = 14 and |RH| = 10. Consistent lateralization to the left occurs for c ≤ −0.5, and the amount of lateralization tends to increase as c becomes increasingly negative (e.g., in Figure 3a, ρ−.5 = −0.15, ρ−1 = −0.39, ρ−2 = −0.57). Lateralization also increases with increasing ratio of hemispheric sizes |LH| / |RH|. For example, ρ−1 is −0.21 at 14/12, −0.39 at 14/10, and −0.49 at 14/8. While lateralization to the larger side is a robust finding with this version of the model, it is associated with increased pretraining mean hemispheric activation levels on the larger side for negative callosal strengths. This occurs because the larger hemispheric region more effectively inhibits the smaller when there are equally strong inhibitory callosal connections to both hemispheres. As the control model showed, an asymmetry in pretraining activation levels by itself can lead to lateralization. To control for this factor, we modified the model so that the uniform strength of inhibitory left-to-right callosal connections could differ from that of right-to-left connections. In this case, the strength of each left-to-right callosal connection was taken as 10c/|LH|, and each right-to-left callosal connection 10c/|RH|. This tended to produce pretraining hemispheric activation levels that were approximately the same. With this modification, consistent lateralization to the larger hemispheric region with negative callosal strengths largely disappeared in many cases. For example, this effect can be seen in Figure 3b, where |LH|/|RH| = 14/10 as in Figure 3a, but now symmetric pretraining mean hemispheric activation is present. Some lateralization still occurs here for negative callosal strengths, but its direction is arbitrary (ρ¯ − = −0.01). Mild but inconsistent lateralization still occurred with positive callosal strengths in this case (e.g., ρ¯ + = −0.07 in Figure 3b). Even when the ratio |LH|/|RH| is quite large (e.g., 24/6), in general less lateralization occurred. Figure 4a shows an example where some lateralization did occur, favoring the left side, mainly in the range −0.5 ≤ c ≤ 1.0 (i.e., when callosal connections are relatively weak). More consistent and pronounced lateralization for all values of c could still be elicited when the number of hemispheric elements on the smaller side was so low that the smaller side could not learn the task in isolation, for example, |LH|/|RH| = 20/4. In section 4, we will consider the explanation that it is the relative speed at which a hemisphere learns that is often the critical factor in determining whether lateralization occurs. In that context, the results in Figure 4b are very
Figure 3: Error versus callosal strength with larger left (14 elements) than right (10 elements) hemispheric regions. Same notation as Figure 2a. (a) Unadjusted. (b) Adjusted to equilibrate pretraining mean activation values.
important. They show that for a single hemispheric region operating alone, the rate at which learning occurs increases as the number of hemispheric elements increases.

3.3 Asymmetric Cortical Excitability. Asymmetric cortical excitability has been associated experimentally with functional lateralization (e.g., Macdonell et al., 1991) and may also be implied by regional asymmetries in various neurotransmitter levels (Tucker & Williamson, 1984). Asymmetric excitability in the model should lead to relatively higher mean pretraining
Figure 4: (a) Error versus callosal strength for asymmetric size cortical regions after adjusting to equilibrate pretraining mean activation levels. Same notation as in Figure 2a; 20 elements on the left, 6 on the right (ρ̄ = −0.10). (b) Time for a single hemispheric region to learn this task versus number of hemispheric elements.
activation levels in the more excitable hemispheric region, and this would be expected to result in better independent performance by that hemispheric region. We modeled asymmetric cortical excitability in two ways. First, we let the maximum activation level M in equation 2.1 be asymmetric. For example, keeping ML = 1.0 on the left and considering progressively lower values of MR on the right, we found significant lateralization toward the left first occurred for simulations with callosal strengths c < 0 at about MR = 0.8.
At MR = 0.7, lateralization increased and became present for all callosal strengths, although it was progressively more pronounced for increasingly inhibitory callosal connections (see Figure 5a). Lower and lower MR led to progressively increased lateralization, reaching the point where the left hemispheric region alone was almost always as good as the two hemispheric regions together when MR = 0.3 (see Figure 5b). This progressive lateralization with increasing M asymmetry can be summarized by plotting ρ̄+ and ρ̄− for various MR values (see Figure 5c). The second way asymmetric cortical excitability was produced was by allowing hemispheric sensitivity (i.e., input gain) to input received from input elements and state elements to differ on the two sides. This produced results similar to those seen with asymmetric maximum activation levels. As the ratio of left to right input sensitivity progressively increased, lateralization to the left first appeared with negative callosal strengths, being more pronounced with increasingly strong negative callosal strengths. With greater and greater asymmetry in input sensitivity ratio, this trend became increasingly more pronounced and was eventually accompanied by left lateralization of lesser magnitudes even when callosal strengths were positive.

3.4 Lateralization Without Asymmetric Pretraining Mean Activation Levels. Another possible cause of functional lateralization is asymmetric synaptic plasticity, which might also be implied by regional asymmetries in various neurotransmitters. If one hemispheric region is more modifiable and better able to learn new information, then one could hypothesize that it would assume a greater portion of a task. We simulated asymmetric synaptic plasticity by allowing one hemisphere to have a larger learning-rate parameter than the other. In this situation, functional lateralization tended to occur readily and consistently toward the side with the larger learning-rate parameter. This effect increased with increasing learning-rate parameter asymmetry. Unlike the asymmetries already described, lateralization was largely independent of whether the corpus callosum was excitatory or inhibitory. Figure 6a illustrates these points, where the learning-rate parameters are 0.05 (left) and 0.02 (right), and there is consistent lateralization to the left hemisphere regardless of the sign of c (ρ̄− = −0.29, ρ̄+ = −0.28). Pretraining mean activation levels of the hemispheric regions had only a secondary, modulating effect on this lateralization. For example, lateralization to the left still occurs at c = −2.5 in this case (see Figure 6a; ρ−2.5 = −.23) in spite of higher pretraining activation levels in the right hemisphere when c = −2.5 (see Figure 6b). Asymmetric feedback to the hemispheric regions (via state elements; see Figure 1) might also be conjectured to lead to lateralization. However, this was difficult to elicit with our model unless the feedback to one side was so small as to preclude that side from learning the task at all. For example, when feedback activation from state elements to the right
Figure 5: Error versus callosal strength with ML = 1.0, but (a) MR = 0.7, or (b) MR = 0.3. (c) Values of ρ̄+ (solid line) and ρ̄− (dashed line) for various MR values when ML = 1.0.
Figure 6: Effects of asymmetric learning-rate parameters of 0.05 on the left and 0.02 on the right. (a) Error versus callosal strength; same notation as Figure 2a. (b) Mean pretraining activation levels for left (dashed line) and right (solid line) hemispheric regions.
hemispheric region was scaled to be 0.2 of that on the left, very mild but consistent lateralization to the left could occur. Independent simulations with a single hemispheric region showed that the number of training epochs required to learn the task was inversely related to the feedback level (e.g., 950 epochs for baseline feedback level versus 1500 epochs for 0.2 scaled feedback).
4 Discussion

While there have been many previous neural models of cerebral cortex, to our knowledge only three have examined aspects of hemispheric interactions. The earliest of these demonstrated that oscillatory activity in one simulated hemisphere could be transferred to the other hemisphere via interhemispheric connections (Anninos et al., 1984; Anninos & Cook, 1988). A second model established that inhibitory callosal connections produce slower convergence and different activity patterns in the two simulated hemispheres (Cook & Beech, 1990). Finally, a pair of recurrent error backpropagation networks were trained to learn a small set of input-output associations simultaneously, and it was shown that slow “interhemispheric” connections were not critical for short output times (Ringo et al., 1994). None of these previous neural models of hemispheric interactions examined lateralization of functionality through synaptic weight changes (learning). In addition, there has been one symbol-processing model of interacting hemispheres (van Kleek & Kosslyn, 1991). This latter study is not a neural model; it represents the cognitive processes in high-level vision as numerous complex modules. This algorithmic model assumes a priori the lateralization of speech and other functions and then examines how such preexisting lateralizations could cause lateralization of other functions. This work does not demonstrate “spontaneous” lateralization of functionality (i.e., it starts with assumed lateralizations of functions). The work described here is also related to past studies involving the gating of multiple performance networks, such as mixture of experts architectures (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994) and modular neural networks (Ronco & Gawthrop, 1997). The architecture of these models typically includes two or more expert networks analogous to our hemispheric regions. Unlike our model and other past models incorporating callosal connections (Anninos et al., 1984; Cook & Beech, 1990; Ringo et al., 1994), the expert networks in mixture-of-experts models have their output multiplicatively filtered by a gating network. Such gating networks have not been intended as a model of the corpus callosum, and it is difficult to see how they could be considered as such. Typically the gating networks directly receive and process input patterns through their own separate weighted network, unlike the real corpus callosum. Neither expert network (hemisphere) determines the activation levels of neurons in the gating network, in contrast to our model and the biological corpus callosum. Thus, the expert networks do not directly influence each other as occurs biologically, nor do they do so indirectly by activating the gating network. In addition, the output connectivity of a gating network would suggest the existence of callosal fiber synapses on cortical neuron axons or subcortical structures, something that appears to be inconsistent with experimental data showing that callosal fibers synapse primarily (about 97%) on cortical dendrites and dendritic spines (White, 1989). Finally, in mixture-of-expert
models the different networks become experts on different domains of the function being learned, whereas in our model the direct interhemispheric interactions generally produce a network (hemisphere) that is better than the other over all of the function domain, capturing more closely the essence of hemispheric dominance. While gated architectures have been applied to neuropsychological questions recently (Dailey, Cottrell, & Padgett, 1997), they have not been used to study the specific issues of function lateralization or callosal functionality. Thus, to our knowledge, the research reported here is the first systematic study of computer-simulated emergence of function lateralization. Our simulations show that within the limitations of the model we studied, it is easy to produce lateralization of function with several hemispheric asymmetries. Lateralization tended to occur most readily when callosal connections were inhibitory, supporting previous arguments that the primary functionality of the corpus callosum is inhibitory. However, when hemispheric asymmetries were sufficiently pronounced or directly affected learning rate, some lateralization could occur with all callosal strengths. For example, almost complete lateralization occurred even with strongly excitatory callosal connections when the hemispheric regions were asymmetrically excitable or had asymmetric learning rates. Many of our results can be intuitively understood if one views learning in the model as a “race” between the two hemispheric regions to acquire an input-output mapping via weight changes. The “race-to-learn” ends when the model as a whole has acquired the mapping and output error values approximate zero. In this situation, weight changes stop. Thus, any factor that increases the learning speed on one side of the model causes that side to learn the input-output associations better than the other side, leading to function lateralization. Specifically, asymmetries in such different factors as initial random weights, hemispheric size, maximum activation level, and input sensitivity all could lead to asymmetric pretraining activation levels when callosal strengths were negative, and via this to more rapid learning on the eventually dominant side. Asymmetries in hemisphere size and feedback intensity were found to be associated with asymmetric learning speeds, and these could lead to lateralization without asymmetries in pretraining hemispheric activation levels. Finally, directly accelerating weight changes in one hemisphere by increasing its learning-rate parameter led to function lateralization for all callosal strengths. Of special interest is that lateralization could occur with asymmetry in size of cortical regions under three conditions. First, if asymmetric-size regions led to asymmetric pretraining activation levels, marked lateralization could occur. Second, even if one prevented secondary asymmetry in hemispheric activation levels (and thus in the rate at which individual synapses changed), some lateralization could still occur if the size asymmetry was
sufficiently dramatic (e.g., 4 to 1). Such large asymmetries in size have been observed in cortical cytoarchitectonic areas related to language (Galaburda et al., 1978). In this case, the increased numbers of synaptic weights in the model’s larger hemispheric region resulted in the larger region’s having a faster learning rate (e.g., Figure 4b). Finally, if one hemispheric region was too small to learn the associative input-output mapping while the other was not, significant lateralization could occur. Overall, these simulations also suggest an asymmetric synaptic plasticity hypothesis for biological lateralization of language and other associative functions: asymmetric synaptic plasticity in hemispheric regions forms a common causative mechanism for many cortical factors that result in behavioral lateralization. Among other things, this hypothesis leads to the testable and potentially refutable prediction that lateralization of a function to one cerebral region will be associated with increased synaptic plasticity in that region relative to its mirror-image region in the opposite hemisphere. Experimental evidence for or against this expectation could be provided in many ways, such as immunochemical or electron microscopic measures of synaptic density, or total synaptic area per unit volume, or physiological assessment of long-term potentiation. To our knowledge, the asymmetric synaptic plasticity hypothesis has not been suggested previously, and relative measures of synaptic plasticity have not been made in paired regions of cerebral cortex. This hypothesis seems particularly intriguing to us in the context of suggestions, made by others independent of any considerations of function lateralization, that the peak period of synaptogenesis during the first three years of life is an important neural correlate of language acquisition (Bates, Thal, & Janowsky, 1992). While this work clearly demonstrates that computational models can be used to investigate the emergence of function lateralization, much further work is needed to assess the generality of the results reported here. For example, we have studied only error backpropagation in this work, selecting it as an initial learning method to examine because it is effective in learning input-output associations, it has been used successfully in the past to simulate aspects of cortical neuron responses (Lehky & Sejnowski, 1988; Zipser & Anderson, 1988), it has produced interesting results from learning involving interacting hemispheric regions (Ringo et al., 1994), and there is recent experimental evidence supporting its biological relevance (Fitzsimonds, Song, & Poo, 1997). Further, we examined only diffuse and not homotopic callosal connections. Finally, we emphasize that we have considered only lateralization of a single task in this work, as a logical first step before examining more general situations. There are multiple functional hemispheric asymmetries, and the question of how these are allocated to multiple cortical areas is an important direction for future research. Work is underway to address several of these issues.
Appendix

A.1 Words Used for Training.

CAD CAP CAT COD COP COT HAD HAP HAT HID HIP HIT HOD HOP HOT LAD LAP LID LIP LIT LOP LOT MAD MAP MAT MID MIP MOP PAD PAP PAT PIP PIT POD POP POT SAD SAP SAT SIP SIT SOD SOP SOT TAP TAT TIP TIT TOP TOT
A.2 Learning Rule Equations. Symbols o, h, i, s, and t are superscripts that denote output, hidden, input, state, and target values, respectively; η is the learning rate and M the maximum activation level.

• Error at the ith output unit: $e_i^o = (a_i^t - a_i^o)$.
• Weight change for the connection from the jth hidden unit to the ith output unit: $\Delta w_{ij}^o = \eta^o a_j^h e_i^o \left[a_i^o (1 - a_i^o)\right]$.
• Error at the jth hidden unit: $e_j^h = \sum_i w_{ij}^o e_i^o \left[a_i^o (1 - a_i^o)\right]$.
• Weight change for the connection from the kth input unit to the jth hidden unit: $\Delta w_{jk}^i = \eta^i a_k^i e_j^h \left[a_j^h (1 - a_j^h / M)\right]$.
• Weight change for the connection from the lth state unit to the jth hidden unit: $\Delta w_{jl}^s = \eta^i a_l^s e_j^h \left[a_j^h (1 - a_j^h / M)\right]$.

Acknowledgments

This work was supported by NINDS award NS35460. We thank David Montgomery for assistance with programming during the early part of this work, and Rita Berndt and Ken Revett for helpful comments on the article.

References

Anninos, P., Argyrakis, P., & Skouras, A. (1984). A computer model for learning processes and the role of the cerebral commissures. Biol. Cybern., 50, 329–336.
Anninos, P., & Cook, N. (1988). Neural net simulation of the corpus callosum. Intl. J. Neurosci., 38, 381–391.
Bates, E., Thal, D., & Janowsky, J. (1992). Early language development and its neural correlates. In S. Segalowitz & I. Rapin (Eds.), Handbook of neuropsychology 7 (pp. 69–110). Amsterdam: Elsevier.
Berlucchi, G. (1983). Two hemispheres but one brain. Behav. Brain Sci., 6, 171–173.
Cook, N. (1986). The brain code. New York: Methuen.
Cook, N., & Beech, A. (1990). The cerebral hemispheres and bilateral neural nets. Int. J. Neurosci., 52, 201–210.
Dailey, M., Cottrell, G., & Padgett, M. (1997). A mixture of experts model exhibiting prosopagnosia. In M. Shafto & P. Langley (Eds.), Proc. 19th Ann. Conf. Cog. Sci. Soc. (pp. 155–160). Cincinnati: Cognitive Science Society.
Denenberg, V. (1983). Micro and macro theories of the brain. Behav. Brain Sci., 6, 174–178.
Dennis, M., & Whitaker, H. (1976). Language acquisition following hemidecortication: Linguistic superiority of left over right hemisphere. Brain and Language, 3, 404–433.
Felleman, D., & Van Essen, D. (1991). Distributed hierarchical processing in primate visual cortex. Cerebral Cortex, 1, 1–47.
Ferbert, A., Priori, A., Rothwell, J., Day, B., Colebatch, J., & Marsden, C. (1992). Interhemispheric inhibition of the human motor cortex. J. Physiol., 453, 525–546.
Fitzsimonds, R., Song, H., & Poo, M. (1997). Propagation of activity-dependent synaptic depression in simple neural networks. Nature, 388, 439–448.
Galaburda, A., & Habib, M. (1987). Cerebral dominance: Biological associations and pathology. Discussions in Neurosciences, 4(2), Foundation FESN.
Galaburda, A., Sanides, F., & Geschwind, N. (1978). Cytoarchitectonic left-right asymmetries in temporal speech region. Arch. Neurol., 35, 812–817.
Gazzaniga, M. (1987). Cognitive and neurologic aspects of hemispheric disconnection in the human brain. Discussions in Neurosciences, 4(4), Foundation FESN.
Geschwind, N., & Galaburda, A. (1987). Cerebral lateralization. Cambridge, MA: MIT Press.
Geschwind, N., & Levitsky, W. (1968). Left-right asymmetries in temporal speech region. Science, 161, 186–187.
Gur, R., Packer, I., Hungerbuhler, J., Reivich, M., Obrist, W., & Sackeim, H. (1980). Differences in distribution of gray and white matter in human cerebral hemispheres. Science, 207, 1226–1228.
Hartenstein, V., & Innocenti, G. (1981). The arborization of single callosal axons in the mouse cerebral cortex. Neurosci. Letters, 19–24.
Hauser, M. (1993). Right hemisphere dominance for production of facial expressions in monkeys. Science, 261, 475–477.
Heilman, K., & Valenstein, E. (1979). Clinical neuropsychology. New York: Oxford University Press.
Hellige, J. (1993). Hemispheric asymmetry. Cambridge, MA: Harvard University Press.
Innocenti, G. (1986). General organization of callosal connections in the cerebral cortex. In E. Jones & A. Peters (Eds.), Cerebral cortex (Vol. 5, pp. 291–353). New York: Plenum.
Jacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proc. 8th Ann. Conf. Cog. Sci. Soc., 531–546.
Jordan, M., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kinsbourne, M. (Ed.) (1978). Asymmetrical function of the brain. Cambridge: Cambridge University Press.
Kupfermann, I. (1991). Localization of higher cognitive and affective functions. In E. Kandel, J. Schwartz, & T. Jessell (Eds.), Principles of neural science (pp. 823–838). Amsterdam: Elsevier.
Lehky, S., & Sejnowski, T. (1988). Network model of shape-from-shading. Nature, 333, 452–454.
Lezak, M. (1995). Neuropsychological assessment. New York: Oxford University Press.
Loftus, W., Tramo, M., Thomas, C., Green, R., Nordgren, R., & Gazzaniga, M. (1993). Three-dimensional quantitative analysis of hemispheric asymmetry in the human superior temporal region. Cerebral Cortex, 3, 1074.
Macdonell, R., Shapiro, B., Chiappa, K., Helmers, S., Cros, D., Day, B., & Shahani, B. (1991). Hemispheric threshold differences for motor evoked potentials produced by magnetic coil stimulation. Neurology, 41, 1441–1444.
Mazziotta, J., & Phelps, M. (1984). Human sensory stimulation and deprivation: PET results and strategies. Ann. Neurol., 15, 550–560.
Meyer, B., Roricht, S., von Einseidel, H., Kruggel, F., & Weindl, A. (1995). Inhibitory and excitatory interhemispheric transfers between motor cortical areas in normal humans and patients with abnormalities of corpus callosum. Brain, 118, 429.
Pandya, D., & Seltzer, B. (1986). The topography of commissural fibers. In F. Lepore, M. Ptito, & H. Jasper (Eds.), Two hemispheres—One Brain (pp. 47–73). New York: Alan Liss.
Petersen, S., Fox, P., Posner, M., Mintun, M., & Raichle, M. (1988). PET studies of the cortical anatomy of single-word processing. Nature, 331, 585–589.
Reggia, J. (1995). Symmetries of natural and artificial neural networks. Symmetry: Culture and Science, 6, 446–449.
Reivich, M., et al. (1984). PET studies of perceptual tasks. Ann. Neurol., 15, 561–565.
Ringo, J., Doty, R., Demeter, S., & Simard, P. (1994). Time is of the essence: A conjecture that hemispheric specialization arises from interhemispheric conduction delay. Cerebral Cortex, 4, 331–343.
Ronco, E., & Gawthrop, P. (1997). Gated modular neural networks for modeling and control (Tech. Rep. No. CSC-97008). Glasgow: Department of Mechanical Engineering, University of Glasgow.
Scheibel, A. (1985). Differential characteristics of the human speech cortex. In D. Benson & E. Zaidel (Eds.), The dual brain (pp. 65–74). New York: Guilford.
Springer, S., & Deutsch, G. (1993). Left brain, right brain. San Francisco: W. H. Freeman.
Toyama, K., Tokashiki, S., & Matsunami, K. (1969). Synaptic action of commissural impulses upon association efferent cells in cat visual cortex. Brain Res., 14, 518–520.
Tucker, D., & Williamson, P. (1984). Asymmetric neural control systems in human self-regulation. Psychol. Review, 91, 185–215.
Van Kleek, M., & Kosslyn, S. (1991). Computer models in the study of cerebral
lateralization. In F. Kitterle (Ed.), Cerebral laterality: Theory and research (pp. 155–174). Hillsdale, NJ: Erlbaum.
Vargha-Khadem, F., Carr, L., Isaacs, E., Brett, E., Adams, C., & Mishkin, M. (1997). Onset of speech after left hemispherectomy in a nine-year-old boy. Brain, 120, 159–182.
Ward, J., & Hopkins, W. (1993). Primate laterality. Berlin: Springer-Verlag.
White, E. (1989). Cortical circuits. Birkhäuser.
Witelson, S., & Pallie, W. (1973). Left hemisphere specialization for language in the newborn. Brain, 96, 641–646.
Zaidel, E. (1983). Disconnection syndrome as a model for laterality effects in the normal brain. In J. Hellige (Ed.), Cerebral hemispheric asymmetry (pp. 95–151). New York: Praeger.
Zipser, D., & Anderson, R. (1988). A backpropagation programmed network that simulates response properties of posterior parietal neurons. Nature, 331, 679–684.

Received April 23, 1997; accepted November 3, 1997.
LETTER
Communicated by Peter Dayan
Nonlinear Component Analysis as a Kernel Eigenvalue Problem

Bernhard Schölkopf
Max-Planck-Institut für biologische Kybernetik, 72076 Tübingen, Germany

Alexander Smola
Klaus-Robert Müller
GMD First (Forschungszentrum Informationstechnik), 12489 Berlin, Germany
A new method for performing a nonlinear form of principal component analysis is proposed. By the use of integral operator kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map—for instance, the space of all possible five-pixel products in 16 × 16 images. We give the derivation of the method and present experimental results on polynomial feature extraction for pattern recognition.
1 Introduction

Principal component analysis (PCA) is a powerful technique for extracting structure from possibly high-dimensional data sets. It is readily performed by solving an eigenvalue problem or using iterative algorithms that estimate principal components (for reviews of the existing literature, see Jolliffe, 1986, and Diamantaras & Kung, 1996). PCA is an orthogonal transformation of the coordinate system in which we describe our data. The new coordinate values by which we represent the data are called principal components. It is often the case that a small number of principal components is sufficient to account for most of the structure in the data. These are sometimes called factors or latent variables of the data. We are interested not in principal components in input space but in principal components of variables, or features, which are nonlinearly related to the input variables. Among these are variables obtained by taking arbitrary higher-order correlations between input variables. In the case of image analysis, this amounts to finding principal components in the space of products of input pixels. To this end, we are computing dot products in feature space by means of kernel functions in input space. Given any algorithm that can be expressed solely in terms of dot products (i.e., without explicit usage of the variables themselves), this kernel method enables us to construct different nonlinear
versions of it (Aizerman, Braverman, & Rozonoer, 1964; Boser, Guyon, & Vapnik, 1992). Although this general fact was known (Burges, private communication), the machine learning community has made little use of it, the exception being support vector machines (Vapnik, 1995). In this article, we give an example of applying this method in the domain of unsupervised learning, to obtain a nonlinear form of PCA. In the next section, we review the standard PCA algorithm. In order to be able to generalize it to the nonlinear case, we formulate it in a way that uses exclusively dot products. In section 3, we discuss the kernel method for computing dot products in feature spaces. Together, these two sections form the basis for section 4, which presents the proposed kernel-based algorithm for nonlinear PCA. First experimental results on kernel-based feature extraction for pattern recognition are given in section 5. We conclude with a discussion (section 6) and an appendix containing some technical material that is not essential for the main thread of the argument.

2 PCA in Feature Spaces

Given a set of centered observations x_k, k = 1, . . . , M, x_k ∈ ℝ^N, Σ_{k=1}^{M} x_k = 0, PCA diagonalizes the covariance matrix,¹

$$C = \frac{1}{M} \sum_{j=1}^{M} x_j x_j^\top. \qquad (2.1)$$
To do this, one has to solve the eigenvalue equation,

$$\lambda v = C v, \qquad (2.2)$$

for eigenvalues λ ≥ 0 and v ∈ ℝ^N\{0}. As Cv = (1/M) Σ_{j=1}^{M} (x_j · v) x_j, all solutions v with λ ≠ 0 must lie in the span of x_1, . . . , x_M; hence, equation 2.2 in that case is equivalent to

$$\lambda (x_k \cdot v) = (x_k \cdot C v) \quad \text{for all } k = 1, \ldots, M. \qquad (2.3)$$
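Before moving to feature space, the computation so far is ordinary PCA, which the following Python/NumPy fragment sketches as an eigenvalue problem; the data and all names are illustrative only.

```python
import numpy as np

# Standard PCA as the eigenvalue problem of equations 2.1-2.2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))         # M = 100 observations in R^N, N = 5
X = X - X.mean(axis=0)                # center the data: sum_k x_k = 0
C = X.T @ X / X.shape[0]              # covariance estimate, equation 2.1
eigvals, eigvecs = np.linalg.eigh(C)  # solves lambda*v = C*v, equation 2.2
# Principal components of a point are its projections onto the leading
# eigenvectors; eigh returns eigenvalues in ascending order.
pcs = X @ eigvecs[:, ::-1][:, :2]     # first two principal components
```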
In the remainder of this section, we describe the same computation in another dot product space F, which is related to the input space by a possibly nonlinear map,

$$\Phi : \mathbb{R}^N \to F, \quad x \mapsto X. \qquad (2.4)$$

¹ More precisely, the covariance matrix is defined as the expectation of xx^⊤; for convenience, we shall use the same term to refer to the estimate in equation 2.1 of the covariance matrix from a finite sample.
Note that F, which we will refer to as the feature space, could have an arbitrarily large, possibly infinite, dimensionality. Here and in the following, uppercase characters are used for elements of F, and lowercase characters denote elements of ℝ^N. Again, we assume that we are dealing with centered data, that is, Σ_{k=1}^{M} Φ(x_k) = 0 (we shall return to this point later). Using the covariance matrix in F,

$$\bar{C} = \frac{1}{M} \sum_{j=1}^{M} \Phi(x_j) \Phi(x_j)^\top \qquad (2.5)$$

(if F is infinite dimensional, we think of Φ(x_j)Φ(x_j)^⊤ as the linear operator that maps X ∈ F to Φ(x_j)(Φ(x_j) · X)), we now have to find eigenvalues λ ≥ 0 and eigenvectors V ∈ F\{0} satisfying

$$\lambda V = \bar{C} V. \qquad (2.6)$$

Again, all solutions V with λ ≠ 0 lie in the span of Φ(x_1), . . . , Φ(x_M). For us, this has two useful consequences. First, we may instead consider the set of equations,

$$\lambda (\Phi(x_k) \cdot V) = (\Phi(x_k) \cdot \bar{C} V) \quad \text{for all } k = 1, \ldots, M, \qquad (2.7)$$

and, second, there exist coefficients α_i (i = 1, . . . , M) such that

$$V = \sum_{i=1}^{M} \alpha_i \Phi(x_i). \qquad (2.8)$$

Combining equations 2.7 and 2.8, we get

$$\lambda \sum_{i=1}^{M} \alpha_i (\Phi(x_k) \cdot \Phi(x_i)) = \frac{1}{M} \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i (\Phi(x_k) \cdot \Phi(x_j))(\Phi(x_j) \cdot \Phi(x_i)) \quad \text{for all } k = 1, \ldots, M. \qquad (2.9)$$

Defining an M × M matrix K by

$$K_{ij} := (\Phi(x_i) \cdot \Phi(x_j)), \qquad (2.10)$$

this reads

$$M \lambda K \alpha = K^2 \alpha, \qquad (2.11)$$
where α denotes the column vector with entries α_1, . . . , α_M. To find solutions of equation 2.11, we solve the eigenvalue problem,

$$M \lambda \alpha = K \alpha, \qquad (2.12)$$

for nonzero eigenvalues. A justification of this procedure is given in appendix A. Let λ_1 ≤ λ_2 ≤ · · · ≤ λ_M denote the eigenvalues of K (i.e., the solutions Mλ of equation 2.12), and α^1, . . . , α^M the corresponding complete set of eigenvectors, with λ_p being the first nonzero eigenvalue (assuming Φ ≢ 0). We normalize α^p, . . . , α^M by requiring that the corresponding vectors in F be normalized, that is,

$$(V^k \cdot V^k) = 1 \quad \text{for all } k = p, \ldots, M. \qquad (2.13)$$

By virtue of equations 2.8 and 2.12, this translates into a normalization condition for α^p, . . . , α^M:

$$1 = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k (\Phi(x_i) \cdot \Phi(x_j)) = \sum_{i,j=1}^{M} \alpha_i^k \alpha_j^k K_{ij} = (\alpha^k \cdot K \alpha^k) = \lambda_k (\alpha^k \cdot \alpha^k). \qquad (2.14)$$
For the purpose of principal component extraction, we need to compute projections onto the eigenvectors V^k in F (k = p, . . . , M). Let x be a test point, with an image Φ(x) in F; then

$$(V^k \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^k (\Phi(x_i) \cdot \Phi(x)) \qquad (2.15)$$

may be called its nonlinear principal components corresponding to Φ. In summary, the following steps were necessary to compute the principal components: (1) compute the matrix K, (2) compute its eigenvectors and normalize them in F, and (3) compute projections of a test point onto the eigenvectors.² For the sake of simplicity, we have made the assumption that the observations are centered. This is easy to achieve in input space but harder in F, because we cannot explicitly compute the mean of the Φ(x_i) in F. There is, however, a way to do it, and this leads to slightly modified equations for kernel-based PCA (see appendix B).

² Note that in our derivation we could have used the known result (e.g., Kirby & Sirovich, 1990) that PCA can be carried out on the dot product matrix (x_i · x_j)_{ij} instead of equation 2.1; however, for the sake of clarity and extendability (in appendix B, we shall consider the question how to center the data in F), we gave a detailed derivation.
Before we proceed to the next section, which more closely investigates the role of the map Φ, the following observation is essential: Φ can be an arbitrary nonlinear map into the possibly high-dimensional space F, for example, the space of all dth order monomials in the entries of an input vector. In that case, we need to compute dot products of input vectors mapped by Φ, at a possibly prohibitive computational cost. The solution to this problem, described in the following section, builds on the fact that we exclusively need to compute dot products between mapped patterns (in equations 2.10 and 2.15); we never need the mapped patterns explicitly.

3 Computing Dot Products in Feature Spaces

In order to compute dot products of the form (Φ(x) · Φ(y)), we use kernel representations,

$$k(x, y) = (\Phi(x) \cdot \Phi(y)), \qquad (3.1)$$

which allow us to compute the value of the dot product in F without having to carry out the map Φ. This method was used by Boser et al. (1992) to extend the Generalized Portrait hyperplane classifier of Vapnik and Chervonenkis (1974) to nonlinear support vector machines. To this end, they substitute a priori chosen kernel functions k for all occurrences of dot products, obtaining decision functions

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{\ell} \nu_i k(x, x_i) + b\right). \qquad (3.2)$$
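Equation 3.2 amounts to taking the sign of a kernel expansion over training patterns; the following Python sketch is our own illustration, with all names (support_vectors, nu, b) chosen by us rather than taken from the article.

```python
import numpy as np

def svm_decision(x, support_vectors, nu, b, kernel):
    # Equation 3.2: sign of the kernel expansion plus bias.
    s = sum(nu_i * kernel(x, x_i)
            for nu_i, x_i in zip(nu, support_vectors))
    return np.sign(s + b)
```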
Aizerman et al. (1964) call F the linearization space, and use it in the context of the potential function classification method to express the dot product between elements of F in terms of elements of the input space. If F is high-dimensional, we would like to be able to find a closed-form expression for k that can be efficiently computed. Aizerman et al. (1964) consider the possibility of choosing k a priori, without being directly concerned with the corresponding mapping Φ into F. A specific choice of k might then correspond to a dot product between patterns mapped with a suitable Φ. A particularly useful example, which is a direct generalization of a result proved by Poggio (1975, lemma 2.1) in the context of polynomial approximation, is

$$(x \cdot y)^d = \left(\sum_{j=1}^{N} x_j \cdot y_j\right)^d = \sum_{j_1, \ldots, j_d = 1}^{N} x_{j_1} \cdots x_{j_d} \cdot y_{j_1} \cdots y_{j_d} = (C_d(x) \cdot C_d(y)), \qquad (3.3)$$
where C_d maps x to the vector C_d(x) whose entries are all possible dth degree ordered products of the entries of x. For instance (Vapnik, 1995), if x = (x_1, x_2), then C_2(x) = (x_1², x_2², x_1 x_2, x_2 x_1), or, yielding the same value of the dot product,

$$\Phi_2(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2). \qquad (3.4)$$

For this example, it is easy to verify that $(x \cdot y)^2 = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)(y_1^2, y_2^2, \sqrt{2}\, y_1 y_2)^\top = (\Phi_2(x) \cdot \Phi_2(y))$. In general, the function

$$k(x, y) = (x \cdot y)^d \qquad (3.5)$$
corresponds to a dot product in the space of dth-order monomials of the input coordinates. If x represents an image with the entries being pixel values, we can thus easily work in the space spanned by products of any d pixels—provided that we are able to do our work solely in terms of dot products, without any explicit use of a mapped pattern Φ_d(x). The latter lives in a possibly very high-dimensional space: even though we will identify terms like x_1 x_2 and x_2 x_1 into one coordinate of F, as in equation 3.4, the dimensionality of F still is $\frac{(N+d-1)!}{d!(N-1)!}$ and thus grows like N^d. For instance, 16 × 16 pixel input images and a polynomial degree d = 5 yield a dimensionality of 10^10. Thus, using kernels of the form in equation 3.5 is our only way to take into account higher-order statistics without a combinatorial explosion of time and memory complexity. The general question of which functions k correspond to a dot product in some space F has been discussed by Boser et al. (1992) and Vapnik (1995): Mercer’s theorem of functional analysis implies that if k is a continuous kernel of a positive integral operator, there exists a mapping into a space where k acts as a dot product (for details, see appendix C). Besides equation 3.5, radial basis functions,

$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\,\sigma^2}\right), \qquad (3.6)$$

and sigmoid kernels,

$$k(x, y) = \tanh(\kappa (x \cdot y) + \Theta), \qquad (3.7)$$

have been used in support vector machines. These different kernels allow the construction of polynomial classifiers, radial basis function classifiers, and neural networks with the support vector algorithm, which exhibit very similar accuracy. In addition, they all construct their decision functions from an almost identical subset of a small number of training patterns, the support vectors (Schölkopf, Burges, & Vapnik, 1995).
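For concreteness, the kernels of equations 3.5 through 3.7, together with a numerical check of the identity behind equation 3.4, can be written in a few lines of Python/NumPy; this is our illustration, not code from the article.

```python
import numpy as np

def poly_kernel(x, y, d):                        # equation 3.5
    return np.dot(x, y) ** d

def rbf_kernel(x, y, sigma):                     # equation 3.6
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, kappa, theta):          # equation 3.7
    return np.tanh(kappa * np.dot(x, y) + theta)

# Checking equation 3.4: (x . y)^2 equals the dot product of the
# explicitly mapped patterns Phi_2(v) = (v1^2, v2^2, sqrt(2)*v1*v2).
phi2 = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2) * v[0] * v[1]])
x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(poly_kernel(x, y, 2), np.dot(phi2(x), phi2(y)))
```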
The application of equation 3.1 to our problem is straightforward. We simply substitute an a priori chosen kernel function k(x, y) for all occurrences of (Φ(x) · Φ(y)). The choice of k then implicitly determines the mapping Φ and the feature space F.

4 Kernel PCA

4.1 The Algorithm. To perform kernel-based PCA (see Figure 1), henceforth referred to as kernel PCA, the following steps have to be carried out. First, we compute the matrix K_{ij} = (k(x_i, x_j))_{ij}. Next, we solve equation 2.12 by diagonalizing K and normalize the eigenvector expansion coefficients α^n by requiring λ_n (α^n · α^n) = 1. To extract the principal components (corresponding to the kernel k) of a test point x, we then compute projections onto the eigenvectors by (cf. equation 2.15 and Figure 2),

$$(V^n \cdot \Phi(x)) = \sum_{i=1}^{M} \alpha_i^n k(x_i, x). \qquad (4.1)$$
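The three steps just listed translate directly into a short implementation. The sketch below is our Python/NumPy illustration under two stated assumptions: the centering in F treated in appendix B is omitted, and only the leading positive eigenvalues of K are used.

```python
import numpy as np

def kernel_pca(X, kernel, n_components):
    """Steps 1-3 of section 4.1; centering in F (appendix B) is omitted."""
    # Step 1: the matrix K_ij = k(x_i, x_j)  (equation 2.10).
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    # Step 2: diagonalize K (equation 2.12); eigh returns the eigenvalues
    # of K in ascending order, so reverse to sort them descending.
    eigvals, alphas = np.linalg.eigh(K)
    eigvals, alphas = eigvals[::-1], alphas[:, ::-1]
    lam = np.maximum(eigvals[:n_components], 1e-12)  # guard tiny values
    # Normalize so that lambda_n * (alpha^n . alpha^n) = 1 (equation 2.14).
    alphas = alphas[:, :n_components] / np.sqrt(lam)
    def transform(x):
        # Step 3: projections (V^n . Phi(x)) = sum_i alpha_i^n k(x_i, x),
        # equation 4.1.
        return alphas.T @ np.array([kernel(xi, x) for xi in X])
    return transform

# Usage: extract two components with a third-degree polynomial kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
transform = kernel_pca(X, lambda a, b: np.dot(a, b) ** 3, 2)
features = transform(X[0])
```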
If we use a kernel as described in section 3, we know that this procedure exactly corresponds to standard PCA in some high-dimensional feature space, except that we do not need to perform expensive computations in that space. In practice, our algorithm is not equivalent to the form of nonlinear PCA that can be obtained by explicitly mapping into the feature space F. Even though the rank of the matrix K is always limited by the sample size, we may not be able to compute this matrix if the dimensionality is prohibitively high. In that case, using kernels is imperative.

4.2 Properties of (Kernel) PCA. If we use a kernel that satisfies the conditions given in section 3, we know that we are in fact doing a standard PCA in F. Consequently, all mathematical and statistical properties of PCA (see, e.g., Jolliffe, 1986; Diamantaras & Kung, 1996) carry over to kernel-based PCA, with the modifications that they become statements concerning F rather than ℝ^N. In F, we can thus assert that PCA is the orthogonal basis transformation with the following properties (assuming that the eigenvectors are sorted in descending order of the eigenvalue size): (1) the first q (q ∈ {1, . . . , M}) principal components, that is, projections on eigenvectors, carry more variance than any other q orthogonal directions, (2) the mean-squared approximation error in representing the observations by the first q principal components is minimal, (3) the principal components are uncorrelated, and (4) the first q principal components have maximal mutual information with respect to the inputs (this holds under gaussian assumptions, and thus depends on the data and the chosen kernel). We conclude this section by noting one general property of kernel PCA in input space: for kernels that depend on only dot products or distances
Figure 1: The basic idea of kernel PCA. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just like a PCA in input space (top). Since F is nonlinearly related to input space (via Φ), the contour lines of constant projections onto the principal eigenvector (drawn as an arrow) become nonlinear in input space. Note that we cannot draw a preimage of the eigenvector in input space, because it may not even exist. Crucial to kernel PCA is the fact that there is no need to carry out the map into F. All necessary computations are carried out by the use of a kernel function k in input space (here: ℝ²).
in input space (as all the examples that we have given so far do), kernel PCA has the property of unitary invariance, following directly from the fact that both the eigenvalue problem and the feature extraction depend on only kernel values. This ensures that the features extracted do not depend on which orthonormal coordinate system we use for representing our input data.

4.3 Computational Complexity. A fifth-order polynomial kernel on a 256-dimensional input space yields a 10^10-dimensional feature space. For two reasons, kernel PCA can deal with this huge dimensionality. First, we do not need to look for eigenvectors in the full space F, but just in the subspace spanned by the images of our observations x_k in F. Second, we do not
Figure 2: Feature extraction architecture in kernel PCA (cf. equation 4.1). In the first layer, the input vector is compared to the sample via a kernel function, chosen a priori (e.g., polynomial, gaussian, or sigmoid). The outputs are then linearly combined using weights, which are found by solving an eigenvector problem.
need to compute dot products explicitly between vectors in F (which can be impossible in practice, even if the vectors live in a lower-dimensional subspace) because we are using kernel functions. Kernel PCA thus is computationally comparable to a linear PCA on ℓ observations with an ℓ × ℓ dot product matrix. If k is easy to compute, as for polynomial kernels, for example, the computational complexity is hardly changed by the fact that we need to evaluate kernel functions rather than just dot products. Furthermore, when we need to use a large number ℓ of observations, we may want to work with an algorithm for computing only the largest eigenvalues, as, for instance, the power method with deflation (for a discussion, see Diamantaras & Kung, 1996). In addition, we can consider using an estimate of the matrix K, computed from a subset of M < ℓ examples, while still extracting principal components from all ℓ examples (this approach was chosen in some of our experiments described below). The situation can be different for principal component extraction. There, we have to evaluate the kernel function M times for each extracted principal component (see equation 4.1), rather than just evaluating one dot product as for a linear PCA. Of course, if the dimensionality of F is 10^10, this is still vastly faster than linear principal component extraction in F. Still, in some cases (e.g., if we were to extract principal components as a preprocessing step for classification), we might want to speed things up. This can be done by a technique proposed by Burges (1996) in the context of support vector machines. In the present setting, we approximate each eigenvector V = Σ_{i=1}^{ℓ} α_i Φ(x_i) (see equation 2.8) by another vector Ṽ = Σ_{j=1}^{m} β_j Φ(z_j), where
m < ℓ is chosen a priori according to the desired speedup, and z_j ∈ R^N, j = 1, . . . , m. This is done by minimizing the squared difference ρ = ||V − Ṽ||². The crucial point is that this also can be done without explicitly dealing with the possibly high-dimensional space F. As

ρ = ||V||² + Σ_{i,j=1}^{m} β_i β_j k(z_i, z_j) − 2 Σ_{i=1}^{ℓ} Σ_{j=1}^{m} α_i β_j k(x_i, z_j),   (4.2)
the gradient of ρ with respect to the β_j and the z_j is readily expressed in terms of the kernel function; thus, ρ can be minimized by gradient descent.

Finally, although kernel principal component extraction is computationally more expensive than its linear counterpart, this additional investment can pay off afterward. In experiments on classification based on the extracted principal components, we found that when we trained on nonlinear features, it was sufficient to use a linear support vector machine to construct the decision boundary. Linear support vector machines, however, are much faster in classification speed than nonlinear ones. This is due to the fact that for k(x, y) = (x · y), the support vector decision function (see equation 3.2) can be expressed with a single weight vector w = Σ_{i=1}^{ℓ} ν_i x_i, as f(x) = sgn((x · w) + b). Thus the final stage of classification can be done extremely fast.
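To make the reduced-set step concrete, here is a minimal sketch (ours, not code from the article) of gradient descent on ρ of equation 4.2, over both the coefficients β_j and the points z_j. We use a gaussian kernel, for which the gradient with respect to z is available in closed form; the kernel width, step size, iteration count, and initialization are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(A, B, sigma):
    # Pairwise gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def reduced_set(X, alpha, m, sigma, lr=0.05, steps=500, seed=0):
    """Approximate V = sum_i alpha_i Phi(x_i) by sum_j beta_j Phi(z_j) by
    minimizing rho of equation 4.2 (hypothetical step size/initialization)."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), m, replace=False)].copy()  # init z_j at data points
    beta = rng.normal(scale=0.1, size=m)
    for _ in range(steps):
        Kzz = gauss_kernel(Z, Z, sigma)                 # m x m
        Kxz = gauss_kernel(X, Z, sigma)                 # l x m
        g_beta = 2.0 * (Kzz @ beta - Kxz.T @ alpha)     # d(rho)/d(beta)
        # d(rho)/d(z_m), using d k(a, z)/dz = k(a, z) (a - z) / sigma^2
        diff_zz = Z[None, :, :] - Z[:, None, :]         # z_j - z_m
        diff_xz = X[:, None, :] - Z[None, :, :]         # x_i - z_m
        term_zz = np.einsum('j,mj,mjd->md', beta, Kzz, diff_zz)
        term_xz = np.einsum('i,im,imd->md', alpha, Kxz, diff_xz)
        g_Z = (2.0 * beta[:, None] / sigma ** 2) * (term_zz - term_xz)
        beta -= lr * g_beta
        Z -= lr * g_Z
    return beta, Z
```

In practice one would monitor ρ and adapt the step size; Burges (1996) discusses this construction in the support vector setting.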
4.4 Interpretability and Variable Selection. In PCA, it is sometimes desirable to be able to select specific axes that span the subspace into which one projects in doing principal component extraction. In this way, it may, for instance, be possible to choose variables that are more accessible to interpretation. In the nonlinear case, there is an additional problem: some directions in F do not have preimages in input space. To make this plausible, note that the linear span of the training examples mapped into feature space can have dimensionality up to M (the number of examples). If this exceeds the dimensionality of input space, it is rather unlikely that each vector of the form in equation 2.8 has a preimage. To get interpretability, we thus need to find directions in input space (i.e., input variables) whose images under Φ span the PCA subspace in F. This can be done with an approach akin to the one already described: we could parameterize our set of desired input variables and run the minimization of equation 4.2 only over those parameters. The parameters can be, for example, group parameters, which determine the amount of translation, say, starting from a set of images.

4.5 Dimensionality Reduction, Feature Extraction, and Reconstruction. Unlike linear PCA, the proposed method allows the extraction of a number of principal components that can exceed the input dimensionality. Suppose that the number of observations M exceeds the input dimensionality N. Linear PCA, even when it is based on the M × M dot product matrix, can find at most N nonzero eigenvalues; they are identical to the nonzero eigenvalues of the N × N covariance matrix. In contrast, kernel PCA can find up to M nonzero eigenvalues, a fact that illustrates that it is impossible to perform kernel PCA directly on an N × N covariance matrix. Even more features could be extracted by using several kernels. Being just a basis transformation, standard PCA allows the reconstruction of the original patterns x_i, i = 1, . . . , ℓ, from a complete set of extracted principal components (x_i · v_j), j = 1, . . . , ℓ, by expansion in the eigenvector basis. Even from an incomplete set of components, good reconstruction is often possible. In kernel PCA, this is more difficult: we can reconstruct the image of a pattern in F from its nonlinear components; however, if we have only an approximate reconstruction, there is no guarantee that we can find an exact preimage of the reconstruction in input space. In that case, we would have to resort to an approximation method (cf. equation 4.2). Alternatively, we could use a suitable regression method for estimating the reconstruction mapping from the kernel-based principal components to the inputs.

5 Experiments

5.1 Toy Examples. To provide some insight into how PCA in F behaves in input space, we show a set of experiments with an artificial two-dimensional data set, using polynomial kernels (cf. equation 3.5) of degree 1 through 4 (see Figure 3). Linear PCA (on the left) leads to only two nonzero eigenvalues, as the input dimensionality is 2. In contrast, nonlinear PCA allows the extraction of further components. In the figure, note that nonlinear PCA produces contour lines (of constant feature value) that reflect the structure in the data better than those of linear PCA do. In all cases, the first principal component varies monotonically along the parabola underlying the data. In the nonlinear cases, the second and the third components show behavior that is similar for different polynomial degrees. The third component, which comes with small eigenvalues (rescaled to sum to 1), seems to pick up the variance caused by the noise, as can be nicely seen in the case of degree 2. Dropping this component would thus amount to noise reduction. Further toy examples, using radial basis function kernels (see equation 3.6) and neural network–type sigmoid kernels (see equation 3.7), are shown in Figures 4 and 5.

5.2 Character Recognition. In this experiment, we extracted nonlinear principal components from a handwritten character database, using kernel PCA in the form given in appendix B. We chose the US Postal Service (USPS) database of handwritten digits collected from mail envelopes in Buffalo. This database contains 9298 examples of dimensionality 256; 2007 of them make up the test set. For computational reasons, we decided to use a subset of 3000 training examples for the matrix K.
[Figure: 4 × 3 grid of contour plots; columns correspond to polynomial degrees 1–4, rows to the first three principal components. Eigenvalues: first row 0.709, 0.621, 0.570, 0.552; second row 0.291, 0.345, 0.395, 0.418; third row 0.000, 0.034, 0.026, 0.021. All panels span [−1, 1] on both axes.]
Figure 3: Two-dimensional toy example, with data generated in the following way: x values have uniform distribution in [−1, 1]; y values are generated from y_i = x_i² + ξ, where ξ is normal noise with standard deviation 0.2. From left to right, the polynomial degree in the kernel (see equation 3.5) increases from 1 to 4; from top to bottom, the first three eigenvectors are shown in order of decreasing eigenvalue size. The figures contain lines of constant principal component value (contour lines); in the linear case, these are orthogonal to the eigenvectors. We did not draw the eigenvectors; as in the general case, they live in a higher-dimensional feature space.
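As an illustration of the procedure behind Figure 3, the following sketch (our reconstruction, not the authors' Matlab code) generates the toy data, performs centered kernel PCA with a degree-2 polynomial kernel, and extracts the first three components. The sample size and random seed are arbitrary choices.

```python
import numpy as np

# Toy data of Figure 3: x uniform in [-1, 1], y = x^2 + gaussian noise (std 0.2).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
X = np.column_stack([x, x**2 + 0.2 * rng.normal(size=200)])

degree = 2                                        # polynomial kernel, equation 3.5
M = len(X)
K = (X @ X.T) ** degree
one = np.full((M, M), 1.0 / M)
Kc = K - one @ K - K @ one + one @ K @ one        # centering (appendix B)

lam, alpha = np.linalg.eigh(Kc)                   # eigenvalues in ascending order
lam, alpha = lam[::-1], alpha[:, ::-1]
alpha = alpha[:, :3] / np.sqrt(lam[:3])           # normalize: lam_k (a^k . a^k) = 1
features = Kc @ alpha                             # first three components (eq. 4.1)
print(lam[:3] / lam[lam > 1e-12].sum())           # eigenvalues rescaled to sum to 1
```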
To assess the utility of the components, we trained a soft margin hyperplane classifier (Vapnik & Chervonenkis, 1974; Cortes & Vapnik, 1995) on the classification task. This is a special case of support vector machines, using the standard dot product as a kernel function; it simply tries to separate the training data by a hyperplane with a large margin. Table 1 illustrates two advantages of using nonlinear kernels. First, the performance of a linear classifier trained on nonlinear principal components is better than for the same number of linear components; second, the performance for nonlinear components can be further improved by using more components than is possible in the linear case. The latter is related to the fact that there are many more higher-order features than there are pixels in an image. Regarding the first point, note that extracting a certain number of features in a 10^10-dimensional space constitutes a much higher reduction of dimensionality than extracting the same number of features in 256-dimensional input space.
Figure 4: Two-dimensional toy example with three data clusters (gaussians with standard deviation 0.1, depicted region: [−1, 1] × [−0.5, 1]): first eight nonlinear principal components extracted with k(x, y) = exp(−||x − y||²/0.1). Note that the first two principal components (top left) nicely separate the three clusters. Components 3–5 split up the clusters into halves. Similarly, components 6–8 split them again, in a way orthogonal to the above splits. Thus, the first eight components divide the data into 12 regions. The Matlab code used for generating this figure can be obtained from http://svm.first.gmd.de.
Figure 5: Two-dimensional toy example with three data clusters (gaussians with standard deviation 0.1, depicted region: [−1, 1] × [−0.5, 1]): first three nonlinear principal components extracted with k(x, y) = tanh(2(x · y) + 1). The first two principal components (top left) are sufficient to separate the three clusters, and the third component splits the clusters into halves.
For all numbers of features, the optimal degree of kernels to use is around 4, which is compatible with support vector machine results on the same data set (Schölkopf, Burges, & Vapnik, 1995). Moreover, with only one exception, the nonlinear features are superior to their linear counterparts. The resulting error rate for the best of our classifiers (4.0%) is competitive with convolutional five-layer neural networks (5.0% were reported by LeCun et al., 1989) and nonlinear support vector classifiers (4.0%; Schölkopf, Burges, & Vapnik, 1995); it is much better than linear classifiers operating directly on the image data (a linear support vector machine achieves 8.9%; Schölkopf, Burges, & Vapnik, 1995). These encouraging results have been reproduced on an object recognition task (Schölkopf, Smola, & Müller, 1996).
Table 1: Test Error Rates on the USPS Handwritten Digit Database.

                          Test Error Rate for Degree
Number of Components     1      2      3      4      5      6      7
32                       9.6    8.8    8.1    8.5    9.1    9.3   10.8
64                       8.8    7.3    6.8    6.7    6.7    7.2    7.5
128                      8.6    5.8    5.9    6.1    5.8    6.0    6.8
256                      8.7    5.5    5.3    5.2    5.2    5.4    5.4
512                      N.A.   4.9    4.6    4.4    5.1    4.6    4.9
1024                     N.A.   4.9    4.3    4.4    4.6    4.8    4.6
2048                     N.A.   4.9    4.2    4.1    4.0    4.3    4.4
Note: Linear support vector machines were trained on nonlinear principal components extracted by PCA with kernel (3.5), for degrees 1 through 7. In the case of degree 1, we are doing standard PCA, with the number of nonzero eigenvalues being at most the dimensionality of the space, 256. Clearly, nonlinear principal components afford test error rates that are superior to the linear case (degree 1).
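A modern reconstruction of the pipeline behind Table 1 might look as follows, using scikit-learn's KernelPCA and LinearSVC. This is our sketch, not the original code; the regularization constant is an assumption, and X_train, y_train, X_test, y_test stand for the USPS digit data, loaded however is convenient.

```python
from sklearn.decomposition import KernelPCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Homogeneous polynomial kernel (x . y)^4, i.e., kernel (3.5) with degree 4.
clf = make_pipeline(
    KernelPCA(n_components=256, kernel="poly", degree=4, gamma=1.0, coef0=0.0),
    LinearSVC(C=1.0),  # soft margin hyperplane on the nonlinear features
)
clf.fit(X_train, y_train)
error_rate = 1.0 - clf.score(X_test, y_test)
```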
6 Discussion

6.1 Feature Extraction for Classification. This article presented a new technique for nonlinear PCA. To develop this technique, we made use of a kernel method so far used only in supervised learning (Vapnik, 1995). Kernel PCA constitutes a first step toward exploiting this technique for a large class of algorithms. In experiments comparing the utility of kernel PCA features for pattern recognition using a linear classifier, we found two advantages of nonlinear kernels. First, nonlinear principal components afforded better recognition rates than corresponding numbers of linear principal components; and, second, the performance for nonlinear components can be improved by using more components than is possible in the linear case. We have not yet compared kernel PCA to other techniques for nonlinear feature extraction and dimensionality reduction. We can, however, compare results with other feature extraction methods used in the past by researchers working on the USPS classification problem. Our system of kernel PCA feature extraction plus linear support vector machine, for instance, performed better than LeNet1 (LeCun et al., 1989). Although the latter result was obtained a number of years ago, LeNet1 nevertheless provides an architecture that contains a great deal of prior information about the handwritten character classification problem: it uses shared weights to improve transformation invariance and a hierarchy of feature detectors resembling parts of the human visual system. In addition, our features were extracted without taking into account that we want to do classification. Clearly, in supervised learning, where we are given a set of labeled observations (x_1, y_1), . . . , (x_ℓ, y_ℓ), it
would seem advisable to make use of the labels not only during the training of the final classifier but also in the stage of feature extraction. Finally, we note that a similar approach can be taken in the case of regression estimation.

6.2 Feature Space and the Curse of Dimensionality. We are doing PCA in 10^10-dimensional feature spaces, yet getting results in finite time that are comparable to state-of-the-art techniques. In fact, however, we are not working in the full feature space, but in a comparably small linear subspace of it, whose dimension equals at most the number of observations. The method automatically chooses this subspace and provides a means of taking advantage of the lower dimensionality. An approach that consisted of explicitly mapping into feature space and then performing PCA would have severe difficulties at this point. Even if PCA were done based on an M × M dot product matrix (M being the sample size), whose diagonalization is tractable, it would still be necessary to evaluate dot products in a 10^10-dimensional feature space to compute the entries of the matrix in the first place. Kernel-based methods avoid this problem; they do not explicitly compute all dimensions of F (loosely speaking, all possible features), but work only in a relevant subspace of F.

6.3 Comparison to Other Methods for Nonlinear PCA. Starting from some of the properties characterizing PCA (see above), it is possible to develop a number of possible generalizations of linear PCA to the nonlinear case. Alternatively, one may choose an iterative algorithm that adaptively estimates principal components and make some of its parts nonlinear to extract nonlinear features. Rather than giving a full review of this field here, we briefly describe five approaches and refer readers to Diamantaras and Kung (1996) for more details.

6.3.1 Hebbian Networks. Initiated by the pioneering work of Oja (1982), a number of unsupervised neural network algorithms computing principal components have been proposed. Compared to the standard approach of diagonalizing the covariance matrix, they have advantages, for instance, when the data are nonstationary. Nonlinear variants of these algorithms are obtained by adding nonlinear activation functions. The algorithms then extract features that the authors have referred to as nonlinear principal components. These approaches, however, do not have the geometrical interpretation of kernel PCA as a standard PCA in a feature space nonlinearly related to input space, and it is thus more difficult to understand what exactly they are extracting.

6.3.2 Autoassociative Multilayer Perceptrons. Consider a linear three-layer perceptron with a hidden layer smaller than the input. If we train
it to reproduce the input values as outputs (i.e., use it in autoassociative mode), then the hidden unit activations form a lower-dimensional representation of the data, closely related to PCA (see, for instance, Diamantaras & Kung, 1996). To generalize to a nonlinear setting, one uses nonlinear activation functions and additional layers.³ While this can be considered a form of nonlinear PCA, the resulting network training amounts to solving a hard nonlinear optimization problem, with the possibility of getting trapped in local minima, and thus with a dependence of the outcome on the starting point of the training. Moreover, in neural network implementations, there is often a risk of overfitting. Another drawback of neural approaches to nonlinear PCA is that the number of components to be extracted has to be specified in advance. As an aside, note that hyperbolic tangent kernels can be used to extract neural network–type nonlinear features using kernel PCA (see Figure 5). The principal components of a test point x in that case take the form (see Figure 2) Σ_i α_i^n tanh(κ(x_i · x) + Θ).

³ Simply using nonlinear activation functions in the hidden layer would not suffice: linear activation functions already lead to the best approximation of the data (given the number of hidden nodes), so for the nonlinearities to have an effect on the components, the architecture needs to be changed to comprise more layers (see, e.g., Diamantaras & Kung, 1996).

6.3.3 Principal Curves. An approach with a clear geometric interpretation in input space is the method of principal curves (Hastie & Stuetzle, 1989), which iteratively estimates a curve (or surface) capturing the structure of the data. The data are mapped to the closest point on the curve, and the algorithm tries to find a curve with the property that each point on the curve is the average of all data points projecting onto it. It can be shown that the only straight lines satisfying this condition are principal components, so principal curves are indeed a generalization of principal components. To compute principal curves, a nonlinear optimization problem has to be solved. The dimensionality of the surface, and thus the number of features to extract, is specified in advance.

6.3.4 Locally Linear PCA. In cases where a linear PCA fails because the dependences in the data vary nonlinearly with the region in input space, it can be fruitful to use an approach where linear PCA is applied locally (e.g., Bregler & Omohundro, 1994). Possibly kernel PCA could be improved by taking locality into account.

6.3.5 Kernel PCA. Kernel PCA is a nonlinear generalization of PCA in the sense that it performs PCA in feature spaces of arbitrarily large (possibly infinite) dimensionality, and if we use the kernel k(x, y) = (x · y), we recover standard PCA. Compared to the above approaches, kernel PCA has the main advantage that no nonlinear optimization is involved; it is
essentially linear algebra, as simple as standard PCA. In addition, we need not specify the number of components that we want to extract in advance. Compared to neural approaches, kernel PCA could be disadvantageous if we need to process a very large number of observations, because this results in a large matrix K. Compared to principal curves, kernel PCA is harder to interpret in input space; however, at least for polynomial kernels, it has a very clear interpretation in terms of higher-order features.

7 Conclusion

Compared to other techniques for nonlinear feature extraction, kernel PCA has the advantages that it requires only the solution of an eigenvalue problem, not nonlinear optimization, and that, through the choice of different kernels, it comprises a fairly general class of nonlinearities. Clearly, the last point has yet to be evaluated in practice; however, for the support vector machine, the utility of different kernels has already been established. Different kernels (polynomial, sigmoid, gaussian) led to good classification performance (Schölkopf, Burges, & Vapnik, 1995). The general question of how to select the ideal kernel for a given task (i.e., the appropriate feature space), however, is an open problem.

The scene has been set for using the kernel method to construct a wide variety of rather general nonlinear variants of classical algorithms. It is beyond our scope here to explore all the possibilities, including many distance-based algorithms, in detail. Some of them are currently being investigated, for instance, nonlinear forms of k-means clustering and kernel-based independent component analysis (Schölkopf, Smola, & Müller, 1996). Linear PCA is being used in numerous technical and scientific applications, including noise reduction, density estimation, image indexing and retrieval systems, and the analysis of natural image statistics. Kernel PCA can be applied to all domains where traditional PCA has so far been used for feature extraction and where a nonlinear extension would make sense.

Appendix A: The Eigenvalue Problem in the Space of Expansion Coefficients

Being symmetric, K has an orthonormal basis of eigenvectors (β^i)_i with corresponding eigenvalues µ_i; thus, for all i, we have Kβ^i = µ_i β^i (i = 1, . . . , M). To understand the relation between equations 2.11 and 2.12, we proceed as follows. First, suppose λ, α satisfy equation 2.11. We may expand α in K's eigenvector basis as α = Σ_{i=1}^{M} a_i β^i. Equation 2.11 then reads Mλ Σ_i a_i µ_i β^i = Σ_i a_i µ_i² β^i or, equivalently, for all i = 1, . . . , M, Mλ a_i µ_i = a_i µ_i². This in turn means that for all i = 1, . . . , M,

Mλ = µ_i or a_i = 0 or µ_i = 0.   (A.1)
Note that the above are not exclusive ors. We next assume that λ, α satisfy equation 2.12 and carry out a similar derivation. In that case, we find that equation 2.12 is equivalent to Mλ Σ_i a_i β^i = Σ_i a_i µ_i β^i, that is, for all i = 1, . . . , M,

Mλ = µ_i or a_i = 0.   (A.2)
Comparing equations A.1 and A.2, we see that all solutions of the latter satisfy the former. However, they do not give its full set of solutions: given a solution of equation 2.12, we may always add multiples of eigenvectors of K with eigenvalue 0 and still satisfy equation 2.11, with the same eigenvalue. This means that there exist solutions of equation 2.11 that belong to different eigenvalues yet are not orthogonal in the space of the α_k. It does not mean, however, that the eigenvectors of C̄ in F are not orthogonal. Indeed, if α is an eigenvector of K with eigenvalue 0, then the corresponding vector Σ_i α_i Φ(x_i) is orthogonal to all vectors in the span of the Φ(x_j) in F, since (Φ(x_j) · Σ_i α_i Φ(x_i)) = (Kα)_j = 0 for all j, which means that Σ_i α_i Φ(x_i) = 0. Thus, the above difference between the solutions of equations 2.11 and 2.12 is irrelevant, since we are interested in vectors in F rather than in vectors in the space of the expansion coefficients of equation 2.8. We thus only need to diagonalize K to find all relevant solutions of equation 2.11.

Appendix B: Centering in High-Dimensional Space

Given any Φ and any set of observations x_1, . . . , x_M, the points

Φ̃(x_i) := Φ(x_i) − (1/M) Σ_{i=1}^{M} Φ(x_i)   (B.1)
are centered. Thus, the assumptions of section 2 now hold, and we go on to define the covariance matrix and K̃_ij = (Φ̃(x_i) · Φ̃(x_j)) in F. We arrive at the already familiar eigenvalue problem,

λ̃ α̃ = K̃ α̃,   (B.2)

with α̃ being the expansion coefficients of an eigenvector (in F) in terms of the points in equation B.1, Ṽ = Σ_{i=1}^{M} α̃_i Φ̃(x_i). Because we do not have the centered data (see equation B.1), we cannot compute K̃ directly; however, we can express it in terms of its noncentered counterpart K. In the following, we shall use K_ij = (Φ(x_i) · Φ(x_j)) and the notations 1_ij = 1 for all i, j, and (1_M)_ij := 1/M, to compute K̃_ij = (Φ̃(x_i) · Φ̃(x_j)):

K̃_ij = ((Φ(x_i) − (1/M) Σ_{m=1}^{M} Φ(x_m)) · (Φ(x_j) − (1/M) Σ_{n=1}^{M} Φ(x_n)))   (B.3)
     = K_ij − (1/M) Σ_{m=1}^{M} 1_im K_mj − (1/M) Σ_{n=1}^{M} K_in 1_nj + (1/M²) Σ_{m,n=1}^{M} 1_im K_mn 1_nj
     = (K − 1_M K − K 1_M + 1_M K 1_M)_ij.

We thus can compute K̃ from K and then solve the eigenvalue problem (see equation B.2). As in equation 2.14, the solutions α̃^k are normalized by normalizing the corresponding vectors Ṽ^k in F, which translates into λ̃_k (α̃^k · α̃^k) = 1. For feature extraction, we compute projections of centered Φ-images of test patterns t onto the eigenvectors of the covariance matrix of the centered points,

(Ṽ^k · Φ̃(t)) = Σ_{i=1}^{M} α̃_i^k (Φ̃(x_i) · Φ̃(t)).   (B.4)
Consider a set of test points t_1, . . . , t_L, and define two L × M matrices by K^test_ij = (Φ(t_i) · Φ(x_j)) and K̃^test_ij = ((Φ(t_i) − (1/M) Σ_{m=1}^{M} Φ(x_m)) · (Φ(x_j) − (1/M) Σ_{n=1}^{M} Φ(x_n))). As in equation B.3, we can express K̃^test in terms of K^test and arrive at K̃^test = K^test − 1′_M K − K^test 1_M + 1′_M K 1_M, where 1′_M is the L × M matrix with all entries equal to 1/M.
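As a numerical sanity check of equation B.3 and the test-point formula above (our addition, not from the article), one can compare the kernel-side centering with explicit centering in feature space, using the degree-2 homogeneous polynomial kernel, whose feature map is known in closed form:

```python
import numpy as np

def phi(X):
    # Explicit feature map of (x . y)^2 on R^2:
    # phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so <phi(x), phi(y)> = (x . y)^2.
    return np.column_stack([X[:, 0]**2, np.sqrt(2) * X[:, 0] * X[:, 1], X[:, 1]**2])

rng = np.random.default_rng(1)
X, T = rng.normal(size=(8, 2)), rng.normal(size=(5, 2))   # train / test points
M, L = len(X), len(T)

K = (X @ X.T) ** 2
one_M = np.full((M, M), 1.0 / M)
K_tilde = K - one_M @ K - K @ one_M + one_M @ K @ one_M   # equation B.3

F = phi(X)
Fc = F - F.mean(axis=0)                                    # centered explicitly
assert np.allclose(K_tilde, Fc @ Fc.T)

K_test = (T @ X.T) ** 2
one_LM = np.full((L, M), 1.0 / M)
K_test_tilde = K_test - one_LM @ K - K_test @ one_M + one_LM @ K @ one_M
Ft = phi(T) - F.mean(axis=0)            # test images, centered with training mean
assert np.allclose(K_test_tilde, Ft @ Fc.T)
```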
Appendix C: Mercer Kernels

Mercer's theorem of functional analysis (e.g., Courant & Hilbert, 1953) gives conditions under which we can construct the mapping Φ from the eigenfunction decomposition of k. If k is the continuous kernel of an integral operator K : L² → L², (Kf)(y) = ∫ k(x, y) f(x) dx, which is positive, that is,

∫ f(x) k(x, y) f(y) dx dy ≥ 0 for all f ∈ L²,   (C.1)

then k can be expanded into a uniformly convergent series,

k(x, y) = Σ_{i=1}^{∞} λ_i ψ_i(x) ψ_i(y),   (C.2)

with λ_i ≥ 0. In this case,

Φ : x ↦ (√λ₁ ψ₁(x), √λ₂ ψ₂(x), . . .)   (C.3)

is a map into F such that k acts as the given dot product, that is, (Φ(x) · Φ(y)) = k(x, y). Although formulated originally for the case where the integral operator acts on functions f from L²([a, b]), Mercer's theorem also holds if f is defined on a space of arbitrary dimensionality, provided that it is compact (e.g., Dunford & Schwartz, 1963).
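A quick finite-sample surrogate for the positivity condition C.1 (our addition): on any finite sample, a Mercer kernel's Gram matrix must be positive semidefinite, which is easy to check numerically, here for the gaussian kernel:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                           # gaussian kernel Gram matrix
print(np.linalg.eigvalsh(K).min() >= -1e-10)    # True: no negative eigenvalues
```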
Acknowledgments

A. S. and B. S. were supported by grants from the Studienstiftung des deutschen Volkes. B. S. thanks the GMD First for hospitality during two visits. A. S. and B. S. thank V. Vapnik for introducing them to kernel representations of dot products during joint work on support vector machines. Thanks to AT&T and Bell Laboratories for letting us use the USPS database and to L. Bottou, C. Burges, and C. Cortes for parts of the soft margin hyperplane training code. This work profited from discussions with V. Blanz, L. Bottou, C. Burges, H. Bülthoff, P. Haffner, Y. Le Cun, S. Mika, N. Murata, P. Simard, S. Solla, V. Vapnik, and T. Vetter. We are grateful to V. Blanz, C. Burges, and S. Solla for reading a preliminary version of the article.
References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh: ACM Press.

Bregler, C., & Omohundro, S. M. (1994). Surface learning with applications to lipreading. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems 6. San Mateo, CA: Morgan Kaufmann.

Burges, C. J. C. (1996). Simplified support vector decision rules. In L. Saitta (Ed.), Proc. 13th Intl. Conf. on Machine Learning. San Mateo, CA: Morgan Kaufmann.

Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.

Courant, R., & Hilbert, D. (1953). Methods of mathematical physics (Vol. 1). New York: Interscience.

Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks. New York: Wiley.

Dunford, N., & Schwartz, J. T. (1963). Linear operators part II: Spectral theory, self adjoint operators in Hilbert space. New York: Wiley.

Hastie, T., & Stuetzle, W. (1989). Principal curves. JASA, 84, 502–516.

Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.

Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loève procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–108.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. J. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biology, 15, 267–273.
Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics, 19, 201–209.

Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First Intl. Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.

Schölkopf, B., Smola, A., & Müller, K.-R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. No. 44). Tübingen: Max-Planck-Institut für biologische Kybernetik.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.

Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin.)

Received December 28, 1996; accepted September 18, 1997.
ARTICLE
Communicated by Misha Tsodyks
Chaotic Balanced State in a Model of Cortical Circuits
C. van Vreeswijk
H. Sompolinsky
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem, 91904 Israel
The nature and origin of the temporal irregularity in the electrical activity of cortical neurons in vivo are not well understood. We consider the hypothesis that this irregularity is due to a balance of excitatory and inhibitory currents into the cortical cells. We study a network model with excitatory and inhibitory populations of simple binary units. The internal feedback is mediated by relatively large synaptic strengths, so that the magnitude of the total excitatory and inhibitory feedback is much larger than the neuronal threshold. The connectivity is random and sparse. The mean number of connections per unit is large, though small compared to the total number of cells in the network. The network also receives a large, temporally regular input from external sources. We present an analytical solution of the mean-field theory of this model, which is exact in the limit of large network size. This theory reveals a new cooperative stationary state of large networks, which we term a balanced state. In this state, a balance between the excitatory and inhibitory inputs emerges dynamically for a wide range of parameters, resulting in a net input whose temporal fluctuations are of the same order as its mean. The internal synaptic inputs act as a strong negative feedback, which linearizes the population responses to the external drive despite the strong nonlinearity of the individual cells. This feedback also greatly stabilizes the system's state and enables it to track a time-dependent input on time scales much shorter than the time constant of a single cell. The spatiotemporal statistics of the balanced state are calculated. It is shown that the autocorrelations decay on a short time scale, yielding approximately Poissonian temporal statistics. The activity levels of single cells are broadly distributed, and their distribution exhibits a skewed shape with a long power-law tail. The chaotic nature of the balanced state is revealed by showing that the evolution of the microscopic state of the network is extremely sensitive to small deviations in its initial conditions. The balanced state generated by the sparse, strong connections is an asynchronous chaotic state. It is accompanied by weak spatial cross-correlations, the strength of which vanishes in the limit of large network size. This is in contrast to the synchronized chaotic states exhibited by more conventional network models with high connectivity of weak synapses. Neural Computation 10, 1321–1371 (1998) © 1998 Massachusetts Institute of Technology
1 Introduction

The firing patterns of neurons in the cortex of intact animals often exhibit a strong degree of temporal irregularity. This can be seen in the broad interspike interval (ISI) histograms of cortical neurons, which are typically close to those generated by a Poisson process with a short refractory period (Abeles, 1991; Bair, Koch, Newsome, & Britten, 1994; Burns & Webb, 1976; Douglas, Martin, & Whitteridge, 1991; Softky & Koch, 1993). The irregular neuronal dynamics is also manifested in intracellular recordings of the membrane potential, which exhibit strong temporal fluctuations. One of the long-standing problems in cortical dynamics is understanding the origin of this irregularity and its computational implications (Douglas & Martin, 1991; Ferster & Jagadeesh, 1992). In vitro experiments show that cortical neurons fire in a relatively regular fashion when they are injected with a constant current. Thus, the irregularity of the in vivo neuronal activity must be due to fluctuations in their synaptic input (Holt, Softky, Koch, & Douglas, 1996; Mainen & Sejnowski, 1995). These fluctuations may be due to variations in the intensity of the sensory stimuli or may result from the stochastic action of synapses. However, since cortical cells have thousands of synaptic contacts, one would expect the summation of the synaptic inputs at the soma to average out most of the fluctuations in the synaptic input and yield a membrane potential with only a small residual fluctuation. This is a particularly difficult issue in conditions where the cortex is vigorously active, so that the cell receives many synaptic inputs within a single integration time constant (Holt et al., 1996; Softky & Koch, 1993).

One possible resolution of this problem is to assume that the fluctuating synaptic inputs are substantially correlated and therefore are not averaged out. Indeed, the spike trains of pairs of neurons in cortex and in thalamus are often correlated on a relatively narrow time scale (of the order of 10 msec) (Abeles, 1991; Gray & Singer, 1989; Perkel, Gerstein, & Moore, 1967a, 1967b; Vaadia et al., 1995). However, the observed size of these correlations indicates that in general, only a small fraction of the neuronal activity is tightly correlated. Another possibility, addressed in this article, is that although the inputs to a cell are only weakly correlated, the cell is sensitive to the residual correlations in the somatic potential. Several mechanisms that generate enhanced sensitivity of a cell to small fluctuations in its potential have been explored (Bell, Mainen, Tsodyks, & Sejnowski, 1994; Ermentrout & Gutkin, in press; Gerstein & Mandelbrot, 1964; Shadlen & Newsome, 1994, 1995; Softky, 1995; Troyer & Miller, 1997). One possibility is that the excitatory and inhibitory inputs to a cortical cell are balanced in a way that leaves the cell's average potential close to threshold, so that its firing pattern is susceptible to small fluctuations. An interesting question is what might be the mechanism that leads to this balance. A recent study (Tsodyks & Sejnowski, 1995) explored the
possible involvement of local cortical dynamics in balancing excitation and inhibition. This numerical study invoked a strong stochasticity in the synaptic action in the form of a large failure probability. In a related study (Amit & Brunel, 1997a, 1997b), the variability in the network activity is at least partially due to fluctuating external inputs to the local network. In addition, neither study properly addresses important issues concerning the behavior of the models and the robustness of their variability as the network size is scaled up.

In this article, we investigate the hypothesis that the intrinsic deterministic dynamics of local cortical networks is sufficient to generate strong variability in the neuronal firing patterns. Neuronal dynamics is highly nonlinear; hence it may seem natural to expect that neuronal networks with deterministic dynamics will exhibit chaotic behavior. However, studies of simple models of large networks with a high degree of connectivity (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Grannan, Kleinfeld, & Sompolinsky, 1992; Hansel, Mato, & Meunier, 1995; van Vreeswijk, 1996; Wilson & Cowan, 1972; Tsodyks, Mitkov, & Sompolinsky, 1993) reveal that in the absence of external sources of strong stochastic noise, they tend to settle into temporally ordered states of tonic firing or oscillations. A recent extensive numerical study (Bush & Douglas, 1991; Hansel & Sompolinsky, 1992, 1996) of a model of local circuits in visual cortex with realistic conductance-based dynamics has shown the existence of parameter regimes in which these networks exhibit strongly irregular states, denoted as synchronized chaotic states. These chaotic states are generated by the emergence of strong synchrony in the fluctuating activity of different neurons, which consistently generates a strongly fluctuating feedback to each cell. Thus, this is a network realization of the scenario of correlated synaptic inputs. The resulting patterns of activity show strongly synchronized bursting, tightly timed by the common inhibitory feedback. Although bursting patterns are sometimes observed in cortical networks, these synchronized chaotic states are hard to reconcile with the Poisson-like, weakly correlated firing patterns commonly observed in cortex.

In this work, we explore the possibility that local networks with intrinsic dynamics evolve toward states that are characterized by strong chaos in conjunction with weak cross-correlations, through the mechanism of balancing between excitation and inhibition. This possibility raises several questions:

• What are the conditions under which a network evolves to a state in which the excitatory and inhibitory inputs are balanced?

• What are the characteristics of this balanced state? Does the balanced state represent a cooperative state that is qualitatively distinct from the synchronized chaotic state?

• What are the functional advantages of the balanced state?
We study these questions using a network model with the simplified dynamics of binary elements. The architecture consists of excitatory and inhibitory populations connected by sparse random connections. An essential ingredient of our model is the introduction of strong connections among the units. A cell is connected, on average, to K other cells, and K is large. However, the gap between the threshold of the cell and its resting potential is only of the order of √K excitatory inputs. Thus, the network will saturate unless a dynamically developed balance between the excitatory inputs and the inhibitory inputs to a cell emerges. Indeed, our analytical solution of the model in the limit of large network size shows that in a broad range of parameters, the network settles into a stable balanced state. An interesting feature of our theory is that it goes far beyond calculating the properties of the macroscopic order parameters. The theory yields a complete statistical characterization of the balanced state. It shows that the balanced state is associated with a strong Poisson-like firing pattern and also with a broad inhomogeneity in the average rates of individual neurons. Finally, we address the possible functional implications of the balanced state by showing that the network is capable of fast tracking of temporal changes in the external input to the network.

In section 2, we present the model's dynamics and architecture. Section 3 presents the mean-field dynamic equations for the evolution in time of the two macroscopic order parameters, which are the rates of activity of the two subpopulations. The mean-field theory is exact in the limit of large network size, N, and 1 ≪ K ≪ N. In section 4 the behavior of the population rates in the balanced state is studied. Section 5 is devoted to the spatial and temporal distribution of activity within the network. Section 6 addresses the stability of the balanced state. It shows that there is a comfortable parameter regime where the balanced state is stable. We also discuss what happens to the network when the balanced fixed point is unstable. Section 7 considers the effect of inhomogeneity in the local thresholds. We show that in the presence of inhomogeneity, the distribution of rates acquires a characteristic skewed shape with a long tail, qualitatively similar to the observed distribution of rates in populations of neurons in the cortex of behaving monkeys. In section 8, we evaluate the sensitivity of the temporal fluctuations in the local instantaneous activities to a small change in the initial condition. We conclude that a small change in the initial condition leads rapidly to a complete loss of memory of the unperturbed initial conditions. Thus, our network shows the main characteristics of chaotic systems. Section 9 studies the dynamical response of the system to dynamic changes in the external input and shows the fast tracking capabilities of the network. In section 10 we discuss the results and some open issues. Details of the theory are outlined in appendixes A and B. A preliminary report on this work was published in van Vreeswijk and Sompolinsky (1996).
[Figure: block diagram of an excitatory and an inhibitory population coupled by connections J_EE, J_EI, J_IE, J_II, with both populations also driven by an external input.]
Figure 1: A schematic representation of the network architecture. Excitatory connections are shown as open circles; inhibitory ones as filled circles.
2 The Model

We consider a network of N_1 excitatory and N_2 inhibitory neurons. The network also receives input from excitatory neurons outside it (see Figure 1). We will use either the subscript 1 or E to denote the excitatory population and 2 or I for the inhibitory one. The pattern of connections is random but fixed in time. The connection between the ith postsynaptic neuron of the kth population and the jth presynaptic neuron of the lth population, denoted J_kl^ij, is J_kl/√K with probability K/N_l and zero otherwise. Here k, l = 1, 2. The synaptic constants J_k1 are positive and J_k2 negative. Thus, on average, K excitatory and K inhibitory neurons project to each neuron. We will call K the connectivity index. The state of each neuron is described by a binary variable σ. The value σ = 0 (σ = 1) corresponds to a quiescent (active) state. The network has an asynchronous dynamics in which only one neuron updates its state at any given time. The updated state of the updating neuron at time t is

σ_k^i(t) = Θ(u_k^i(t)),   (2.1)
where Θ(x) is the Heaviside function, Θ(x) = 0 for x ≤ 0 and Θ(x) = 1 for x > 0. The total synaptic input u_k^i to the neuron, relative to the threshold θ_k, at time t is

u_k^i(t) = Σ_{l=1}^{2} Σ_{j=1}^{N_l} J_kl^ij σ_l^j(t) + u_k^0 − θ_k,   (2.2)
where u_k^0 denotes the constant external input to the kth population. As explained in appendix B, the precise definition of the order of updates is not essential. One model is a stochastic model in which each neuron updates its
state at time intervals that have Poisson statistics. This model is the simplest to analyze. However, it has the drawback that it introduces a stochastic element (the random choice of the updating neuron). An alternative model is a fully deterministic one in which each neuron updates its state at equally spaced times, where the time between updates is different for each neuron. We show in appendix B that the two models have the same mean-field equations. In both cases, the mean interval between consecutive updates of a neuron of the kth population is τ_k. We will use time units such that τ_E = 1, so that the only independent time parameter is τ ≡ τ_I. To correspond to point processes, we define a spike as the transition from the passive (0) to the active (1) state. Note that the firing rate r_k^i of neuron i in population k is different from the average value m_k^i(t) of σ_k^i, because before the cell can spike, it has to update to the passive state: if neuron i of the kth population updates to the active state in two consecutive updates, the synapses projecting from this cell stay active after the second update, even though no new spike is emitted. However, if m_k^i, which we will call the activity rate, is small, the probability of two consecutive updates to the active state is low, and thus for small m_k^i, the activity rate and the firing rate are nearly equal. Indeed, if we assume that at each update the probability of being in the active state is m_k^i (very nearly true in this model for low rates, as shown in section 5.3), the firing rate is given by r_k^i = m_k^i (1 − m_k^i)/τ_k.

A central ingredient of our model is the assumption that the total excitatory feedback current and the total inhibitory current into a cell are large compared to the neuronal threshold. We model this by choosing thresholds θ_k that are of order 1 and by assuming that the strength of individual synapses is of order 1/√K, that is, the coefficients J_kl are of order unity. Furthermore, as will be seen later, it is crucial that the excitatory inputs from the external sources, too, are large compared to the threshold. This is modeled by writing these inputs as

u_k^0 = E_k m_0 √K, k = 1, 2,   (2.3)

where E_k is of order unity and 0 < m_0 < 1 represents the mean activity of the external neurons. We will use the notation

E_1 = E   (2.4)

for the external input to the excitatory population and

E_2 = I   (2.5)
for the external input to the inhibitory neurons. We assume that the external input is temporally regular. Since the model neurons are threshold elements, the absolute scale of u_k^i is irrelevant. We therefore set

J_EE = J_IE = 1,   (2.6)
Chaotic Balanced State in a Model of Cortical Circuits
1327
so that the only connection parameters from the network are the inhibitory and external ones. We will denote the former as

J_E ≡ −J_EI, J_I ≡ −J_II,   (2.7)
where J_I, J_E > 0.

3 Mean-Field Equations for Population Rates

The dynamics of our model can be described by mean-field theory, which is exact in the limit of large N_k. To define this limit, we assume that N_I/N_E is held fixed as the network size N = N_E + N_I grows. The nature of the mean-field theory depends on the assumed relationship between the network size and the connectivity index. Conventional mean-field theory assumes that the networks are fully connected, defined here to mean that K/N is fixed as N → ∞. Here we assume sparse connectivity, defined by assuming that K is fixed as N grows. We are primarily interested in the temporal variability that is present in highly connected networks, which are either fully connected or sparsely connected with a large connectivity index K. Therefore we will focus on the case of large K. Technically, we will first take the limit N → ∞ and then the limit K → ∞. In reality, networks have a large fixed size and connectivity, so that the distinction between full and sparse connectivity may be problematic. Nevertheless, the sparse limit is appropriate as long as

1 ≪ K ≪ N_k, k = 1, 2.   (3.1)
Nk 1 X σ i (t) , k = 1, 2, Nk i=1 k
(3.2)
where [. . .] denotes a population average. In appendix A we show that the average activities satisfy in the large K limit µ ¶ −uk d . (3.3) τk mk (t) = −mk (t) + H √ dt αk Here H is the complementary error function, Z ∞ dx 2 H(z) ≡ √ e−x /2 , 2π z
(3.4)
Figure 2: Complementary error function H(x). The error function varies sigmoidally from 1 for x → −∞ to 0 at x → ∞.
shown in Figure 2. The quantities u_k(t) and α_k(t) are

u_k(t) = √K ( Σ_{l=1}^{2} J_kl m_l(t) + E_k m_0 − θ_k )   (3.5)

and

α_k(t) = Σ_{l=1}^{2} (J_kl)² m_l(t),   (3.6)
respectively. Equation 3.5 gives the population average of the total input to a neuron in the kth population, relative to threshold. Equation 3.6 gives the variance of this input. Note that the external population does not contribute to the variance, because we assumed that the external input is the same for all the neurons in a population. In the case of a constant external input, m_0(t) = m_0, the network settles into a state in which the average activities are constant, m_k(t) = m_k, given by the stable fixed points of equation 3.3,

m_k = H(−u_k/√α_k),   (3.7)
where the mean inputs are

u_E = (E m_0 + m_E − J_E m_I) √K − θ_E,   (3.8)
u_I = (I m_0 + m_E − J_I m_I) √K − θ_I.   (3.9)

The variance of the inputs is

α_k = m_E + J_k² m_I.   (3.10)
Equation 3.3 reflects the fact that the instantaneous input to each neuron, u_k^i(t), fluctuates across the population of neurons and that these fluctuations obey gaussian statistics in the large K limit. The expressions for the mean and variance of the input to a cell can be derived in the large K limit by the following arguments. The population average of the inputs is

u_k(t) = [u_k^i(t)] = Σ_{l=1}^{2} Σ_{j=1}^{N_l} [J_kl^ij] [σ_l^j(t)] + u_k^0 − θ_k.   (3.11)
The population average [J_kl^ij] is equivalent to a quenched average over the random connectivity and is therefore equal to J_kl √K/N, yielding equation 3.5. Note that on the right-hand side (r.h.s.) of equation 3.11 we have neglected the correlations between the random fluctuations in the activity of a cell and the particular realization of its output connectivity. This is justified since such correlations are weak in the large N limit. Similarly, the variance α_k of the input is

α_k(t) = [(δu_k^i(t))²] = Σ_{l=1}^{2} Σ_{j=1}^{N_l} [(δ(J_kl^ij σ_l^j(t)))²],   (3.12)

where δX ≡ X − [X]. Observing that [(J_kl^ij σ_l^j(t))²] = J_kl² m_l/N, whereas [(J_kl^ij σ_l^j(t))]² = J_kl² m_l² K/N², which is negligible, one obtains equation 3.6.
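For illustration (our sketch, not from the article), the mean-field dynamics of equations 3.3 through 3.6 can be integrated directly. The parameter values are borrowed from the caption of Figure 3 below, while τ_I, the time step, and the initial condition are arbitrary choices (section 6 discusses when the balanced fixed point is stable):

```python
import numpy as np
from scipy.special import erfc

def H(z):
    # Complementary error function of equation 3.4: H(z) = (1/2) erfc(z / sqrt(2)).
    return 0.5 * erfc(z / np.sqrt(2.0))

K, m0 = 1000, 0.1
E, Iext, JE, JI = 1.0, 0.8, 2.0, 1.8        # values from the Figure 3 caption
theta = np.array([1.0, 0.7])                 # theta_E, theta_I
J = np.array([[1.0, -JE],
              [1.0, -JI]])                   # J_EE = J_IE = 1, J_EI = -J_E, J_II = -J_I
Ek = np.array([E, Iext])
tau = np.array([1.0, 0.9])                   # tau_E = 1 by convention; tau_I assumed

m = np.array([0.05, 0.05])                   # population activities (m_E, m_I)
dt = 0.01
for _ in range(20000):                       # Euler integration of equation 3.3
    u = np.sqrt(K) * (J @ m + Ek * m0) - theta   # equation 3.5
    alpha = (J ** 2) @ m                         # equation 3.6
    m = m + dt / tau * (-m + H(-u / np.sqrt(alpha)))
print(m)                # close to A_E m0, A_I m0, up to finite-K corrections
print(J @ m + Ek * m0)  # the O(sqrt(K)) drive cancels: net input is O(1/sqrt(K))
```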
4 Population Rates in the Balanced State

In a balanced state, the temporal fluctuations in the inputs are of the same order as the mean input relative to threshold (even when K is large). To show this, we need to probe the network's temporal properties. Here we study the necessary consequences of the balanced state for the behavior of the population rates. A necessary condition for a balanced state is that neither the excitatory nor the inhibitory population fires at its maximum rate, or is completely silent, when we take the limit
K → ∞. In other words, we look for solutions with 0 < m_k < 1 in the large K limit. To have equilibrium rates with m_k ≠ 0, 1 in the large K limit, both u_E and u_I have to remain finite in this limit. This means that the r.h.s. of equations 3.8 and 3.9 vanish to leading order, which leads to the following equations:

E m_0 + m_E − J_E m_I = O(1/√K),   (4.1)
I m_0 + m_E − J_I m_I = O(1/√K).   (4.2)
Thus, in the large K limit we obtain

m_E = [(J_I E − J_E I)/(J_E − J_I)] m_0 ≡ A_E m_0,   (4.3)
m_I = [(E − I)/(J_E − J_I)] m_0 ≡ A_I m_0.   (4.4)
Since both A_E and A_I have to be positive, the coupling strengths have to satisfy

E/I > J_E/J_I > 1   (4.5)

or

E/I < J_E/J_I < 1.   (4.6)
Besides this balanced solution, we should also examine the possibility of unbalanced solutions in which either m_k = 0 and u_k is of order √K and negative, or m_k = 1 and u_k is of order √K and positive. Equation 4.6 admits an unbalanced solution in which m_E = 0. In this solution, m_I is to leading order in K given by m_I = I m_0/J_I (since the leading order in u_I should vanish), so that

u_E = √K (E − J_E I/J_I) m_0 < 0.   (4.7)
Furthermore, if J_E < 1 and J_I < 1, there exists a solution with m_E = m_I = 1 even for m_0 = 0. In this solution, u_k satisfies to leading order

u_k = √K (1 − J_k), k = E, I,   (4.8)

so u_k is of order √K and positive.
Thus, if we require that there be no stationary solutions with m_E = 0, 1 or m_I = 0, 1 for small m_0, the following constraints have to be satisfied:

E/I > J_E/J_I > 1,   (4.9)
J_E > 1.   (4.10)
(4.11)
Thus, equation 3.7 reduces to uk =
√ αk h(mk ).
(4.12)
The activity of neurons in cortex is usually much less than the saturation rate. It is therefore useful to consider the limit where m0 ¿ 1. In this regime, mk ¿ 1, and we can use the approximation H(x) ≈
exp(−x2 /2) √ 2π |x|
(4.13)
to obtain q h(m) ≈ − 2| log m|.
(4.14)
1332
C. van Vreeswijk and H. Sompolinsky
Substituting this result in equation 4.12 yields q uk ≈ − 2 αk | log mk | .
(4.15)
This relation between mk and uk will be needed to calculate the rate distribution (see section 5.1). √ 4.2 Finite K Corrections. For finite K the residuals of order 1/ K in the rates are not negligible, so that equations 4.3 and 4.4 no longer hold exactly. For finite K, the equilibrium activities satisfy mk = Fk (mE , mI ), with Fk given by equation A.5. However, as long as mk À K−1 ,
(4.16)
the gaussian assumption of the input statistics is a good approximation; hence, equations 3.7 still hold. Thus, the leading finite K corrections can be incorporated by resorting to the full mean-field equations: equations 3.3 through 3.10. In particular, the finite K equations for the fixed point are Em0 + mE − JE mI = (θE + Im0 + mE − JI mI = (θI +
√ √ αE h(mE ))/ K.
√ √ αI h(mI ))/ K.
(4.17) (4.18)
As long as m0 is not small, the r.h.s. of these equations are small for large K; hence the corrections to the linear solution, equations 4.3 √ and 4.4, are small. When m0 becomes sufficiently small (i.e., of order 1/ K or less), the strong nonlinearity in the single neuron dynamics reveals itself in a strong nonlinearity in the population response. In particular, the effect of the single neuron threshold θk becomes important. This is seen in Figure 3, where the population rates are evaluated by the finite K equations—equations 4.17 and 4.18—with K = 1000. For comparison, we also show the straight lines predicted by the large K limit. Since the steady-state rates in cortical networks are usually low, it is sometimes useful to incorporate the leading finite K corrections. Whenever we refer in subsequent figures to explicit values for K, we use equations 4.17 and 4.18, unless otherwise stated. Except for thresholding the population rates, the finite K corrections affect only the quantitative results, not the qualitative predictions of the simple large K theory. 5 Spatial and Temporal Variability So far we have been concerned only with the population average rates mk . However, the fact that the population averages are not saturated does not necessarily imply that the system’s state exhibits strong temporal variations.
Chaotic Balanced State in a Model of Cortical Circuits
1333
Network rates
0.6
0.4
0.2
0
0
0.1
0.2
External rate
0.3
Figure 3: The mean activity of the excitatory population (thick solid line) and the inhibitory population (thick dashed line) as a function of the input activity. For a network in which cells receive input from, on average, 1000 cells in each population. For comparison, the activities in the large K limit are also shown (thin solid line for the excitatory and thin dashed line for the inhibitory population, respectively). Parameter values: E = 1, I = 0.8, JE = 2, JI = 1.8, θE = 1, and θI = 0.7.
Specifically, a population average excitatory rate, mk , may be the outcome of a fluctuating state where all the cells in the kth population fire a fraction mk of the time. However, it can also be achieved by a frozen state in which a fraction mk of the cells fire every time these cells are updated, while all other cells never fire. In other words, the population average does not distinguish between temporal and spatial fluctuations of activity levels. Fortunately, the mean-field theory fully characterizes the statistics of both the spatial and the temporal fluctuations in the activities in the balanced state. The statistics can be expressed by writing the instantaneous activity of a cell as a threshold function of two random variables xi and yi (t), ³ ´ p p σki (t) = 2 −uk + βk xi + αk − βk yi (t) .
(5.1)
The means u_k are given by equations 3.8 and 3.9. The parameter β_k is given by

\beta_k = q_E + J_k^2 q_I.  (5.2)
The order parameter q_k is defined as

q_k = \frac{1}{N_k} \sum_{i=1}^{N_k} (m_k^i)^2,  (5.3)

where m_k^i is the time-averaged activity rate of the ith cell,

m_k^i \equiv \langle \sigma_k^i(t) \rangle.  (5.4)
The symbol ⟨...⟩ denotes an average over long times. Both x_i and y_i(t) are independent gaussian variables with zero mean and unit variance.

Quenched fluctuations of synaptic inputs. The term proportional to x_i represents a quenched random component of the synaptic input received by different cells and thus represents a spatial inhomogeneity in the rates. The origin of this inhomogeneity is twofold. Since the connectivity is random in our model, cells may differ in the number of synaptic inputs they have. This component is given by

\delta_1 \langle u_k^i \rangle = \sum_{l=1}^{2} \sum_{j=1}^{N_l} \delta J_{kl}^{ij}\, [m_l^j].  (5.5)
Here, \delta J_{kl}^{ij} = J_{kl}^{ij} - [J_{kl}^{ij}]. In addition, different neurons are connected to different cells, so that even if all the cells received the same number of inputs, the system would evolve into a state with a self-consistently developed spatial inhomogeneity. The second component can be written as

\delta_2 \langle u_k^i \rangle = \sum_{l=1}^{2} \sum_{j=1}^{N_l} J_{kl}^{ij}\, \delta m_l^j,  (5.6)

where \delta m_l^i \equiv m_l^i - m_l. Adding the two contributions yields

\left[ \left( \delta \langle u_k^i(t) \rangle \right)^2 \right] = \sum_{l=1}^{2} J_{kl}^2\, q_l = q_E + J_k^2 q_I = \beta_k.  (5.7)
Thus, this variance represents the fluctuation in both the number and the identity of input cells to the different cells.
Temporal fluctuations of synaptic inputs. The term in equation 5.1 that is proportional to y_i(t) represents the stochastic component of the inputs to a cell—a temporally fluctuating component with a short-time correlation. This can be written as

u_k^i(t) - \langle u_k^i \rangle = \sum_{l=1}^{2} \sum_{j=1}^{N_l} J_{kl}^{ij} \left( \sigma_l^j(t) - m_l^j \right),  (5.8)
from which one obtains

\left[ \left\langle \left( u_k^i(t) - \langle u_k^i \rangle \right)^2 \right\rangle \right] = \sum_{l=1}^{2} J_{kl}^2 (m_l - q_l) = m_E - q_E + J_k^2 (m_I - q_I) = \alpha_k - \beta_k.  (5.9)
Note that the variance of the temporal fluctuations in the inputs depends on m_k − q_k, which measures the temporal variability of the state.

5.1 Distribution of Time-Averaged Rates. The distribution of rates in the kth population is defined as

\rho_k(m) \equiv N_k^{-1} \sum_{i=1}^{N_k} \delta(m - m_k^i).  (5.10)
The statistics of the time-averaged local rates can be derived by averaging equation 5.1 over y_i(t) (which is equivalent to an average over time),

m_k^i = m_k(x_i) = H\left( \frac{-u_k + \sqrt{\beta_k}\, x_i}{\sqrt{\alpha_k - \beta_k}} \right).  (5.11)
Thus, the distribution of m_k^i is fully determined by its first two moments. Averaging this equation over x_i yields equation 3.7. Similarly, squaring equation 5.11 and averaging over x_i yields

q_k = \int Dx \left[ H\left( \frac{-u_k + \sqrt{\beta_k}\, x}{\sqrt{\alpha_k - \beta_k}} \right) \right]^2.  (5.12)
Here we have used the gaussian measure, Dx ≡ dx exp(−x²/2)/√(2π). In general, q_k satisfies (m_k)² ≤ q_k ≤ m_k. The smaller q_k is, the more homogeneous is the rate distribution. In a frozen state in which a fraction m_k of the cells are active every time they are updated, while all other cells are always quiescent, q_k is given by q_k = m_k. On the other hand, if all cells in the population have a probability m_k of being active each time they are updated,
m_k^i = m_k and q_k = (m_k)². Equations 5.12 have two solutions: an unstable solution with q_k = m_k, corresponding to a frozen state, and a stable solution, (m_k)² < q_k < m_k, which corresponds to a temporally fluctuating state. Although the frozen solution is unstable, its existence highlights the fact that the temporal variability in our system is purely of deterministic origin and is not induced by external stochastic sources. Generalizing equation 5.12, we can write

\rho_k(m) = \int Dx\, \delta(m - m_k(x)).  (5.13)
In section A.1 we analyze the properties of this distribution. A numerical evaluation of ρ_k(m) is shown in Figure 4, which displays the rate distribution of the excitatory activity for different values of m_E. The distribution is plotted against m/m_E. The synaptic couplings were kept constant, while the mean rates were varied by adjusting the external rate m_0. For high mean activity levels, the distribution has a pronounced skewed shape. Note, however, that according to equation 5.11 the distribution of the time-averaged inputs u_k^i to the cells is gaussian for all values of m_k. In the low rate limit, m_0 ≪ 1, equation 5.12 can be solved using equations 4.13 through 4.15, yielding to leading order,

q_k = m_k^2 + O(m_k^3 |\log m_k|).  (5.14)
Thus, if the network evolves to a state with low average activity levels, m_k ≪ 1, q_k is slightly larger than m_k². The fact that q_k ≪ m_k implies that the balanced state is characterized by strong temporal fluctuations in the activity of the individual cells. On the other hand, the fact that q_k is not exactly equal to m_k² reflects the spatial inhomogeneity in the time-averaged rates within a population. Equation 5.14 implies that when the mean activity m_k decreases, the width of the distribution is proportional to (m_k)^{3/2}; it decreases faster than the mean m_k. Thus, for low mean activity, ρ_k(m) becomes narrowly peaked at m = m_k, as shown in Figure 4. The reason for the narrow peak is that in our model, the fluctuations in the input are related to the fluctuations in the feedback from the network; hence their variance becomes small as the activity in the network decreases (see equation 5.2).

5.2 Time-Delayed Autocorrelations. In order to complete the statistical characterization of the balanced state, we have to determine the statistics of the temporal fluctuations in the activities of single cells or, equivalently, the temporal fluctuations in their input. We have already stated that the temporal fluctuations in u_k^i(t) obey gaussian statistics, with variance given by α_k − β_k. To characterize its statistics fully, we have to evaluate its autocorrelations.
[Figure 4 here: density vs. rate/mean rate]
Figure 4: Distribution of the activity rates of the excitatory population for two different values of the average rate in the large K limit: mE = 0.01 (solid line) and mE = 0.1 (dashed line). The distributions are shown as a function of the local rate divided by the mean rate. Parameter values as in Figure 3.
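The calculation behind Figure 4 can be sketched numerically. In the snippet below, the equilibrium rates m_E and m_I are illustrative placeholders (in the text they follow from the balance conditions), u_k is fixed by m_k = H(−u_k/√α_k), and equation 5.12 is iterated to its stable fixed point before the local rates of equation 5.11 are sampled.

```python
# Sketch: solve equation 5.12 for q_k by fixed-point iteration, then sample
# the local rates of equation 5.11 over the quenched gaussian variable x.
import numpy as np
from scipy.special import erfc, erfcinv

H = lambda z: 0.5 * erfc(z / np.sqrt(2.0))
h = lambda m: np.sqrt(2.0) * erfcinv(2.0 * m)

JE, JI = 2.0, 1.8
mE, mI = 0.10, 0.08                        # assumed equilibrium rates
aE, aI = mE + JE**2 * mI, mE + JI**2 * mI  # alpha_k (equation 5.9)
uE, uI = -np.sqrt(aE) * h(mE), -np.sqrt(aI) * h(mI)

x = np.random.default_rng(0).normal(size=200_000)  # one quenched x per cell
qE, qI = mE**2, mI**2                      # start from the homogeneous value
for _ in range(300):                       # iterate equation 5.12
    bE, bI = qE + JE**2 * qI, qE + JI**2 * qI      # beta_k (equation 5.2)
    qE = np.mean(H((-uE + np.sqrt(bE) * x) / np.sqrt(aE - bE)) ** 2)
    qI = np.mean(H((-uI + np.sqrt(bI) * x) / np.sqrt(aI - bI)) ** 2)

bE = qE + JE**2 * qI
rates = H((-uE + np.sqrt(bE) * x) / np.sqrt(aE - bE))  # equation 5.11
print(f"qE = {qE:.4f}  (mE^2 = {mE**2:.4f}, mE = {mE:.2f})")
print(np.histogram(rates / mE, bins=8, range=(0, 4), density=True)[0])
```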
Using arguments similar to those already outlined, it is straightforward to show that the autocorrelation of the input is linearly related to the autocorrelations in the local activities,

\beta_k(\tau) = \left[ \langle \delta u_k^i(t)\, \delta u_k^i(t+\tau) \rangle \right] = q_E(\tau) + J_k^2 q_I(\tau),  (5.15)

where q_k(τ) is the time-delayed autocorrelation of the local activities,

q_k(\tau) = N_k^{-1} \sum_{i=1}^{N_k} \langle \sigma_k^i(t)\, \sigma_k^i(t+\tau) \rangle,  (5.16)
and as before ⟨...⟩ denotes an average over t. Note that q_k(0) = m_k, whereas q_k(τ → ∞) = q_k. Likewise, β_k(0) = α_k, whereas β_k(τ → ∞) = β_k. Using this relation, the following self-consistent equation for q_k(τ) (with τ ≥ 0) is obtained:

\tau_k \frac{dq_k(\tau)}{d\tau} = -q_k(\tau) + \int_0^\infty \frac{dt}{\tau_k}\, \exp(-t/\tau_k) \int Dx \left[ H\left( \frac{-u_k - \sqrt{\beta_k(t+\tau)}\, x}{\sqrt{\alpha_k - \beta_k(t+\tau)}} \right) \right]^2.  (5.17)
[Figure 5 here: autocorrelation vs. t]
Figure 5: Population-averaged autocorrelation for the excitatory population in the large K limit (solid line). The dashed line shows the autocorrelation for a population of cells with the same rate distribution but Poissonian updating. Parameter values: τ = 0.9, m0 = 0.1, and other parameters as in Figure 3.
Note that the integral over t in equation 5.17 results from averaging over the distribution of update time intervals. The solution of this integral equation yields a function q_k(τ), which decays to its equilibrium value with a time constant of the order of τ_k. A numerical solution of equation 5.17 for q_k(τ) is shown in Figure 5. As can be seen, the autocorrelations are larger than those predicted by Poisson statistics. This enhancement of short-time correlations reflects the refractoriness in the activities of the cells that project to the cell.

5.3 Numerical Realization of Synaptic Inputs to a Cell. In order to demonstrate the nature of the fluctuating synaptic inputs to a single excitatory cell in the balanced state, we have numerically generated samples of stochastic gaussian processes, which simulate the fluctuations of the synaptic inputs to a single excitatory cell. In order to show explicitly the effect of balancing, we have simulated separately the total excitatory and inhibitory components of u_E^i(t). The time average of the total excitatory (inhibitory) component is itself sampled from a gaussian distribution with a mean √K(m_E + E m_0) (√K J_E m_I) and a variance q_E (J_E² q_I). The time-dependent fluctuations of the total excitatory (inhibitory) input have a time-delayed autocorrelation equal to q_E(τ) − q_E (J_E² (q_I(τ) − q_I)).
[Figure 6 here: input traces and spike times vs. t]
Figure 6: Temporal structure of the input to an excitatory cell. The upper panel shows the total excitatory input, consisting of the external input and the excitatory feedback (upper trace), the total inhibitory input (lower trace), as well as the net input (middle trace). They are calculated by sampling from the time-correlated gaussian statistics predicted by the theory. The times when the cell switched to the active state are indicated. Parameter values: m_0 = 0.04 and other parameters as in Figure 5. K = 1000 was used to calculate the average input.
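A minimal version of this construction is sketched below. The rates and order parameters are illustrative placeholders, and the temporal kernel q_k(τ) − q_k is approximated by an exponential (Ornstein-Uhlenbeck) correlation with time constant τ_E; this is a simplifying assumption, since the exact kernel is the solution of equation 5.17.

```python
# Sketch of the Figure 6 construction: gaussian input traces for one cell with
# the theory's means and variances; exponential temporal correlations assumed.
import numpy as np

rng = np.random.default_rng(0)
K, E, JE = 1000, 1.0, 2.0
m0, mE, mI = 0.10, 0.10, 0.08      # illustrative rates, not fitted values
qE, qI = 0.012, 0.008              # illustrative q_k, slightly above m_k^2
tauE, dt, T = 1.0, 0.05, 100.0
n = int(T / dt)

def ou(var, tau):                  # stationary Ornstein-Uhlenbeck path
    a = np.exp(-dt / tau)
    y = np.empty(n)
    y[0] = rng.normal(0.0, np.sqrt(var))
    for i in range(1, n):
        y[i] = a * y[i - 1] + np.sqrt(var * (1 - a * a)) * rng.normal()
    return y

exc = (np.sqrt(K) * (mE + E * m0)          # mean of the excitatory component
       + rng.normal(0, np.sqrt(qE))        # quenched (cell-to-cell) part
       + ou(mE - qE, tauE))                # temporal fluctuations
inh = (np.sqrt(K) * JE * mI
       + rng.normal(0, np.sqrt(JE**2 * qI))
       + ou(JE**2 * (mI - qI), tauE))
net = exc - inh                            # O(1) although each part is O(sqrt K)
print(f"<exc> = {exc.mean():.2f}, <inh> = {inh.mean():.2f}, <net> = {net.mean():.2f}")
```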
The results are shown in Figure 6, where we have used K = 1000. They demonstrate that the total excitatory and inhibitory inputs are large compared to the threshold and have fluctuations that are small compared to their mean. Because the network is in the balanced state, the net input is of the same order as the threshold, and the fluctuations bring the input above threshold at irregular intervals. In the lower part of the figure, we show the output state of the cell. This is evaluated by generating the sequence of update times and thresholding the net input at these times. Because of the update rule, the cell does not switch from passive to active every time the net input crosses the threshold. In Figure 7 we present the ISI histogram of the cell. Because the interval between spikes is a convolution of two random events—first, a transition from 1 to 0, and then a transition from 0 to 1—the ISI vanishes at small intervals. Thus, the above definition of a spike to some extent captures the refractoriness of real spikes. In fact, if we ignore the short time correlations in the activities, the ISI of the ith (say, excitatory) cell with an average rate
[Figure 7 here: P(t) vs. t]
Figure 7: Interspike interval distribution for the cell shown in Figure 6 (solid line). The distribution was determined by measuring the time between consecutive switches from the inactive to the active state until 5 × 105 intervals had been accumulated. The dashed line shows the interspike interval distribution for Poissonian updating with mi = 0.06.
m_i can be shown to be simply

I_i(t) = \frac{m_i (1 - m_i)}{\tau_E (1 - 2m_i)} \left( \exp(-m_i t/\tau_E) - \exp(-(1 - m_i) t/\tau_E) \right), \quad t \ge 0.  (5.18)
This function rises linearly from zero and peaks at t ∝ τE . For intervals of the order of τE /mi or longer, I(t) decays purely exponentially with a decay constant mi /τE as in the ISI of a single Poisson process with a rate mi /τE . Comparison with Figure 7 shows that this is indeed a very good approximation of the ISI of our model. Finally, it should be noted that because of the sparsity of the connectivity, different cells receive input from different subpopulations, so that the fluctuations in their input will be only very weakly correlated. As a result, the correlations in their activity will be very small.
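Equation 5.18 is easy to verify by direct simulation. In the sketch below (our construction), a single binary cell is updated at Poisson times with activation probability m_i per update, ignoring the short-time correlations in the input, and the empirical interspike interval histogram is compared with I_i(t).

```python
# Sketch: ISI of a binary cell with Poisson updates, versus equation 5.18.
import numpy as np

rng = np.random.default_rng(1)
tauE, mi = 1.0, 0.06
t = np.cumsum(rng.exponential(tauE, size=2_000_000))  # update times
s = rng.random(t.size) < mi                           # state set at each update
spikes = t[1:][s[1:] & ~s[:-1]]                       # 0 -> 1 transitions
isi = np.diff(spikes)

def I(t):                                             # equation 5.18
    c = mi * (1 - mi) / (tauE * (1 - 2 * mi))
    return c * (np.exp(-mi * t / tauE) - np.exp(-(1 - mi) * t / tauE))

edges = np.linspace(0.0, 80.0, 9)
emp, _ = np.histogram(isi, bins=edges, density=True)
for tm, e in zip(0.5 * (edges[:-1] + edges[1:]), emp):
    print(f"t = {tm:5.1f}   simulated {e:.4f}   eq. 5.18 {I(tm):.4f}")
```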
6 Stability of the Balanced State

To determine the stability of the balanced state, we have to study the response of the system to small perturbations in the population activity rates. However, because of the nature of the balanced state, we have to distinguish two scales of perturbations: local perturbations, in which the deviations in the rates are small compared to 1/√K, and global perturbations, in which these deviations are large compared to 1/√K.

6.1 Local Stability. Local stability of the balanced state requires that a sufficiently small perturbation in the population rates will decay to zero. In our case, a sufficiently small perturbation means that it initially causes only a small disruption of the balanced state. This means that the perturbations are small not only compared to m_k but also compared to 1/√K, so that the perturbations of the inputs to the cells are initially small. We therefore consider a solution of equations 3.3 with an initial condition m_k(0) = m_k + δm_k(0), with a small δm_k(0) where |δm_k(0)| ≪ 1/√K. In this case, the perturbation of the total mean input u_k is also small; hence we can linearize the dynamic equations 3.3 around their fixed point. Thus, δm_k(t) = m_k(t) − m_k satisfy a linear equation of the form

\tau_k \frac{d\, \delta m_k(t)}{dt} = -\delta m_k(t) + \sqrt{K} \sum_{l=1,2} f_{kl}\, \delta m_l(t).  (6.1)
Calculating f_kl by partial differentiation of the r.h.s. of equation 3.3 yields

f_{kl} = \frac{\exp(-u_k^2/2\alpha_k)\, J_{kl}}{\sqrt{2\pi \alpha_k}}.  (6.2)
Solving equations 6.1, one obtains δm_k(t) = δm_{k,1} exp(λ_1 t) + δm_{k,2} exp(λ_2 t), where the eigenvalues λ_1 and λ_2 of the 2 × 2 system (see equations 6.1) are both of order √K. Requiring that their real parts be negative yields a condition on τ of the form

\tau < \tau_L,  (6.3)

where τ_L is of order 1; its precise value depends on the system parameters. Since both λ_1 and λ_2 are of order √K, if τ < τ_L, small perturbations will decay with an extremely short time constant of order 1/√K. This is due to the strong negative feedback, of order √K, generated by the strong synaptic couplings.

6.2 Global Stability. The local stability condition in equation 6.3 guarantees that a perturbation smaller than O(1/√K) will die out. It is therefore
important to ask whether the balanced state is stable also to perturbations that are large compared to this order. However, such perturbations will generate a large disruption in the inputs u_k, of order √K; hence, linearization of the dynamic equations is inadequate. We therefore have to consider the nonlinear evolution of perturbations in the rates under equations 3.3. In fact, since the perturbation destroys the balance between excitation and inhibition, H(−u_k/√α_k) of equation 3.3 can be approximated by Θ(u_k); hence the evolution of the perturbations is described by

\tau_k \frac{d\, \delta m_k(t)}{dt} = -\delta m_k(t) + \Theta(\delta m_E - J_k\, \delta m_I) - m_k.  (6.4)
These equations are piecewise linear and therefore can be solved explicitly. One finds that the solution of these equations decays to zero provided that the inhibitory time constant satisfies

\tau < \tau_G,  (6.5)

where

\tau_G = J_E \min\left\{ \sqrt{\frac{J_I m_I (1 - m_I)}{J_E m_E (1 - m_E)}},\; \frac{m_I}{m_E},\; \frac{1 - m_I}{1 - m_E} \right\}.  (6.6)
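For concreteness, the bound of equation 6.6 can be evaluated directly; the equilibrium rates in this snippet are illustrative stand-ins.

```python
# Evaluate the global stability bound tau_G of equation 6.6.
import numpy as np

JE, JI = 2.0, 1.8
mE, mI = 0.10, 0.08            # assumed equilibrium rates

tauG = JE * min(np.sqrt(JI * mI * (1 - mI) / (JE * mE * (1 - mE))),
                mI / mE,
                (1 - mI) / (1 - mE))
print(f"tau_G = {tauG:.3f}   (balanced state globally stable for tau < tau_G)")
```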
In conclusion, the global stability condition guarantees that starting from arbitrary initial values m_k(0), the population rates eventually will approach the balanced regime characterized by local fields u_k, which are of order 1 and not √K. In other words, the rates will deviate from the values of the balanced fixed point by at most O(1/√K) quantities. Whether they will actually approach this fixed point or will converge to a limit cycle around it depends on the local stability condition—equation 6.3. Depending on the system parameters, τ_G may be greater or smaller than τ_L. Figure 8 shows the evolution of m_E and m_I in a network with K = 1000, for τ = 1.3, when the network starts far away from the balanced state. The initial evolution is similar to the global dynamics. It converges to the neighborhood of the balanced fixed point in an oscillatory manner characteristic of the dynamics of equation 6.4. In Figure 8B, we show the late portion of the dynamics, which corresponds to the local dynamics (see equations 6.1). For the parameters used in this figure, the large K critical inhibitory time constants are τ_L = 1.61 and τ_G = 1.50. To illustrate the region of stability of the balanced state, we have calculated the phase diagram of the network in terms of two parameters: (1) the inhibitory time constant τ and (2) the ratio between the external input into the inhibitory population and the external input into the excitatory population. We have chosen to scale m_0 so that the excitatory population rate is held fixed. The results are shown in Figure 9, where both the local and global stability lines are presented.
[Figure 8 here: panels A and B, m_I vs. m_E]
Figure 8: Evolution to the stable fixed point. The average inhibitory rate is plotted against the average excitatory rate. (A) The evolution of the rates when the rates are initialized far from their steady-state values. (B) A close-up view of the approach to the fixed point. Parameters: τ = 1.3. The other parameter values as in Figure 3.
For these parameters, τ_L is always smaller than τ_G.

6.3 Regimes of Instability. Stability of the balanced state requires that τ be smaller than both τ_L and τ_G. It is of interest to consider what happens if this condition is not fulfilled.

1. Unbalanced limit cycle: τ > max{τ_L, τ_G}. In this case equations 3.3 possess a stable unbalanced limit cycle—a stable oscillatory solution with u_k(t) of order √K. This is shown in Figure 10A.

2. Balanced limit cycle: τ_L < τ < τ_G. In this case, perturbations that are of order 1 will decrease until they are of order 1/√K, while perturbations that are small compared to 1/√K will increase until they are of order 1/√K. Since there are no fixed points with δm_k of order 1/√K, this means that there has to be a stable limit cycle with an amplitude of order 1/√K. Thus, in this regime, the system converges to a limit cycle that maintains the approximate balance between excitation and inhibition. This is described schematically in Figure 10B.

3. Balanced fixed point with shrinking basin: τ_G < τ < τ_L. In this case, perturbations of order 1 go to a global limit cycle, while perturbations much smaller than 1/√K evolve to the fixed point. There must be an unstable limit cycle with amplitude of order 1/√K that separates perturbations that go to the global limit cycle from perturbations that go to the fixed point.
[Figure 9 here: τ vs. I/E, showing stable and unstable regions]
Figure 9: Critical time constants τG (solid line) and τL (dashed line) as a function of I/E. The external rate was adjusted to keep the excitatory activity level constant at mE = 0.1. I/E was varied from 0 to (JI − mE )/(JE − mE ). For this range of I/E, mI varies from mE /JI to 1. Parameters: E = 1, JE = 2.0, and JI = 1.8.
7 Inhomogeneous Thresholds

So far we have considered networks of identical neurons, except for their connectivity. Real neuronal systems exhibit a substantial inhomogeneity in single neuron properties. It is therefore important to consider how such inhomogeneities affect the behavior of our system. We will model the inhomogeneity by a variability in the thresholds of the neurons. Inhomogeneities in the local thresholds may have a particularly strong effect in a balanced state with low mean activity. The reason is that the intrinsic fluctuations are all generated by feedback from the network activity. Hence, they decrease in amplitude as the mean activity in the network drops. In particular, under these conditions, the intrinsic temporal fluctuations may not be of sufficiently large amplitude to overcome the quenched dispersion of local thresholds. Therefore, the important issue we address here is whether the balanced state remains temporally fluctuating in the limit of low mean activity in the presence of inhomogeneous thresholds, or whether it becomes a frozen state. We will show that the answer to these questions depends not only on the width of the threshold distribution but also on the form of its tail.
[Figure 10 here: panels A and B, m_I vs. m_E]
Figure 10: Different scenarios when the fixed point is unstable. (A) A case where τ is larger than both τ_G and τ_L. Here the rates evolve to the global limit cycle, where the amplitude of the oscillations is of order 1. The solid line shows the evolution when the network is initiated outside the limit cycle; the dashed line corresponds to the trajectory for initial rates inside the limit cycle. Parameters as in Figure 8, except τ = 1.8. (B) Shows schematically a case where τ_L < τ < τ_G. The network evolves to a limit cycle with an amplitude of order K^{−1/2}. The figure shows the evolution of the rates with initial conditions far from the fixed point. The insert shows an expanded view of the area around the fixed point, with the trajectory of rates starting outside the limit cycle (solid line) and the trajectory of a network that was initiated close to the fixed point (dashed line).
We denote the local threshold of a neuron by θ_k^i + θ_k, where θ_k is the population-averaged threshold and θ_k^i is a quenched random variable with zero mean. We will call θ_k^i the local threshold. The mean activity rate of neurons in the kth population that have a local threshold θ is

m_k(\theta) = H\left( \frac{\theta - u_k}{\sqrt{\alpha_k}} \right),  (7.1)

and hence the population-averaged rate is

m_k = \int d\theta\, P(\theta)\, H\left( \frac{\theta - u_k}{\sqrt{\alpha_k}} \right),  (7.2)
where P(θ) denotes the quenched distribution of θ, and u_k and α_k are given as before by equations 3.8, 3.9, and 3.10. Note that we have absorbed the mean threshold, θ_k, in the definition of u_k (see equations 3.8 and 3.9).

7.1 Distribution of Thresholds with Long Tails. We first consider a distribution with a long tail of low thresholds. A concrete example is

P(\theta) = \frac{1}{\Delta \sqrt{2\pi}} \exp\left( -\frac{1}{2} (\theta/\Delta)^2 \right).  (7.3)
In this case, the spatial fluctuations in the inputs (relative to thresholds) consist of two gaussian terms. One is induced by the random connectivity and has a variance α_k, and the other is induced by the thresholds and has a variance Δ². The balance conditions that determine the population rates (equations 4.3 and 4.4) still hold. In addition,

m_k = H\left( \frac{-u_k}{\sqrt{\alpha_k + \Delta^2}} \right),  (7.4)

which determines u_k, and

q_k = \int Dx \left[ H\left( \frac{-u_k - \sqrt{\Delta^2 + \beta_k}\, x}{\sqrt{\alpha_k - \beta_k}} \right) \right]^2.  (7.5)
Now let us consider the limit of low mean rates, which is achieved by assuming that m_0 is small. For fixed Δ, if the mean rates become sufficiently low so that m_k ≪ 1, the intrinsic variances α_k and β_k can be neglected compared with Δ²; hence one obtains

m_k \approx q_k \approx H\left( \frac{-u_k}{\Delta} \right).  (7.6)

The fact that q_k ≈ m_k implies that the state is essentially frozen, namely,

m_k(\theta) \approx \Theta(u_k - \theta),  (7.7)

and, consequently, the distribution of mean rates has a distinct bimodal shape,

\rho_k(m) \approx (1 - m_k)\, \delta(m) + m_k\, \delta(m - 1), \quad m_k \ll 1,  (7.8)
as shown in Figure 11A. Thus, an unbounded threshold distribution has a relatively strong qualitative effect on the balanced state in the limit of low mean rate.

7.2 Bounded Distribution. We next consider the case of a bounded distribution of thresholds. As an example, we take a distribution of θ that is uniform between −Δ/2 and +Δ/2, and zero otherwise. In this case, equation 7.2 yields

m_k = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} d\theta\, H\left( \frac{-u_k + \theta}{\sqrt{\alpha_k}} \right).  (7.9)
To assess the effect of Δ, we analyze equation 7.9 in the low m_k limit. In this case, the solution for u_k is

u_k + \Delta/2 = O(\sqrt{m_k}).  (7.10)
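Equation 7.9 is a one-dimensional quadrature and can be evaluated directly. In the sketch below, u_k and α_k are treated as given numbers (illustrative values); the output shows how the rate falls as u_k drops below −Δ/2.

```python
# Sketch: population rate for a uniform threshold distribution (equation 7.9).
import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

H = lambda z: 0.5 * erfc(z / np.sqrt(2.0))
Delta, alpha = 0.2, 0.01                 # illustrative width and variance

def m_of_u(u):                           # equation 7.9
    val, _ = quad(lambda th: H((-u + th) / np.sqrt(alpha)),
                  -Delta / 2, Delta / 2)
    return val / Delta

for u in (-0.05, -0.10, -0.15, -0.20):
    print(f"u_k = {u:+.2f}   m_k = {m_of_u(u):.4f}")
```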
[Figure 11 here: panels A and B, density vs. rate/mean rate; panel C, number of cells vs. rate (Hz)]
Figure 11: Distribution of the activities of the cells in the excitatory population in the large K limit. (A) Distribution for a network of neurons with a gaussian distribution of thresholds. The distribution is shown for population-averaged rates m_E = 0.01 (solid line) and m_E = 0.1 (dashed line). The insert shows the divergence at m = 1 of the distribution for m_E = 0.01, with the density in arbitrary units. Parameter values: Δ = 0.2 and other values as in Figure 3. (B) Distribution of activity levels of the cells in the excitatory population in the large K limit for a network of neurons with a bounded distribution of thresholds. The distribution is shown for mean rates m_E = 0.01 (solid line) and m_E = 0.1 (dashed line). Parameter values as in A. (C) Firing rate distribution for neurons in the right prefrontal cortex of a monkey attending to a complex stimulus (light source and sound) and executing a reaching movement. The rates were averaged over the duration of events that showed a significant response. The average rate was 15.8 Hz.
Thus, the population rates adjust themselves so that the mean synaptic input is slightly below the smallest threshold in the population, θ_k − Δ/2; see equation 3.8. The small gap between the mean synaptic input and the minimal threshold is such that the temporal fluctuations of the network, with the low variance α_k, are sufficient to bring the neurons to threshold levels. Indeed, analyzing the rate distribution for this case, we find that it is unimodal with width √q_k, where

q_k \propto \Delta\, \alpha_k^{3/2}.  (7.11)
This means that the rate distribution is extremely broad and skewed. The full shape of the rate distribution is given by

\rho_k(m) \approx \frac{\sqrt{\alpha_k / 2\Delta^2}}{m \sqrt{|\log m|}}, \quad m_- < m < m_+,  (7.12)

and zero otherwise. The bounds on m are:

m_- \propto \exp(-\Delta^2/(2\alpha_k)),  (7.13)

m_+ \propto \Delta \sqrt{\alpha_k / |\log(\alpha_k)|} \gg m_k.  (7.14)
The results (see equations 7.9–7.14) show that in the case of a bounded threshold distribution, the temporal variability remains strong even in the limit of low mean rates. However, the inhomogeneity strongly affects the shape of the rate distribution, making it more skewed and broader. Figure 11B shows the results of a numerical calculation of the rate distribution for the excitatory population, with a uniform distribution of thresholds between −Δ/2 and Δ/2, for different values of the mean rates. Comparing Figures 4, 11A, and 11B, we see that for moderate mean rates, m_k = 0.1, Δ does not have a big effect on the shape of the distribution. However, when the network mean activity is lowered, the distribution peak shifts to values that are much smaller than the mean, while its tail extends to rates of the order of √m_E. In contrast, in the case of a homogeneous threshold, lowering the mean rates shifts the peak toward the mean and decreases the width of ρ(m) (see Figure 4). In the case of a gaussian distribution, lowering the mean rates creates a pronounced bimodal distribution, characteristic of a frozen state, as seen in Figure 11A. In general, for small m_k, a threshold distribution P(θ) will yield a rate distribution ρ_k for population k that is given by

\rho_k(m) = \sqrt{2\pi}\, P\left( -\sqrt{\alpha_k}\, (h(m) + \tilde{h}_k) \right) e^{h^2(m)/2},  (7.15)

where \tilde{h}_k \equiv h(m_k) is determined by

\int dm\, m\, \rho_k(m) = m_k.  (7.16)
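The distribution of equation 7.15 can be evaluated numerically. In the sketch below, ρ_k is normalized numerically, so the overall prefactor drops out, and h̃_k is found by a crude grid search on the mean-rate constraint of equation 7.16; all parameter values are illustrative.

```python
# Sketch: rate distribution induced by a gaussian threshold distribution,
# equations 7.15 and 7.16; normalization and the h_tilde search are ours.
import numpy as np
from scipy.special import erfcinv

alpha, Delta, mk = 0.05, 0.2, 0.05        # illustrative values
h = lambda m: np.sqrt(2.0) * erfcinv(2.0 * m)
P = lambda th: np.exp(-0.5 * (th / Delta) ** 2) / (Delta * np.sqrt(2 * np.pi))

m = np.linspace(1e-4, 1 - 1e-4, 4000)

def rho(ht):                              # shape of equation 7.15, normalized
    r = P(-np.sqrt(alpha) * (h(m) + ht)) * np.exp(h(m) ** 2 / 2)
    return r / np.trapz(r, m)

grid = np.linspace(-5.0, 5.0, 1001)
means = np.array([np.trapz(m * rho(ht), m) for ht in grid])
ht = grid[np.argmin(np.abs(means - mk))]  # enforce equation 7.16
r = rho(ht)
print(f"h_tilde = {ht:.2f}, mean rate = {np.trapz(m * r, m):.4f}")
```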
If P(θ) has tails that fall off as slowly as or more slowly than a gaussian, ρ_k will diverge at m = 0 and m = 1; if P(θ) falls off faster than a gaussian, ρ_k will
be negligible for m < m_− and for m > m_+, with, for small m_k, m_− ≪ m_k and m_k ≪ m_+ ≪ 1. In this case, ρ_k can be approximated by

\rho_k(m) \propto \frac{P\left( -\sqrt{\alpha_k} \left( \sqrt{2 |\log m|} - \tilde{h}_k \right) \right)}{m \sqrt{|\log m|}}  (7.17)

for m_− < m < m_+. Furthermore, P(−√α_k(√(2|log m|) − h̃_k)) varies only slowly with m for these rates. Thus, for a threshold distribution with a tail that falls off faster than a gaussian, the distribution of the rates goes to 0 at m = 0 and m = 1 and has a long power-law tail that extends up to a rate m_+ that is much larger than the average rate. In contrast, if the tails of the distribution fall off as slowly as or more slowly than a gaussian, the rate distribution will peak at m = 0 and m = 1 if the average rate is sufficiently low.

7.3 Experimental Rate Distribution. The above results make a clear prediction about the shape of the rate distribution in a local population of neurons with low mean rates. It seems reasonable to compare these predictions with the distribution of rates in cortical neuronal pools of behaving animals. Figure 11C presents an experimentally determined rate histogram of neurons in the right prefrontal cortex of a monkey (Abeles, Bergman, & Vaadia, 1988). The data were taken from time intervals when the monkey was attending to a variety of stimuli (light sources and sound) or executing simple reaching movements. The average rate (of the neurons that showed any activity during the time of measurement) was 15.8 Hz. The observed histogram has a distinct unimodal skewed shape with a tail extending up to 80 Hz. These results are consistent with the theoretical predictions of Figure 11B.

8 Chaotic Nature of the Balanced State

The strong temporal fluctuations of the neuronal activity in our model and the resultant fast decay of temporal correlations strongly suggest that the balanced state corresponds to a chaotic attractor. However, to justify characterizing this state as chaotic, we need to study the sensitivity of the dynamic trajectory to small perturbations in the initial conditions. If the network evolves to a chaotic attractor, small perturbations in the state of the network should grow at least exponentially. After some time, the state of the network is far from the state the network would have been in had it not been perturbed. This definition of chaos is technically inapplicable to a system with discrete degrees of freedom such as ours, since in this case the size of a perturbation of the system state is bounded from below by the discreteness of the system's state. In our case, the minimum perturbation is changing the state of a single neuron. Nevertheless, in the limit of large network size, we can consider such a perturbation as infinitesimal, as described below.
We consider two copies of the network. In one copy, the states of the neurons are given by σ_{1,k}^i(t); in the other, they are given by σ_{2,k}^i(t). Both networks have the same connection matrices J_{kl}^{ij} and the same update schedule. The networks get the same constant input m_0(t) = m_0 and are assumed to have reached a balanced state with the same population rates,

\frac{1}{N_k} \sum_{i=1}^{N_k} \langle \sigma_{p,k}^i(t) \rangle = m_k \quad \text{for } p = 1, 2.  (8.1)

The distance between the network states at time t is defined as

D_k(t) = \frac{1}{N_k} \sum_{i=1}^{N_k} \left\langle \left( \sigma_{1,k}^i(t) - \sigma_{2,k}^i(t) \right)^2 \right\rangle = \frac{1}{N_k} \sum_{i=1}^{N_k} \left\{ \langle \sigma_{1,k}^i(t) \rangle + \langle \sigma_{2,k}^i(t) \rangle - 2 \langle \sigma_{1,k}^i(t)\, \sigma_{2,k}^i(t) \rangle \right\}.  (8.2)
Here the angular brackets do not mean an average over time but an average over all initial conditions of the two networks, subject to the constraints that each individual network is at equilibrium (e.g., its m_k and q_k have the equilibrium values) and that the distance between the initial states of the two networks equals a given D_k(0). If the network is in a chaotic state, the distance D_k(t) of the cells in population k, defined by equation 8.2, should grow at least exponentially for small D_k. The maximum Lyapunov exponent λ_L, defined by

\lambda_L \equiv \lim_{D_k \to 0} D_k^{-1} \frac{dD_k}{dt},  (8.3)
should be positive. Note that in calculating λ_L, we will first take the large N limit of D_k and then the D_k → 0 limit. To write the dynamics of D_k, it is useful to write D_k(t) as

D_k(t) = 2(m_k - Q_k(t)),  (8.4)
where Q_k(t) denotes the overlap of the two trajectories. In appendix A, we show that Q_k(t) satisfies an equation similar to that of q_k(τ),

\tau_k \frac{dQ_k}{dt} = -Q_k + \int Dx \left[ H\left( \frac{-u_k + \sqrt{\gamma_k(t)}\, x}{\sqrt{\alpha_k - \gamma_k(t)}} \right) \right]^2,  (8.5)
with u_k and α_k as above, and γ_k(t) given by

\gamma_k(t) = \sum_{l=1}^{2} (J_{kl})^2\, Q_l(t) = Q_E(t) + J_k^2 Q_I(t).  (8.6)
This equation has two stationary solutions. One is

Q_k = m_k,  (8.7)

which corresponds to fully locked trajectories. This solution is unstable, as will be shown below. The stable fixed point is

Q_k = q_k,  (8.8)
which corresponds to fully desynchronized trajectories, so that at long times the correlations between the two trajectories at equal times are those induced by the time-independent average activities. Starting from any nonidentical states, the two trajectories eventually will desynchronize completely. To find the initial rate of divergence, we expand equation 8.5 for small D_k and find that to leading order, the distances satisfy

\tau_k \frac{dD_k}{dt} = \frac{2\, e^{-u_k^2/2\alpha_k}}{\pi \sqrt{2\alpha_k}} \sqrt{\alpha_k - \gamma_k}.  (8.9)
Since α_k − γ_k ∝ D_k, equation 8.9 has a growing solution even if D_k(0) = 0. This implies that the Lyapunov exponent λ_L is infinitely large in the balanced state. Figure 12 shows the evolution of D_E. D_E increases rapidly to the equilibrium value D_E = 2(m_E − q_E) for an arbitrarily small positive initial value. This should be contrasted with systems with finite positive Lyapunov exponents, where the initial rate of growth depends on the magnitude of the initial perturbation. The divergence of λ_L in our system is related to the discreteness of the degrees of freedom, which implies an infinitely high microscopic gain: a small change in the inputs to a cell can cause a finite change in its state.

9 Tracking of Time-Dependent Input

We have shown that for a large range of parameters, a network with synaptic strengths of order 1/√K will evolve to a balanced state, and we investigated some of the characteristics of this state. But so far we have not addressed the question of potential functional advantages of this state. Why should a network generate an excitatory input that is much larger than the threshold input and then counterbalance this with a nearly equally large inhibitory input? If we consider the metabolic costs of such large currents, it seems clear that a biological system would not choose such a mechanism unless it has some advantages over other mechanisms.
[Figure 12 here: distance vs. t]
Figure 12: Evolution of the distance DE starting from a small initial distance in the large K limit. Parameters as in Figure 3 and τ = 0.9.
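The divergence can also be illustrated by integrating equations 8.5 and 8.6 directly. In the sketch below (our construction), the equilibrium rates are illustrative placeholders, u_k and α_k follow from m_k = H(−u_k/√α_k), and the gaussian integral is done by Gauss-Hermite quadrature; the output shows D_E growing from an almost fully locked initial condition, as in Figure 12.

```python
# Sketch: growth of the distance D_k = 2(m_k - Q_k) (equation 8.4) obtained by
# Euler integration of equations 8.5 and 8.6.
import numpy as np
from scipy.special import erfc, erfcinv

H = lambda z: 0.5 * erfc(z / np.sqrt(2.0))
h = lambda m: np.sqrt(2.0) * erfcinv(2.0 * m)

JE, JI = 2.0, 1.8
mE, mI = 0.10, 0.08                       # assumed equilibrium rates
aE, aI = mE + JE**2 * mI, mE + JI**2 * mI  # alpha_k (equation 5.9)
uE, uI = -np.sqrt(aE) * h(mE), -np.sqrt(aI) * h(mI)

x, w = np.polynomial.hermite_e.hermegauss(80)
w = w / np.sqrt(2.0 * np.pi)              # weights for the measure Dx

def F(u, a, g):                           # integral on the r.h.s. of eq. 8.5
    z = (-u + np.sqrt(g) * x) / np.sqrt(a - g)
    return np.dot(w, H(z) ** 2)

QE, QI = mE - 1e-4, mI - 1e-4             # copies start almost fully locked
dt, tauE, tauI = 0.01, 1.0, 0.9
for step in range(801):
    gE = QE + JE**2 * QI                  # gamma_E (equation 8.6)
    gI = QE + JI**2 * QI
    if step % 160 == 0:
        print(f"t = {step * dt:4.1f}   D_E = {2 * (mE - QE):.5f}")
    QE += dt / tauE * (-QE + F(uE, aE, gE))
    QI += dt / tauI * (-QI + F(uI, aI, gI))
```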
In this section we present one possible advantage of the balanced network. We have already shown that perturbations in the network rates which are small compared to 1/√K die out in a time on the order of 1/√K. Therefore, the network is very stable against small fluctuations in the rates. We now consider the consequences of this for the response of the system to a time-dependent change in the external driving force m_0. If the external activity m_0 changes suddenly by a small amount, on the order of 1/√K, the equilibrium rates will change by an amount that is of the same order. So just after the change in external rate, the network rates differ slightly from the equilibrium rates. They will approach the new equilibrium rates on a time scale that is of the order 1/√K, so the network rates adapt very fast to a sudden change in m_0. This means that if m_0 changes continuously with time, the network rates will track m_0 very fast, provided that m_0 does not change too rapidly. To quantify the speed of the tracking of a balanced network, we compare the network rates with the rates of a hypothetical network that tracks changes in the external rates instantaneously. In such a network, the rates m_k^∞ satisfy m_k^∞(t) = m_k(m_0(t)), where m_k(m_0) is the
equilibrium rate for m_0(t) = m_0. These rates are given by

m_k^\infty(t) = H\left( -\frac{u_k^\infty(t)}{\sqrt{\alpha_k^\infty(t)}} \right),  (9.1)

with

u_k^\infty(t) = \sqrt{K} \left( J_{k0}\, m_0(t) + \sum_{l=1}^{2} J_{kl}\, m_l^\infty(t) \right) - \theta_k  (9.2)

and

\alpha_k^\infty(t) = \sum_{l=1,2} (J_{kl})^2\, m_l^\infty(t).  (9.3)
Note that to leading order in K, m_k^∞ satisfies the balance condition

m_k^\infty(t) = A_k\, m_0(t).  (9.4)

However, equations 9.1 through 9.3 also take into account the 1/√K corrections in m_k^∞(t). We now assume that

m_k(t) = m_k^\infty(t) + m_k^1(t)/\sqrt{K},  (9.5)
namely, that the deviation from perfect tracking of the instantaneous rates is only of order 1/√K. The rates m_k satisfy equation 3.3. To leading order in K, this is

\tau_k \frac{dm_k^\infty(t)}{dt} = -m_k^\infty(t) + H\left( -\frac{u_k^\infty(t) + \sum_l J_{kl}\, m_l^1(t)}{\sqrt{\alpha_k^\infty(t)}} \right), \quad k = 1, 2.  (9.6)

Using equation 9.4 we obtain

A_k \left( \tau_k \frac{dm_0(t)}{dt} + m_0(t) \right) = H\left( -\frac{u_k^\infty(t) + \sum_l J_{kl}\, m_l^1(t)}{\sqrt{\alpha_k^\infty(t)}} \right), \quad k = 1, 2,  (9.7)

which determines the small deviations m_k^1(t)/√K as functions of the time-dependent drive m_0(t). Since H(x) is between 0 and 1, equations 9.7 have a solution only for

0 < m_0 + \tau_k \frac{dm_0}{dt} < 1/A_k.
This implies that almost perfect tracking occurs for rates of change of the external input that obey the following bounds:

\max_{k=1,2} \left( -\frac{m_0}{\tau_k} \right) < \frac{dm_0}{dt} < \min_{k=1,2} \frac{1}{\tau_k} \left( \frac{1}{A_k} - m_0 \right).  (9.8)
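In code, the tracking window of equation 9.8 reads as follows; the gains A_k (defined through equation 9.4) and the time constants are illustrative values.

```python
# Tracking window of equation 9.8.
AE, AI = 1.2, 0.9          # assumed balanced-state gains m_k = A_k m_0
tauE, tauI = 1.0, 0.9      # update time constants
m0 = 0.1

lower = max(-m0 / tauE, -m0 / tauI)
upper = min((1.0 / AE - m0) / tauE, (1.0 / AI - m0) / tauI)
print(f"near-perfect tracking for {lower:.3f} < dm0/dt < {upper:.3f}")
```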
To understand these results qualitatively, let us consider a system in the balanced state with a fixed m_0, where at time t = t_0, m_0 is suddenly changed to m_0 + δm_0. We assume that δm_0 is much smaller than m_0 but δm_0 √K is of order 1. This is shown in Figure 13, where m_0 is increased by a series of small steps. Because the input is √K m_0(t), the small change in m_0 initially causes a change of order 1 in the total input. Hence, the probability P_k that the cells in the kth population, which are updated at time t_0, will go to the active state is initially increased by a large amount. This increase is denoted as ΔP_k; ΔP_k is of order 1, as shown by the dashed curve in the figure. In fact, this probability is given by the r.h.s. of equation 9.4, which differs substantially from the previous equilibrium probability, A_k m_0. This initial increase in the number of active cells causes a large inhibitory feedback, which causes P_k to decrease quickly to its new equilibrium value, which is only slightly increased from its original equilibrium value, as seen in Figure 13. Thus, the initial response is highly nonlinear due to the initial disruption of the balance in the inputs and the highly nonlinear dynamics of single cells. This initial large response causes a fast rate of increase in the population rates, since δm_k ≈ τ_k^{-1} dt ΔP_k, implying that δm_k reaches the value A_k δm_0 in a time of order τ_k δm_0 ≈ τ_k/√K; see the dotted line in Figure 13. The final change in the population rates linearly follows the change in the external input, as required to maintain the balance between excitation and inhibition. The limitation on the change in the external rate is readily explained by the maximum increase (decrease) in the network rate that the microscopic dynamics allows. The fastest the network rates can increase (decrease) is by putting all newly updated cells in the active (passive) state, that is, P_k = 1 (P_k = 0), so that the change in the network rates is bounded by
dmk < 1 − mk . dt
(9.9)
If the external rates increase (decrease) faster than the bound in equations 9.8, the network will not stay in the balanced state during the rate change, and u_k is of order √K. Consequently, the input is above (below) the threshold for all cells of the kth population that are updated, and all updated cells are in the active (passive) state. To compare the tracking capabilities of balanced networks with those of an unbalanced network, we consider a network of threshold linear neurons with synapses of strength J_kl/K for the internetwork connections, J̃_k0/K for the strengths of the synapses projecting from the external population, and the thresholds T_k chosen so that the equilibrium rates of this network are the same as those for the balanced network.
[Figure 13 here: m_E vs. t]
Figure 13: Reaction of the excitatory population to input that is increased by small steps. The solid line shows the activity of a network that responds instantaneously. The dashed line shows the probability P_E of updating to the active state for neurons that happen to update. The dotted line represents the population-averaged activity of the population in a balanced network. Parameter values: K = 1000, τ = 0.9. Other parameters as in Figure 3.
We choose the same neuronal time constants as in the balanced network. In this network, the rates satisfy

\tau_k \frac{dm_k}{dt} = -m_k + \left( \tilde{J}_{k0}\, m_0 + J_{kE}\, m_E + J_{kI}\, m_I - T_k \right)_+,  (9.10)
with (x)_+ = (x + |x|)/2. If we set m_k(t) = m_k^∞(m_0(t)) + m_k^1(t), the difference between the network rates and the rates of a perfectly tracking network, m_k^1(t), satisfies

\tau_E \frac{dm_E^1}{dt} = (J_{EE} - 1)\, m_E^1 + J_{EI}\, m_I^1 - \tau_E A_E \frac{dm_0}{dt},  (9.11)

\tau_I \frac{dm_I^1}{dt} = J_{IE}\, m_E^1 + (J_{II} - 1)\, m_I^1 - \tau_I A_I \frac{dm_0}{dt}.  (9.12)

In other words, m_k^1 will be of order 1.
[Figure 14 here: excitatory activity vs. t]
Figure 14: Population-averaged activity of the excitatory cells for an input that varies with time. The input is constant from time t = 0 to t = 1. Between t = 1 and t = 2, the input increases linearly. After t = 2, the input is again constant. The solid line shows the excitatory rate of a network that responds infinitely quickly; the dashed line shows the response of the balanced network. Also shown is the response of an unbalanced network of threshold linear neurons (dotted line). Parameters for the balanced network as in Figure 13. For the unbalanced network, see the text.
Thus, in the unbalanced network, the difference between the network rates and the rates in a perfectly tracking network will be of order √K times larger than in a balanced network. Figures 14 and 15 show a comparison of the tracking capabilities of a balanced network with K = 1000 and an unbalanced network with threshold linear units. Between t = 0 and t = 1, the networks are at equilibrium. In Figure 14 the external activity is ramped between t = 1 and t = 2,

m_0(t) = m_0 + v_0 t,  (9.13)
and after t = 2, m_0 is kept constant again. The graph shows m_E^∞ and m_E for both networks plotted against time. Clearly the balanced network is much better than the unbalanced network in tracking the change in the external rate. Similar results are seen in the case of a sinusoidal external input (see Figure 15). Finally, in Figure 16, we plot the rate of change of m_E versus v_0 for the ramped input case. The results of this, as well as Figures 14 and
[Figure 15 here: excitatory activity vs. t]
Figure 15: Average rate of the excitatory population for a sinusoidally varying input. The rates of the excitatory population in an instantaneously responding network (solid line), a balanced network (dashed line), and an unbalanced network (dotted line) are shown. Parameters as in Figure 13.
15, are based on a full finite K solution of the dynamics. We also show in Figure 16 the large K predictions, according to which there is a sharp upper bound for fast tracking at a value of v_0 given in equation 9.8.

10 Discussion

10.1 Asynchronous and Synchronized Chaos. The purpose of our theory is to identify the different mechanisms by which deterministic dynamics generates strongly irregular states in large neural networks, in which each cell receives input from many other cells. To understand these mechanisms from a theoretical point of view, it is important to study the network behavior in the limits of large system size N and large connectivity index K. In a finite network with fluctuating dynamics, there will always be some degree of synchrony and some compensation between inhibition and excitation. It is thus impossible to single out balancing between excitation and inhibition as a mechanism for variability separate from synchronized chaos (Bush & Douglas, 1991; Hansel & Sompolinsky, 1992, 1996). It is only in the limit of large N, where states with synchrony that does not vanish
[Figure 16 here: v_E vs. v_0]
Figure 16: Rate vE with which the average excitatory rate changes as a function of the rate of change v0 of the external input. The solid line shows vE for a network with K = 1000. The dashed line shows the same for the large K limit. Parameters as in Figure 13.
in this limit can be distinguished from states where the synchrony does vanish, that the different mechanisms become clearly separate. Similarly, the importance of the limit of large K is that for fixed finite K, network parameters may be tuned so that fluctuations in individual synaptic inputs generate fluctuations in the membrane potential of the postsynaptic cells. These fluctuations can be due to stochastic synaptic failures or variability in the presynaptic cells from within or outside the network. In other words, for a finite fixed K, the issue of balancing between excitation and inhibition is a quantitative one. Only in the large K limit can the distance between the net input and the threshold clearly be separated, in the balanced state, from the corresponding distance for the excitatory and inhibitory components. The outcome of the present theory combined with our previous studies is that chaotic states in large, highly connected networks can be classified as synchronized chaos and asynchronous chaos. Synchronized chaos is likely to occur in fully connected networks, where K is proportional to N, yielding a strong overlap between inputs to different neurons. In this case, the chaotic state is characterized by cross-correlations between neuronal pairs whose amplitude is of order 1 even in the limit of N → ∞, thereby creating strong
fluctuations of the common feedback. Thus, synchronized chaos can be viewed as resulting from an instability in the dynamics of the macroscopic degrees of freedom that comprise the common fluctuating mean field. Asynchronous chaotic states are distinguished by weak cross-correlations. In the present case, this is due to the sparseness of the connections. More specifically, in our networks, the amplitude of the cross-correlations has a broad distribution in the network due to inhomogeneity in the connectivity. Most of the cross-correlations are of the order 1/N, where N is the network size. The maximal value of the cross-correlations occurs for pairs that are directly connected, and this cross-correlation is of the order of the strength of the synapse, O(1/√K). Thus, chaos in this state is the result of instability in local degrees of freedom, similar to chaos in asymmetric spin glasses and neural networks.

10.2 Balanced State with Strong or Weak Synapses. The scaling of connection strength in our theory of the balanced state is different from that in conventional mean-field theories of highly connected networks. Most mean-field theories of large, highly connected neural networks assume that each connection is scaled as the inverse of the mean number of inputs to a neuron, K. In contrast, we scale the connections as 1/√K. This aspect, together with the relative sparseness of the connections and the asynchrony of the dynamics, yields a highly irregular dynamical state, despite the fact that the single-neuron dynamics in our model is the simple threshold update of binary units. The presence of relatively large connections is again analogous to the scaling of connections in highly connected spin glasses and random neural networks, where the interactions have to scale as the inverse square root of the connectivity index (Derrida, Gardner, & Zippelius, 1987; Sompolinsky, Crisanti, & Sommers, 1988). In Sompolinsky et al. (1988) the network is a fully connected asymmetric analog circuit with connections that are independent random variables with zero mean. The connections possess a square-root scaling with the number of inputs, as is natural for mean-field spin glasses (Mezard, Parisi, & Virasoro, 1987). In Derrida et al. (1987), the connectivity is randomly sparse, as in our model. The connections store random memories so that in the limit of large K (and a correspondingly large number of stored patterns), they are effectively random in sign and exhibit chaotic dynamics similar to the asymmetric spin glass. In contrast, in our case the connections are not random in sign but are organized in an excitatory-inhibitory two-population architecture. Consequently, the balance between excitation and inhibition that gives rise to the temporally disordered state is entirely a dynamic effect. Our results should be contrasted with a conventional, fully connected network with the same simple two-population architecture and the more conventional 1/K scaling of connections. Such networks converge to either static states or globally coherent limit cycles (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Grannan
et al., 1992; Hansel et al., 1995; van Vreeswijk, 1996; Wilson & Cowan, 1972). An important consequence of our assumption of relatively strong synaptic connections concerns the size of the external input to the local network. According to our theory, the balanced state is robust only when the DC external input to the local network is large—of the same order as the local excitatory and inhibitory feedback, and much larger than the net synaptic input to a cell. In the notation of our model, the external input to an excitatory cell is E m_0 √K, whereas the net input to this cell, u_E, is smaller than the external input by a factor of the order of 1/(m_0 √K), where m_0 is the rate of an external input cell and is assumed to be much larger than 1/√K. Figure 3 shows that lowering the strength of the external input—that is, reducing m_0—will turn off the activity of the network. In fact, equation 3.3 implies that maintaining balanced activity in the case of an external input that is only of order 1 requires the vanishing of the denominators in equations 4.3 and 4.4, which means that the interaction strengths have to be fine-tuned to a very narrow range. Because of the importance of the scaling of the synapses in our theory, it is very informative to consider the behavior of our model if we use a conventional scaling of synapses, in which each synapse scales as 1/K—the weak synapses scenario. In this scenario, each component of the synaptic inputs, including the total external input to a cell, is of order 1. The solution of this model (van Vreeswijk & Sompolinsky, 1997) shows that when K is not large, the network settles in a strongly disordered state. This is not surprising given that the connectivity is randomly asymmetric and there is no danger of averaging out of the fluctuating inputs to a cell. However, the fate of the temporal variability as K is increased is highly sensitive to the presence of local inhomogeneity. If the neurons have the same threshold, the chaotic state is maintained as K increases. In this case, the population rates adjust their value so that the net input is close to the threshold level, within a distance of the order of 1/√K. This is shown in Figure 17. This figure displays the time course of the various synaptic inputs to a cell, evaluated by simulating a sample from the statistics predicted by the mean-field solution with the weak-synapses scaling. The results of this figure should be contrasted with the behavior of the strong synapses scenario (see Figure 6). In Figure 17 the variability is caused by the fact that the cell is always hovering close to its threshold. In contrast, in the case of Figure 6, the distance between the net input and the threshold is not small compared to the distance between threshold and rest. In this case, the variability is caused by the presence of excitatory and inhibitory inputs, each of which is much larger than the threshold. Despite the difference in behavior between the two scenarios, the dynamic mechanism for these balanced chaotic states is the same. In both cases, the distance between the net input and the threshold is smaller (by a factor of 1/√K) than the distance to threshold of the excitatory and inhibitory components.
[Figure 17 here: input traces and spike times vs. t]
Figure 17: Inputs to an excitatory cell in a network with synaptic strengths J_kl/K. The total excitatory input (upper trace), the total inhibitory input (lower trace), and the net input (middle trace) are shown in the upper panel. At the bottom, the times when the cell switches from the passive to the active state are indicated. Parameters: K = 1000, E = 2.0, I = 1.6, J_E = 2.0, J_I = 1.8, θ_E = 1.0, θ_I = 0.8, τ = 0.9, and m_0 = 0.503.
Thus, it would seem that choosing between these scenarios is largely a matter of biological interpretation. However, there are some qualitative differences between the two scenarios. Since the synaptic inputs are all of the same order as the threshold, it is harder to obtain states with low mean rates in both the excitatory and inhibitory populations. To achieve low rates, the ratio between the external inputs to the two populations (I/E in the notation of equations 2.4 and 2.5) has to be close to the ratio of their thresholds, θ_I/θ_E. More important, the weak-synapses scenario of Figure 17 breaks down in the presence of inhomogeneity in the local thresholds. In this case, the population rates are incapable of accommodating the different thresholds. As a result, in the case of inhomogeneous thresholds, when K increases, the network state becomes increasingly frozen; neurons with high thresholds become inactive, whereas neurons with low ones fire close to saturation. This freezing occurs as soon as the width of the inhomogeneity in the thresholds is larger than 1/√K. In contrast, in the scenario of
strong synapses, the state becomes frozen only when the inhomogeneity is large compared to 1. Note that in the case of Figure 17, the external input is of the same order as the net input to the cell. Equally important is the fact that in the weak-synapses scenario with homogeneous networks, the collective time constants are of the same order as the single-cell time constant, so that the network will not exhibit the phenomenon of fast tracking predicted by our theory. Finally, the model can be generalized to one with synaptic strengths that scale as K^{−α}, with 0 < α < 1. Of course, these models can be distinguished from the present model only in the large K limit. In this limit, the net average inputs into the populations scale as K^{1−α}, while the quenched and temporal fluctuations in the inputs scale as K^{1/2−α}. Therefore the leading order in the inputs has to cancel, leading to the balance condition. For any α, this leads to asynchronous chaotic activity in a homogeneous network, similar to the case α = 1/2. However, if we introduce a distribution of the thresholds with width of order 1, we have to distinguish two regimes, apart from the case α = 1/2 of our model. If α > 1/2, the fluctuations in the input decrease with K, so that the network goes to the frozen state in the large K limit. On the other hand, if α < 1/2, the fluctuations grow with K, and therefore the inhomogeneity in the thresholds becomes negligible in the large K limit. Thus, a network with inhomogeneous thresholds will act in the same way as a network with homogeneous thresholds. Specifically, for low rates, the rate distribution will become narrow. Thus, only for a network with synaptic strengths of order 1/√K is there a nontrivial interaction between the fluctuations in the input and the threshold inhomogeneities.
10.3 Comparison with Other Network Models. Some of our results are consistent with those of the integrate-and-fire network models of Tsodyks and Sejnowski (1995) and Amit and Brunel (1997a, 1997b). Although constructing an exact mean-field theory for the integrate-and-fire dynamics similar to the one presented here for binary units is much more difficult, we believe that most of the predictions of our mean-field theory are applicable to the integrate-and-fire dynamics as well, provided that the same connectivity architecture and scaling of parameters with N and K are used. However, a direct comparison between our theory and the results of Tsodyks and Sejnowski (1995) and Amit and Brunel (1997a, 1997b) is difficult because of their introduction of stochasticity in the network, the combination of mechanisms such as resetting the potential close to threshold, and the lack of a full, explicit specification of the scaling of parameters with N and K. Tsodyks and Sejnowski (1995) show numerically that their model is capable of "fast switching" in response to a fast change in the external stimulus. This may be related to the fast tracking predicted in our model. The fact that our model does not respond quickly to a sudden switching of the stimulus (see Figure 13) is probably a result of the dynamics of binary neurons. However,
the switching time constant observed in Tsodyks and Sejnowski (1995) is of the same order as the single-cell integration time constant, while the fast tracking should occur on a much shorter time scale. In recent numerical simulations of integrate-and-fire networks, Amit and Brunel (1997b) showed that the strength of the average cross-correlations decreases as N increases (keeping the connectivity index constant). However, they do not show whether, as N increases, the variability in the single cell remains the same. If this were the case, their results would be consistent with our predictions regarding the asynchronous chaotic state.

10.4 Biological Implications. With regard to biological systems, we should reemphasize that temporal irregularity is most likely the result of several mechanisms, including those mentioned in section 1. Our discussion makes it clear that even with regard to deterministic network mechanisms in a finite system, the temporally irregular state is likely to be at best intermediate between the synchronized and the balanced chaotic states. An important question is whether the external input is large relative to the net input to a cortical cell. Recent experimental findings of Ferster, Chung, and Wheat (1996) in cat primary visual cortex suggest that the input from the lateral geniculate nucleus (LGN) to layer 4 cortical cells is in fact a fraction of the net input. Stratford, Tarczy-Hornoch, Martin, Bannister, and Jack (1996) show that the total strength of the LGN synapses is about 2.5 to 3 times smaller than the total strength of the excitatory feedback synapses from layer 4 cells; however, this study does not measure the strength of the feedback from the inhibitory interneurons, so it does not allow for the estimation of the net feedback. Further experimental clarification of this issue is called for. Measurements of the distribution of time-averaged rates within a local population of neurons, and of the change in its shape when the overall level of response increases, similar to those of Figure 11B, would be an interesting test of the underlying statistical characteristics of the network spatiotemporal fluctuations.

10.5 Future Work. On theoretical grounds, our work raises several interesting issues worth pursuing. First, it would be important to know whether the theory of the balanced state applies also to networks with more interesting connectivity architectures. Thus, it would be interesting to extend our theory to networks that model associative memory or hypercolumns in visual cortex. It is important to study the consequences of nonlinearities of synaptic summation, for example, by treating synaptic inputs as conductance changes instead of currents. In considering the functional implications of our theory, it is important to distinguish between the sensitivity of a chaotic autonomous system to changes in its initial condition and its ability to lock to a changing external drive. The analysis of the tracking capabilities of our network shows that the macroscopic state of the network responds fast to changing input. In the
case of a homogeneous input, it can be shown that the microscopic state is not tightly locked to the changing stimulus. On the other hand, preliminary analysis (van Vreeswijk & Sompolinsky, 1998) shows that in the case of spatially inhomogeneous input fluctuations, the microscopic state of the network will tightly lock to the stimulus temporal variations. These findings are consistent with recent observations that cortical cells respond highly reliably to the fluctuations in the stimulus (Bair & Koch, 1996; Britten, Shadlen, Newsome, & Movshon, 1992). Elucidation of the computational aspects of balanced states in neuronal networks is a challenging issue.

Recently, Markram and Tsodyks (1996) have shown that the synapses between cortical pyramidal cells show a marked degree of depression. It should be investigated how such dynamical synapses affect the balanced state. If one assumes synaptic depression in the excitatory-to-excitatory synapses only, and facilitation in the synapses from the inhibitory to the excitatory and from the excitatory to the inhibitory populations (Thomson, West, & Deuchars, 1995; Thomson, West, Hahn, & Deuchars, 1996), the equilibrium rates in the network decrease, relative to those in a network without facilitation. This synaptic depression and facilitation also has the effect that the constraints on the synaptic strengths (see equations 4.9 and 4.10) can be relaxed. Because the synaptic depression and facilitation become effective only on a time scale that is as slow as or slower than the membrane time constant, the response of such a network to an external input that changes with time is more complicated than in the model studied here. If the input is suddenly increased by a small amount, the network rates first increase, in a time of order $1/\sqrt{K}$, to the rates the network would have in equilibrium if the synaptic strengths were unchanged; then, on a much slower time scale, the rates decrease due to the change in the synaptic strengths. Since in the balanced state the finite-K corrections to the rates are determined by both the first and second moments of the input, the change in rate due to synaptic depression or facilitation depends not only on the average change in the synaptic strength but also on its fluctuation. Thus, synaptic depression due to a change in the height of the excitatory postsynaptic potentials (EPSPs), but without a change in the probability of release, will affect the rates differently from synaptic depression that leaves the height of the EPSPs unaffected but decreases the probability of release, even if both mechanisms result in the same average depression. Another effect of synaptic depression to take into account is that the effect of a spike is decreased if it follows shortly after the preceding spike. This will decrease the fluctuations in the input, relative to the fluctuations in the activity of the presynaptic cells. Facilitation will have the opposite effect, since it increases the effect of a spike if it closely follows the previous one. These issues warrant further study.
Appendix A: Derivation of the Mean-Field Theory

A.1 Population Rates. We first consider the population-averaged activities $m_E(t)$ and $m_I(t)$ in the limit of large $N_E$ and $N_I$ and finite K. We assume that each cell in the kth population is updated stochastically at a rate $1/\tau_k$. When a cell is updated, it moves to the active state if its total input is above threshold; otherwise its updated state is 0. It is convenient to define a time-dependent local rate variable,
$$m_k^i(t) = \langle \sigma_k^i(t) \rangle. \tag{A.1}$$
Here, the symbol $\langle \cdots \rangle$ does not mean an average over time, as in equation 5.4 and thereafter. Instead, it means an average over all initial conditions that are consistent with given values of $m_k(0)$, and also over the random sequence of update times. It is well known that the rate of a binary variable that obeys the update rule satisfies the following continuous-time dynamics (Ginzburg & Sompolinsky, 1994; Glauber, 1963),
$$\tau_k \frac{d}{dt} m_k^i(t) = -m_k^i(t) + \Theta(u_k^i(t)), \tag{A.2}$$
where $u_k^i(t)$ is the total synaptic input into cell i in the kth population, relative to its threshold, and is given in our case by equation 2.2. If a cell receives $n_E(t)$ excitatory and $n_I(t)$ inhibitory feedback inputs, respectively, then its input is
$$u_k^i(t) = \sqrt{K}\, J_{k0} m_0 + \frac{J_{kE}}{\sqrt{K}}\, n_E(t) + \frac{J_{kI}}{\sqrt{K}}\, n_I(t) - \theta_k. \tag{A.3}$$
The main assumption underlying the mean-field theory is that the activities of the different input cells to a given cell are uncorrelated. Technically, this holds rigorously provided that $K \ll \log N_k$ (Derrida et al., 1987). Using this assumption, the population average of equation A.2 yields the following mean-field equations for the population activities,
$$\tau_k \frac{d}{dt} m_k(t) = -m_k(t) + F_k(m_E(t), m_I(t)), \tag{A.4}$$
where $F_k$ denotes the probability that a cell updated at time t ends up in the active state. It is given by
$$F_k(m_E, m_I) = \sum_{n_1, n_2 = 0}^{\infty} p_1(n_1)\, p_2(n_2)\, \Theta\!\left(\sqrt{K}\, J_{k0} m_0 + \sum_l \frac{J_{kl}}{\sqrt{K}}\, n_l - \theta_k \right), \tag{A.5}$$
where $p_l(n)$ is the probability that a cell receives n active inputs from the lth population.
For $N_E, N_I \to \infty$, the probability of s synapses of population l projecting to a cell is $K^s e^{-K}/s!$. On average, each of these synapses has a probability $m_l$ of being active; hence,
$$p_l(n) = \sum_{s=n}^{\infty} \frac{K^s}{s!}\, e^{-K} \binom{s}{n}\, m_l^n (1 - m_l)^{s-n} = \frac{(m_l K)^n}{n!}\, e^{-m_l K}. \tag{A.6}$$
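As a quick numerical sanity check of equation A.6, the following Python sketch (with arbitrary test values for K and $m_l$, not taken from the paper) verifies that a Poisson(K) synapse count, thinned with activation probability $m_l$, yields a Poisson($m_l K$) count of active inputs.

```python
import numpy as np
from scipy.stats import binom, poisson

# Check of equation A.6: a Poisson(K) number of synapses, each active
# independently with probability m_l, gives a Poisson(m_l * K) number of
# active inputs. K and m_l are arbitrary test values.
K, ml = 50.0, 0.3
n = np.arange(40)          # active-input counts to check
s = np.arange(400)         # synapse counts summed over (truncates the tail)

# p_l(n) = sum_s Poisson(s; K) * Binomial(n; s, m_l)
pl = np.array([np.sum(poisson.pmf(s, K) * binom.pmf(nn, s, ml)) for nn in n])
assert np.allclose(pl, poisson.pmf(n, ml * K))
```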
Equations A.4 through A.6 define the mean-field equations for the population activity levels at finite K. The average values of $n_E$ and $n_I$ satisfy $\langle n_k \rangle = m_k K$, and the standard deviations $\sigma(n_E)$ and $\sigma(n_I)$ are given by $\sigma(n_k) = \sqrt{m_k K}$. In the large-K limit, the probability distributions $p_k(n)$ can be replaced by gaussian distributions. According to equation A.6, the means and variances of these distributions are $[n_k] = [(\delta n_k)^2] = K m_k$. Therefore, in the limit $K \to \infty$, $F_k(m_E, m_I)$ is given by
$$F_k(m_E, m_I) = \int Dx\; \Theta(u_k + \sqrt{\alpha_k}\, x) = H\!\left(\frac{-u_k}{\sqrt{\alpha_k}}\right), \tag{A.7}$$
where $Dx = dx\, \exp(-x^2/2)/\sqrt{2\pi}$. From the above statistics of $n_l$, one obtains that the mean input into a cell of population k, relative to threshold, is given by
$$u_k = (J_{k0} m_0 + J_{kE} m_E + J_{kI} m_I)\sqrt{K} - \theta_k, \tag{A.8}$$
and the variance of the input, $\alpha_k$, by
$$\alpha_k = (J_{kE})^2 m_E + (J_{kI})^2 m_I, \tag{A.9}$$
from which equations 3.3 through 3.6 follow.
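For concreteness, here is a minimal Python sketch that integrates the large-K mean-field dynamics, equations A.4 and A.7 through A.9, by forward Euler. All coupling constants, thresholds, and time constants below are arbitrary placeholder values chosen only so that excitation and inhibition roughly balance; they are not parameters from the paper.

```python
from math import erfc, sqrt

def H(z):                       # H(z) = int_z^inf Dx, the gaussian tail
    return 0.5 * erfc(z / sqrt(2.0))

K, m0 = 1000.0, 0.1
J = {"E": {"0": 1.0, "E": 1.0, "I": -2.0},      # J_{k0}, J_{kE}, J_{kI}
     "I": {"0": 0.8, "E": 1.0, "I": -1.8}}
theta = {"E": 1.0, "I": 0.7}
tau = {"E": 1.0, "I": 0.9}

def F(k, mE, mI):               # equations A.7 to A.9
    u = (J[k]["0"] * m0 + J[k]["E"] * mE + J[k]["I"] * mI) * sqrt(K) - theta[k]
    alpha = J[k]["E"] ** 2 * mE + J[k]["I"] ** 2 * mI
    return H(-u / sqrt(alpha))

mE, mI, dt = 0.05, 0.05, 0.01   # Euler integration of tau_k dm_k/dt = -m_k + F_k
for _ in range(20000):
    mE, mI = (mE + dt / tau["E"] * (-mE + F("E", mE, mI)),
              mI + dt / tau["I"] * (-mI + F("I", mE, mI)))
print(f"fixed-point rates: mE = {mE:.4f}, mI = {mI:.4f}")
```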
A.2 Autocorrelations. We now extend the analysis to evaluate the dynamics of the autocorrelation function $q_k(\tau)$ (see equation 5.16). Using arguments similar to those leading to equation A.2, $q_k(\tau)$ satisfies an equation of the following form,
$$\tau_k \frac{dq_k}{d\tau} = -q_k(\tau) + \int_0^{\infty} \frac{dt'}{\tau_k}\, e^{-t'/\tau_k}\, F_k(\{m_l\}; \{q_l(t' + \tau)\}), \tag{A.10}$$
where
$$F_k = \left[ \left\langle \Theta(u_k^i(t))\, \Theta(u_k^i(t + t' + \tau)) \right\rangle \right]. \tag{A.11}$$
Here the averaging is also over the absolute time t. The time integral on the r.h.s. of equation A.10 takes into account the correlation between the inputs
to a cell that updates its state at time $t + \tau$ and its inputs at the last update before time t. Thus, the time integral runs over the exponential distribution of the interval since the last update before time t. Separating the total number of active inputs from the lth population into those that are active at both times ($n_{1l}$) and those that are active at only one of the two times ($n_{2l}$ and $n_{3l}$, respectively), one can write
$$F_k = \sum_{\{n_{jl}\}} \prod_{l=1,2} p_l(n_{1l}, n_{2l}, n_{3l})\; \Theta\!\left(\sqrt{K} J_{k0} m_0 + \sum_l \frac{J_{kl}}{\sqrt{K}}(n_{1l} + n_{2l}) - \theta_k\right) \Theta\!\left(\sqrt{K} J_{k0} m_0 + \sum_l \frac{J_{kl}}{\sqrt{K}}(n_{1l} + n_{3l}) - \theta_k\right), \tag{A.12}$$
where
$$p_l(n_1, n_2, n_3) = \frac{(q_l K)^{n_1}\, \big((m_l - q_l)K\big)^{n_2 + n_3}}{n_1!\, n_2!\, n_3!}\; e^{-(2m_l - q_l)K}. \tag{A.13}$$
In the large-K limit, this can be written as
$$F_k = \int Dx_1 \int Dx_2 \int Dx_3\; \Theta\!\left(u_k + \sqrt{\beta_k}\, x_1 + \sqrt{\alpha_k - \beta_k}\, x_2 - \theta_k\right) \Theta\!\left(u_k + \sqrt{\beta_k}\, x_1 + \sqrt{\alpha_k - \beta_k}\, x_3 - \theta_k\right) = \int Dx \left[ H\!\left( \frac{\theta_k - u_k - \sqrt{\beta_k}\, x}{\sqrt{\alpha_k - \beta_k}} \right) \right]^2, \tag{A.14}$$
with $u_k$ and $\alpha_k$ as above, and $\beta_k$ given by
$$\beta_k(\tau) = \sum_{l=1,2} (J_{kl})^2\, q_l(\tau). \tag{A.15}$$
So $q_k$ satisfies equation 5.17.
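The gaussian identity in equation A.14 is easy to verify numerically; the short Monte Carlo sketch below does so for arbitrary test values of $u_k$, $\theta_k$, $\alpha_k$, and $\beta_k$ (none taken from the paper).

```python
import numpy as np
from scipy.special import erfc

# Monte Carlo check of equation A.14: averaging the product of the two step
# functions over x1, x2, x3 equals the gaussian average of H(...)^2.
rng = np.random.default_rng(0)
u, theta, alpha, beta = 0.3, 0.2, 1.0, 0.4      # arbitrary test values
n = 500_000

H = lambda z: 0.5 * erfc(z / np.sqrt(2.0))      # gaussian tail integral

x1, x2, x3 = rng.standard_normal((3, n))
lhs = np.mean((u + np.sqrt(beta) * x1 + np.sqrt(alpha - beta) * x2 > theta)
              & (u + np.sqrt(beta) * x1 + np.sqrt(alpha - beta) * x3 > theta))

x = rng.standard_normal(n)
rhs = np.mean(H((theta - u - np.sqrt(beta) * x) / np.sqrt(alpha - beta)) ** 2)
print(lhs, rhs)   # the two estimates agree to Monte Carlo accuracy (~1e-3)
```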
A.3 Sensitivity to Initial Conditions. The derivation of equation 8.5 for the overlaps $Q_k(t) = (m_k + D_k)/2$ of two trajectories (see equation 8.2) that start from slightly different initial conditions is similar to that of $q_k$. Here the inputs $n_{1,k}$ are the sources to a given cell that are active at time t in both trajectories. The only difference between the equation for the delayed-time autocorrelations and that for the equal-time overlap between two trajectories is the integral over the previous update times, which appears in equations A.10 and 5.17. This difference results from the fact that in the latter case, the update sequence is identical in the two trajectories.
Appendix B: Deterministic Update Rules

The general form of equation A.4 is usually derived for a binary variable that is updated stochastically at a rate $1/\tau_k$. One might therefore argue that the irregular firing in our model is due to the stochasticity of the update times of the model neurons. To show that this is not the case, we define here a completely deterministic dynamic model and show that it leads to exactly the same equations for the mean activity rates as those given above.

Consider the same network model, except that neuron i of population k is updated at times $t = (n + \delta_k^i)\tau_k$, with $n = 0, 1, 2, \ldots$ and $\delta_k^i$ chosen randomly between 0 and 1. Let $m_k^+(t)$ be the probability that a neuron of population k, updated at time t, goes into (or stays in) the active state. Since all neurons of population k are updated exactly once between times $t - \tau_k$ and t, $m_k(t)$ is given by
$$m_k(t) = \frac{1}{\tau_k} \int_0^{\tau_k} dt'\; m_k^+(t - t'). \tag{B.1}$$
Going through arguments similar to those shown above, one can show that $m_k$ satisfies
$$m_k(t) = \frac{1}{\tau_k} \int_0^{\tau_k} dt'\; F_k(m_E(t - t'), m_I(t - t')), \tag{B.2}$$
with $F_k$ given by equation A.5. If we introduce inhomogeneities in the rates at which the cells are updated, so that cell i of population k is updated at times $t = (n + \delta_k^i)\tau_k^i$, where $\tau_k^i$ has a probability $R_k(\tau)\, d\tau$ of lying between $\tau$ and $\tau + d\tau$, we find that $m_k$ evolves as
$$m_k(t) = \int_0^{\infty} d\tau\; \frac{R_k(\tau)}{\tau} \int_0^{\tau} dt'\; F_k(m_E(t - t'), m_I(t - t')). \tag{B.3}$$
For $R_k(t) = t\, e^{-t/\tau_k}/\tau_k^2$, this can be written as
$$m_k(t) = \frac{1}{\tau_k} \int_0^{\infty} dt'\; e^{-t'/\tau_k}\; F_k(m_E(t - t'), m_I(t - t')), \tag{B.4}$$
and this is equivalent to equation A.4. Thus, in this completely deterministic model, the mean rates $m_k$ satisfy exactly the same equations as in the model with stochastic updating. This also holds true for the other mean-field equations of the model.
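The claim can also be checked in simulation. The toy sketch below, with an arbitrary sinusoidal drive and illustrative sizes (none taken from the paper), updates one population of binary units either at deterministic, phase-staggered times or at stochastic times, and shows that the two population rates track each other.

```python
import numpy as np

rng = np.random.default_rng(1)
Nn, tau, dt, T = 20000, 1.0, 0.01, 20.0
drive = lambda t: 0.5 * np.sin(2 * np.pi * t / 5.0)  # common input relative to threshold

s_det = np.zeros(Nn)      # deterministic scheme: unit i updates at (n + delta_i) tau
s_sto = np.zeros(Nn)      # stochastic scheme: each unit updates with prob. dt / tau
phase = rng.uniform(0.0, tau, Nn)

diffs = []
for step in range(int(T / dt)):
    t = step * dt
    new = (drive(t) + 0.3 * rng.standard_normal(Nn)) > 0.0   # noisy threshold crossing
    upd = ((t - phase) % tau) < dt
    s_det[upd] = new[upd]
    upd = rng.random(Nn) < dt / tau
    s_sto[upd] = new[upd]
    diffs.append(abs(s_det.mean() - s_sto.mean()))
print(f"max rate difference: {max(diffs):.4f}")   # small: O(1/sqrt(Nn)) plus scheme noise
```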
Acknowledgments

We thank D. J. Amit, D. Hansel, T. Sejnowski, and M. Tsodyks for very helpful discussions. We are grateful to M. Abeles, H. Bergman, and E. Vaadia
for permission to present their data. This work is partially supported by the Fund for Basic Research of the Israeli Academy of Science.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., & Vaadia, E. (1988). Unpublished data.
Amit, D. J., & Brunel, N. (1997a). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comput., 8, 1185–1202.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neurosci., 14, 2870–2892.
Bell, A., Mainen, Z. F., Tsodyks, M., & Sejnowski, T. (1994). Why do cortical neurons fire irregularly? Soc. Neurosci. Abstr., 20, 1527.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neurosci., 12, 4745–4765.
Burns, B. D., & Webb, A. C. (1976). The spontaneous activity of neurons in the cat's visual cortex. Proc. R. Soc. Lond. B, 194, 211–223.
Bush, P. C., & Douglas, R. J. (1991). Synchronization of bursting action potential discharge in a model network of neocortical neurons. Neural Comp., 3, 19–30.
Derrida, B., Gardner, E., & Zippelius, A. (1987). An exactly soluble asymmetric neural network model. Europhys. Lett., 4, 167–173.
Douglas, R. J., & Martin, K. A. C. (1991). Opening the grey box. Trends in Neurosci., 14, 286–293.
Douglas, R. J., Martin, K. A. C., & Whitteridge, D. (1991). An intracellular analysis of the visual response of neurons in cat visual cortex. J. Physiol., 440, 659–696.
Ermentrout, G. B., & Gutkin, B. (in press). Dynamics of spike generation determines in vivo spike train statistics.
Ferster, D., Chung, S., & Wheat, H. (1996). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–252.
Ferster, D., & Jagadeesh, B. (1992). EPSP-IPSP interactions in cat visual cortex studied with in vivo whole-cell patch recording. J. Neurosci., 12, 1262–1274.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single cell. Biophys. J., 4, 41–68.
Gerstner, W., & van Hemmen, J. L. (1993). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3190.
Glauber, R. J. (1963). Time-dependent statistics of the Ising model. J. Math. Phys., 4, 294–307.
Grannan, E., Kleinfeld, D., & Sompolinsky, H. (1992). Stimulus-dependent synchronization of neuronal assemblies. Neural Comp., 4, 550–569.
Gray, C. M., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchronization in excitatory neural networks. Neural Comp., 3, 307–338.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Holt, G. R., Softky, W. R., Koch, C., & Douglas, R. J. (1996). A comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75, 1806–1814.
Mainen, Z. J., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Mézard, M., Parisi, G., & Virasoro, M. A. (1987). Spin glass theory and beyond. Singapore: World Scientific.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967a). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophys. J., 7, 391–418.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967b). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Curr. Opin. Neurobiol., 5, 248–250.
Softky, W. R. (1995). Simple codes versus efficient codes. Curr. Opin. Neurobiol., 5, 239–247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Sompolinsky, H., Crisanti, A., & Sommers, H. J. (1988). Chaos in neural networks. Phys. Rev. Lett., 61, 259–262.
Stratford, K. J., Tarczy-Hornoch, K., Martin, K. A. C., Bannister, N. J., & Jack, J. J. B. (1996). Excitatory synaptic inputs to spiny stellate cells in cat visual cortex. Nature, 382, 258–261.
Thomson, A. M., West, D. C., & Deuchars, J. (1995). Properties of single axon excitatory postsynaptic potentials elicited in spiny interneurons by action potentials in pyramidal neurons in slices of rat neocortex. Neurosci., 69, 727–738.
Thomson, A. M., West, D. C., Hahn, J., & Deuchars, J. (1996). Single axon IPSPs elicited in pyramidal cells by three classes of interneurones in slice of rat neocortex. J. Physiol., 496, 81–102.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular firing cell. Neural Comput., 9, 733–745.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett., 71, 1282–1285.
Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Vaadia, E., Haalman, I., Abeles, M., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neural interaction in monkey cortex in relation to behavioral events. Nature, 373, 515–518.
van Vreeswijk, C. (1996). Partial synchrony in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1997). Irregular firing in sparse networks of weakly coupled neurons. Unpublished manuscript.
van Vreeswijk, C., & Sompolinsky, H. (1998). Locking to fluctuating input in balanced networks. Unpublished manuscript.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions of localized populations of model neurons. Biophys. J., 12, 1–24.

Received April 9, 1997; accepted December 10, 1997.
ARTICLE
Communicated by Barak Pearlmutter
Blind Source Separation and Deconvolution: The Dynamic Component Analysis Algorithm
H. Attias
C. E. Schreiner
Sloan Center for Theoretical Neurobiology and W. M. Keck Foundation Center for Integrative Neuroscience, University of California at San Francisco, San Francisco, CA 94143-0444, U.S.A.
We derive a novel family of unsupervised learning algorithms for blind separation of mixed and convolved sources. Our approach is based on formulating the separation problem as a learning task of a spatiotemporal generative model, whose parameters are adapted iteratively to minimize suitable error functions, thus ensuring stability of the algorithms. The resulting learning rules achieve separation by exploiting high-order spatiotemporal statistics of the mixture data. Different rules are obtained by learning generative models in the frequency and time domains, whereas a hybrid frequency-time model leads to the best performance. These algorithms generalize independent component analysis to the case of convolutive mixtures and exhibit superior performance on instantaneous mixtures. An extension of the relative-gradient concept to the spatiotemporal case leads to fast and efficient learning rules with equivariant properties. Our approach can incorporate information about the mixing situation when available, resulting in a “semiblind” separation method. The spatiotemporal redundancy reduction performed by our algorithms is shown to be equivalent to information-rate maximization through a simple network. We illustrate the performance of these algorithms by successfully separating instantaneous and convolutive mixtures of speech and noise signals.

1 Sources as Dynamic Components

The problem of blind source separation is defined as follows. Consider L independent signal sources (e.g., different speakers in a room) and $L'$ sensors (e.g., microphones at several locations). Each sensor receives a mixture of the source signals. The task is to recover the unobserved sources from the observed sensor signals. This separation should be performed in the absence of any information about the mixing process or the sources, apart from their mutual statistical independence, and hence is termed “blind.” Successful techniques for blind separation can have many applications in areas involving the processing of multisensor signals, such as speech
recognition and enhancement, the analysis and classification of biomedical recordings, and target localization and tracking by radar and sonar devices. Such real-world situations generally involve source signals that are delayed and attenuated by different amounts on their way to the different sensors, as well as multipath propagation, resulting in a situation termed convolutive mixing. Mathematically, the mixing is described by a matrix of filters operating on the sources. The problem is complicated by the fact that the number of sources L is unknown and may be larger than the number of sensors $L'$, the source properties (e.g., location) are time dependent, the mixing may be nonlinear due to the impulse response of the medium and sensors, and the signals are corrupted by propagation and sensor noise. Currently there exists no algorithm that can solve the general problem. The human auditory system, however, can solve it under some conditions for $L' = 2$ (the cocktail party effect; see Bregman, 1990).

Given the complexity of the actual problem, current work on blind separation focuses on an idealized version where the mixing is square ($L' = L$), invertible, linear, noiseless, and time independent. Even for this version, significant progress has been made only recently, and mostly on a further simplified case where the mixing is instantaneous (nonconvolutive); that is, it involves no delays or frequency distortion. A robust and efficient set of algorithms that perform independent component analysis (ICA) has proved effective for a large class of instantaneous mixtures (Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Pham, 1996; Pearlmutter & Parra, 1997; Hyvärinen & Oja, 1997). In that case, we seek a separating transformation $G_{ij}$ that, when applied to the sensor signals $y_i(t)$, which are generally correlated, will produce a new set of signals $\hat{x}_i(t)$,
$$\hat{x}_i(t) = \sum_{j=1}^{L} G_{ij}\, y_j(t). \tag{1.1}$$
If successful, the separation results in these new signals being the original sources $x_i(t)$ (to within an order permutation and scaling), and thus mutually statistically independent. Methods to find $G_{ij}$ turn this observation around and impose the independence of the $\hat{x}_i(t)$ as a condition on $G_{ij}$. Expressed in equal-time moments, $G_{ij}$ should be chosen such that the resulting signals satisfy $\langle \hat{x}_i(t)^m\, \hat{x}_j(t)^n \rangle = \langle \hat{x}_i(t)^m \rangle \langle \hat{x}_j(t)^n \rangle$ for $i \neq j$ and any powers m, n; the average is taken over time t. Thus, ICA methods attempt to deduce $G_{ij}$ from the equal-time (or spatial, referring to the sensor coordinate i) sensor statistics.

Realistic situations are characterized by convolutive mixing, where many delayed and attenuated versions of each source signal arrive at each sensor. Since a sensor signal $y_i(t)$ consists of the sources not only at the present time t but also at preceding times $t' < t$, recovering the sources must be done using both present and past sensor signals $y_i(t' \leq t)$. Hence, in contrast
to instantaneous mixtures, which require a spatial separating transformation $G_{ij}$ (see equation 1.1), convolutive mixtures require a spatiotemporal separating transformation $G_{ij}(t)$,
$$\hat{x}_i(t) = \sum_{j=1}^{L} \int_0^{\infty} dt'\, G_{ij}(t')\, y_j(t - t'). \tag{1.2}$$
The simple time dependence $G_{ij}(t) = G_{ij}\,\delta(t)$ reduces the convolutive case (equation 1.2) to the instantaneous one (equation 1.1). In general, $G_{ij}(t)$ has a nontrivial time dependence, as it couples mixing with filtering, and cannot be found by ICA methods. In fact, equal-time statistics are not sufficient to deduce $G_{ij}$ even for some instantaneous mixtures, such as mixtures of colored gaussian signals, which are consequently inseparable by ICA.

In this article, we present a novel family of unsupervised learning algorithms for blind separation of convolutive mixtures, termed dynamic component analysis (DCA). “Dynamic” refers to the temporal nature of the separating transformation $G_{ij}(t)$. The DCA algorithms learn $G_{ij}(t)$ from the spatiotemporal sensor statistics. Whereas the sensors $y_i(t)$ are generally correlated, the resulting new signals $\hat{x}_i(t)$ are mutually independent both spatially and temporally, and are termed the dynamic components (DCs) of the sensor data. For successful separation, the DCs correspond to the original sources. The key to the DCA approach lies in formulating the convolutive blind separation problem as an unsupervised learning task of a spatiotemporal generative model, which becomes an optimization problem using the Kullback-Leibler distance as a cost function.

We now give a quick overview of the DCA procedure. First, we observe that the condition on the equal-time moments is not sufficient to identify $G_{ij}(t)$. However, invoking the statistical independence of the sources, we can impose the stronger condition on the nonequal-time moments, $\langle \hat{x}_i(t)^m\, \hat{x}_j(t + \tau)^n \rangle = \langle \hat{x}_i(t)^m \rangle \langle \hat{x}_j(t + \tau)^n \rangle$ for $i \neq j$ and any powers m, n at any $\tau$. This is because the amplitude of source i at time t is independent of the amplitude of source $j \neq i$ at any time $t + \tau$. This condition requires processing the sensor signals in time blocks in order to exploit their temporal statistics to deduce $G_{ij}(t)$. Of course, requiring spatiotemporal independence results in separation of instantaneous mixtures as well. Indeed, the resulting instantaneous DCA algorithms are more powerful than the spatially based ICA ones.

Second, we demonstrate that an effective way to impose spatiotemporal independence is through the use of a generative model (Everitt, 1984) for the probability density of the sensor signals. Specifically, we construct a parametric model for the joint density of the L sensors over N-point time blocks, $p_y[y_1(t_0), \ldots, y_1(t_{N-1}), \ldots, y_L(t_0), \ldots, y_L(t_{N-1})]$. The adaptive model parameters describe the densities and autocorrelations of the independent sources and the convolutive mixing process. Different models can be constructed using time- and frequency-domain representations. To learn
a model, we derive the appropriate error function, which measures the information-theoretic distance between the model and the observed sensor densities, and optimize the parameters to minimize this error, resulting in the DCA learning rules. The optimized parameter values provide an estimate of either the mixing process, from which the separating transformation $G_{ij}(t)$ is readily available as its inverse, or of $G_{ij}(t)$ directly.

The DCA approach has several advantages: (1) The generative model formulation, which includes adaptive source parameters, facilitates effective separation for arbitrary source densities. (2) Stability is ensured by the error function optimization procedure. (3) Thanks to our generalization of the relative-gradient concept (Cardoso & Laheld, 1996; Amari, Cichocki, & Yang, 1996) to the convolutive case, the learning is accelerated and possesses the property of equivariance, which guarantees uniform performance across the space of all invertible mixing processes, and thus robustness. (4) It is sufficiently flexible to allow incorporating information about the mixing filters when available (resulting in “semiblind” separation), thus improving performance.

DCA is designed to recover the sources from the sensor data when the latter are an unknown linear combination of the former. However, it actually performs the more general task of spatiotemporal redundancy reduction, and as such can be applied to any temporal multivariable data set to extract its dynamic components.

Section 2 of this article discusses the DCA-I algorithms for separating instantaneous mixtures using different generative models in the time and frequency domains and a hybrid time-frequency one, and derives the corresponding error functions and learning rules. The DCA-C algorithms for separating convolutive mixtures are presented in section 3, based on three analogous models. In section 4 we discuss the advantage of learning the mixing rather than the separating transformation and present the resulting semiblind separation algorithm DCA-CS. The learning of more general separating transformations using DCA-CR is discussed in section 5. An equivalent formulation of DCA algorithms in terms of maximizing the information rate through a network is presented in section 6. Finally, in section 7 we describe an effective use of DCA methods to achieve separation when there are more sensors than sources. Most derivations and technical details are relegated to the appendixes. We demonstrate and compare the performance of different DCA algorithms throughout the article by applying them to instantaneous and convolutive mixtures of speech signals, as well as random signals with different densities and autocorrelations.
Notation. Vectors are denoted by boldfaced lowercase letters and matrices by boldfaced uppercase letters. Vector and matrix elements are not boldfaced. The complex conjugate of z is $z^*$. The inverse of a matrix A is
denoted by $A^{-1}$, its transpose by $A^T$ ($A^T_{ij} = A_{ji}$), and its complex transpose by $A^\dagger$ ($A^\dagger_{ij} = A^*_{ji}$). Frequency-domain quantities are distinguished from their time-domain counterparts by the symbol $\sim$ above. We work in discrete time and usually consider N-point time blocks; thus $t = t_m$, $m = 0, \ldots, N-1$. To these correspond in the frequency domain the discrete frequencies $\omega = \omega_k = 2\pi k/N$ with $k = 0, \ldots, N-1$, which are related to the actual sound frequencies $f_k$ by $\omega_k = 2\pi f_k/f_s$, where $f_s$ is the sampling frequency. For discrete-time signal processing issues, see Oppenheim and Schafer (1989). For example, $x(t_m)$ is a vector of time-domain signals $x_i(t_m)$. The corresponding frequency-domain vector of signals is $\tilde{x}(\omega_k)$; the two are related by the discrete Fourier transform (DFT), provided here for reference:
$$\text{DFT:}\quad \tilde{x}_i(\omega_k) = \sum_{m=0}^{N-1} e^{i\omega_k m}\, x_i(t_m), \qquad \text{inverse DFT:}\quad x_i(t_m) = \frac{1}{N} \sum_{k=0}^{N-1} e^{-i\omega_k m}\, \tilde{x}_i(\omega_k). \tag{1.3}$$
Similarly, $H(t_m)$ denotes a matrix of filters $H_{ij}(t_m)$; this time-domain representation is called the filter impulse responses. Its DFT $\tilde{H}(\omega_k)$ contains the filter frequency responses $\tilde{H}_{ij}(\omega_k)$. This notation will often be simplified by converting the times and frequencies to subscripts: $x_m = x(t_m)$, $\tilde{x}_k = \tilde{x}(\omega_k)$, and similarly $H_m = H(t_m)$, $\tilde{H}_k = \tilde{H}(\omega_k)$.

For a filter (or signal) vector $g_m$, we define a diagonal matrix $D^g_m$ that contains it by
$$D^g_{ij,m} = g_{i,m}\, \delta_{ij}. \tag{1.4}$$
Finally, we define two linear operations on signals. The linear convolution of $x_{i,m}$ and $y_{j,m}$ is denoted by $*$:
$$(x_i * y_j)_m = \sum_{n=-\infty}^{\infty} x_{i,n}\, y_{j,m-n}. \tag{1.5}$$
The convolution produces a new signal $z_m = (x_i * y_j)_m$, $-\infty < m < \infty$. In practice the signals are finite, and the actual limits on m, n are determined by their lengths; to use equation 1.5 for a finite signal, for example $x_{i,m}$, we define $x_{i,m} = 0$ at time points $t_m$ where it is not defined. The cross-correlation of $x_{i,m}$ and $y_{j,m}$ is denoted by $\times$:
$$(x_i \times y_j)_m = \sum_{n=-\infty}^{\infty} x_{i,n}\, y_{j,n+m}. \tag{1.6}$$
Like the convolution, the cross-correlation produces a new signal $w_m = (x_i \times y_j)_m$ with $-\infty < m < \infty$. Note from equation 1.6 that the averaging implied by the term correlation extends only over time. Averaging over both time and an ensemble of N-point signals will be denoted by $\langle (x_i \times y_j)_m \rangle$. We shall often use convolution and cross-correlation in matrix notation. Thus, $y_m = (H * x)_m$ is a signal vector whose ith entry at time point $t_m$ is given by the convolution $y_{i,m} = \sum_j (H_{ij} * x_j)_m = \sum_{jn} H_{ij,n}\, x_{j,m-n}$. Similarly, the cross-correlation matrix $(x \times y^T)_m$ is an $L \times L$ matrix whose ij element is given by equation 1.6. Note from equation 1.6 that $m \to -m$ simply means $(x_i \times y_j)_{-m} = \sum_n x_{i,n}\, y_{j,n-m}$; it also transposes the cross-correlation matrix while exchanging the order of $x_m$ and $y_m$: $(x \times y^T)_{-m} = (y \times x^T)^T_m$. We recall (Oppenheim & Schafer, 1989) that the DFT of a convolution is simply the product of the DFTs of the convolved signals; hence, the DFT of the matrix $(x * y^T)_m$ is $\tilde{x}_k \tilde{y}^T_k$. The DFT of the cross-correlation matrix $(x \times y^T)_{-m}$ is $\tilde{x}_k \tilde{y}^\dagger_k$. These relations can be verified using equation 1.3.
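These identities are also easy to confirm numerically. The Python sketch below implements the paper's DFT sign convention (equation 1.3) directly as a matrix and uses circular (N-periodic) convolution and cross-correlation, for which the identities hold exactly; all names and sizes are illustrative.

```python
import numpy as np

N = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(N)
y = rng.standard_normal(N)

m = np.arange(N)
w = 2 * np.pi * m / N                            # omega_k = 2 pi k / N
F = np.exp(1j * np.outer(w, m))                  # DFT matrix with e^{+i omega_k m}
dft = lambda s: F @ s

conv = np.array([np.sum(x * y[(mm - m) % N]) for mm in m])   # (x * y)_m, circular
corr = np.array([np.sum(x * y[(m + mm) % N]) for mm in m])   # (x x y)_m, circular

# The DFT of the convolution equals the product of the DFTs ...
assert np.allclose(dft(conv), dft(x) * dft(y))
# ... and the DFT of (x x y)_{-m} equals x~_k times the conjugate of y~_k.
assert np.allclose(dft(corr[(-m) % N]), dft(x) * np.conj(dft(y)))
```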
2 Instantaneous Mixing

In this section we derive the DCA learning rules for separating instantaneous mixtures. We also demonstrate that the use of the temporal statistics of the sensor signals facilitates the separation of mixtures that equal-time ICA algorithms (e.g., Bell & Sejnowski, 1995) fail to separate. Note that the instantaneous mixing problem is obtained from the convolutive case in the limit where the propagation delays are negligible compared to the autocorrelation times of the source signals. Hence, algorithms derived in this section can be used to achieve approximate separation of convolutive mixtures when the mixing filters are sufficiently short.

We denote the original, unobserved source signals by $x_{i,m}$ and the observed sensor signals by $y_{i,m}$, $i = 1, \ldots, L$, $m = 0, \ldots, N-1$. The $L \times L$ mixing matrix $H_{ij}$ relates them by $y_{i,m} = \sum_j H_{ij}\, x_{j,m}$, or in matrix notation,
$$y_m = H x_m \tag{2.1}$$
for all m. This mixing is termed instantaneous since the sensor signals at $t_m$ depend on the sources at the same, but no earlier, time point. Were the mixing matrix given, its inverse could have been applied to the sensor signals to recover the sources by $x_m = H^{-1} y_m$. In the absence of any information about the mixing, the blind separation problem consists of estimating a separating matrix G from the observed sensor signals alone. The source signals can then be recovered by
$$\hat{x}_m = G y_m. \tag{2.2}$$
Generally, the sources can be recovered only to within a scaling factor and an order permutation (see section 2.4). Hence, $\hat{x}_{i,m} = \lambda_i\, x_{\pi(i),m}$ for arbitrary
scaling factors $\lambda_i \neq 0$ and an arbitrary permutation $\pi$ of $1, \ldots, L$, and G is a correspondingly scaled and permuted version of $H^{-1}$.

In this article we solve the separation problem by first converting it to an optimization problem. For this purpose, we construct a generative model (Everitt, 1984) of the observed sensor signals $y_m$, where the hidden variables are the unobserved source signals. Since our approach is statistical, we model the density of the sensor signals $p_y$, rather than the signals themselves; note that $p_y$ describes L jointly distributed stochastic processes and is thus the joint density of all sensor signals at all time points: $p_y = p_y(y_{1,0}, \ldots, y_{1,N-1}, \ldots, y_{L,0}, \ldots, y_{L,N-1})$. The sensor density can be expressed in terms of the separating matrix G and the densities of the independent sources. Each source, in turn, is modeled as a stochastic nongaussian process described by its marginal (one-time-point) density and autocorrelation function, which are parameterized by $\xi_i$ and $g_{i,m}$, respectively, as detailed below. To complete the optimization formulation, we shall define an error function that measures the distance between our model $p_y$ and the observed sensor density $p^o_y$. This error is a function of $\xi_i$, $g_{i,m}$, and $G_{ij}$, termed the separation parameters, which are then optimized to minimize the error, so that the model $p_y$ best approximates the observed $p^o_y$. The optimal separating matrix G is subsequently used to recover the sources according to equation 2.2.

We assume that the sources are independent, stationary, and zero mean ($\langle x_{i,m} \rangle = 0$). We also consider their autocorrelations $\langle (x_i \times x_i)_m \rangle = \langle \sum_n x_{i,n}\, x_{i,n+m} \rangle$. A simple way to model them is to represent the source $x_{i,m}$ as a filtered version of a white (that is, $\delta$-correlated) zero-mean signal $u_{i,m}$, so that
$$y_m = H x_m, \qquad x_{i,m} = (h_i * u_i)_m, \tag{2.3}$$
where $u_{i,m}$ satisfies
$$\langle u_{i,m} \rangle = 0, \qquad \frac{1}{N} \langle (u_i \times u_i)_m \rangle = \delta_{m,0}, \tag{2.4}$$
and $*$ denotes linear convolution (see equation 1.5). Recall that we are working with N-time-point signal segments, hence the 1/N factor in equation 2.4; $\langle \cdot \rangle$ denotes averaging over an ensemble of such signals. The filter $h_{i,m}$ then determines the source autocorrelations through $\langle (x_i \times x_i)_m \rangle = (h_i \times h_i)_m$. According to equations 2.3 and 2.4, therefore, the sensor signals are produced by filtering independent white sources individually and then mixing them. Since $u_{i,m}$ is white, the power spectrum of $h_{i,m}$ equals that of $x_{i,m}$: $|\tilde{h}_{i,k}|^2 = \langle |\tilde{x}_{i,k}|^2 \rangle$ (see equation 1.3). For the purpose of estimating the separating matrix, however, rather than considering the $h_{i,m}$, it is convenient to use their inverses $g_{i,m}$:
$$u_{i,m} = (g_i * x_i)_m, \qquad x_m = G y_m. \tag{2.5}$$
Here, the $g_{i,m}$ operate on the recovered sources to produce white signals and are therefore termed whitening filters; their spectra are the inverse source spectra: $|\tilde{g}_{i,k}|^2 = 1/\langle |\tilde{x}_{i,k}|^2 \rangle$. The resulting signals $u_{i,m}$ are termed whitened sources. Note that the hat symbol that distinguishes recovered $\hat{x}_{i,m}$ from actual $x_{i,m}$ sources (see equation 2.2) has been omitted. In the rest of this article, $x_{i,m}$ denotes a recovered source unless otherwise noted.

In order to construct a generative model of the sensor density, we must provide a model source density. In fact, we shall be modeling the density $p_u$ of the whitened sources. In the following, we shall formulate generative models in the time and frequency domains, as well as a hybrid frequency-time model, resulting in three different error functions and learning rules.

2.1 DCA-IF: Frequency-Domain Generative Model. It is convenient to work in the frequency domain since the problem simplifies there in the following sense. Applying the DFT to equation 2.5, we have
$$\tilde{u}_{i,k} = \tilde{g}_{i,k}\, \tilde{x}_{i,k}, \qquad \tilde{x}_k = G \tilde{y}_k. \tag{2.6}$$
Whereas the time-domain formulation couples the signals at different times $t_m$ (through the convolution in equation 2.5), here we have a separate problem at each $\omega_k$. Of course, these N problems are not independent, since they all involve the same parameters, $g_{i,m}$ and $G_{ij}$. For the whitened source density, we use a factorial frequency-domain model,
$$p_{\tilde{u}} = \prod_{i=1}^{L} \prod_{k=0}^{N/2} P_{i,k}(\tilde{u}_{i,k}), \tag{2.7}$$
where N is assumed even (with no loss of generality). Note that k runs only up to N/2 since $\tilde{u}_{N-k} = \tilde{u}^*_k$ (see equation 1.3). Also, since for $1 \leq k \leq N/2 - 1$ the Fourier components $\tilde{u}_k$ are complex, $P_{i,k}$ is in fact the joint distribution of $\mathrm{Re}(\tilde{u}_{i,k})$ and $\mathrm{Im}(\tilde{u}_{i,k})$.

In appendix A.1 we derive the model sensor density $p_{\tilde{y}}$ (equation A.1) from equations 2.6 and 2.7. As our error function, we choose the Kullback-Leibler (KL) distance (Cover & Thomas, 1991) $E(p^o_{\tilde{y}}, p_{\tilde{y}})$ (equation A.3), an asymmetric measure of the distance between the correct density $p^o_{\tilde{y}}$ and the model $p_{\tilde{y}}$. As shown in appendix A.1, the DCA-IF (I = instantaneous mixing, F = frequency domain) error function is given by
$$E_{\text{DCA-IF}} = -\log|\det G| - \frac{1}{N} \sum_{i=1}^{L} \sum_{k=0}^{N-1} \log|\tilde{g}_{i,k}| - \frac{1}{N} \sum_{i=1}^{L} \sum_{k=0}^{N/2} \log P_{i,k}, \tag{2.8}$$
where the term $\log P_{i,k}$ actually represents $\langle \log P_{i,k}(\tilde{u}_{i,k} = \tilde{g}_{i,k} \sum_j G_{ij}\, \tilde{y}_{j,k}) \rangle$, the average being taken over the observed sensor signals $\tilde{y}_k$. Thus, this term is a function of G, $g_m$, and the functional form of $P_{i,k}$, as is the error function itself. We emphasize that the filters $g_{i,m}$ may have any lengths $M_i$ (i.e., $g_{i,m \geq M_i} = 0$) and are usually much shorter than N.

Before deriving the learning rules, let us make the whitened source model (equation 2.7) more specific. First, we shall use the same parameterized functional form for all sources. This is consistent with our report (Attias & Schreiner, 1997), which showed that a large class of natural sounds are characterized by the same parametric functional form of their frequency-domain density. Second, we shall omit the frequency dependence of $P_{i,k}$. Hence,
$$P_{i,k}(\tilde{u}_{i,k}) = P(\tilde{u}_{i,k}, \xi_i), \tag{2.9}$$
where $\xi_i$ is a vector of parameters for source i. A convenient form for P is a gaussian mixture with the means, variances, and weights of the gaussians contained in $\xi_i$ (see appendix C). Note that the form in equation 2.9 implies white signals, since their power spectra $\langle |\tilde{u}_{i,k}|^2 \rangle$ are frequency independent.

The separation parameters G, $g_m$, and $\xi_i$ should now be optimized to minimize the error in equation 2.8. This minimization can be done iteratively using the gradient-descent method. As discussed in appendix A.1, the learning rules obtained from the ordinary gradient of $E_{\text{DCA-IF}}$ with respect to G and $g_m$ are less efficient than those obtained from the relative gradient (see equation A.9), which therefore constitute the DCA-IF learning rules:
$$\delta^R G = \epsilon G - \epsilon \sum_m D^g_m\, (\phi \times x^T)_{-m}\, G,$$
$$\delta^R g_{i,m} = \epsilon g_{i,m} - \epsilon \sum_n (\phi_i \times u_i)_n\, g_{i,n+m},$$
$$\delta \xi_i = \epsilon\, \frac{1}{N} \sum_{k=0}^{N/2} \frac{\partial \log P_{i,k}}{\partial \xi_i}, \tag{2.10}$$
where $\delta$ and $\delta^R$ denote increments derived from the ordinary and relative gradients of the error, respectively, and $\epsilon$ sets the learning rate. The rule for $\xi_i$ is further specified in appendix C.

Notation. The rule for G is given in matrix form, where $(\phi \times x^T)_{-m}$ is an $L \times L$ matrix whose ij element is the cross-correlation between $x_{i,m}$ and $\phi_{j,m}$ (see equation 1.6 and below). The signal $\phi_{i,m}$, termed the modified whitened source, is a nonlinear function of the whitened source $u_{i,m}$. It is defined in the
frequency domain by equation A.7. The matrix $D^g_m$ (see equation 1.4) is a diagonal matrix containing the separating filters $g_m$. The summation limits on m in equation 2.10 are set by the lengths of $g_m$, $\phi_m$, $x_m$. In component notation, the increment in G is given by $\delta^R G_{ij} = \epsilon G_{ij} - \epsilon \sum_{lmn} g_{i,m}\, \phi_{i,n+m}\, x_{l,n}\, G_{lj}$.

The rules in equation 2.10 have a form common to all DCA learning rules. They involve three kinds of signals: the (recovered) sources $x_m$ and whitened sources $u_m$, related to each other and to the sensors via equation 2.6, and the modified whitened sources $\phi_m$. One can view these signals as forming three successive output layers of a simple network with inputs $y_m$ and weights G, $g_m$. The weight increments are computed by cross-correlating the different outputs across layers and with the weights. Note that those correlations involve high-order sensor statistics, since one output layer is a nonlinear modification of the other.

To interpret the learning rules (2.10), we point out that the rule for G converges when the cross-correlation $(\phi_i \times x_j)_m = 0$ for $i \neq j$, whereas the rule for $g_m$ converges when $(\phi_i \times u_i)_m = 0$. Therefore, the first makes the recovered sources $x_i$, $x_{j \neq i}$ independent, and the second attempts to whiten them. We are not interested in the whitened sources $u_m$ and filters $g_m$ themselves; however, introducing them into the generative model enables the algorithm to exploit high-order spatiotemporal (rather than just spatial) sensor statistics to achieve separation. In fact, in practice the filters $g_{i,m}$ are of lengths $M_i \ll N$ to minimize model complexity, resulting in $u_m$ whose spectra differ from the source spectra but may not be completely white (see Figure 2).

The rules in equation 2.10 can be used in either batch or online learning by processing successive (possibly overlapping) N-point segments of the sensor signals $y_m$. In online mode, the increments are computed from equation 2.10 using the current segment, corresponding to a stochastic gradient-descent minimization of the error function in equation 2.8. In batch mode, the increments are computed by averaging equation 2.10 over a long sequence of the sensor signals, resulting in a deterministic gradient-descent minimization. In practical applications, the computation of equation 2.10 can be accelerated by using the frequency-domain version (equation A.12), where the fast Fourier transform (FFT) can be exploited.

Section 2.4 analyzes the symmetries of the DCA-I error function. Here we only point out that equation 2.8 depends on $G_{ij}$ and $g_{i,0}$ only via their product $g_{i,0} G_{ij}$ and hence possesses a continuous symmetry, which can be avoided by setting $g_{i,0} = 1$. However, this cannot be done explicitly by keeping $\delta^R g_{i,0} = 0$ in a relative-gradient rule such as equation 2.10. To impose this constraint, one must allow $g_{i,0}$ to change by $\delta^R g_{i,0}$ and normalize $G_{ij} \to G_{ij}\, g_{i,0}$ and $g_{i,m} \to g_{i,m}/g_{i,0}$ at each iteration (note that this leaves the error unchanged).

2.2 DCA-IT: Time-Domain Generative Model. We now derive the error function and learning rules for the separation parameters by learning the time-domain generative model (equation 2.5). As in the frequency-domain
case (equation 2.7), we use a factorial form for the whitened source density,
$$p_u = \prod_{i=1}^{L} \prod_{m=0}^{N-1} p_{i,m}(u_{i,m}). \tag{2.11}$$
In appendix A.2 we show that equation 2.11 leads to the model sensor density $p_y$ (equation A.13), which in turn generates the DCA-IT (I = instantaneous mixing, T = time domain) error function,
$$E_{\text{DCA-IT}} = -\log|\det G| - \sum_{i=1}^{L} \log|g_{i,0}| - \frac{1}{N} \sum_{i=1}^{L} \sum_{m=0}^{N-1} \log p_{i,m}, \tag{2.12}$$
where the term $\log p_{i,m}$ in equation 2.12 actually represents $\langle \log p_{i,m}(u_{i,m} = (g_i * \sum_j G_{ij}\, y_j)_m) \rangle$, the average being taken over the observed sensor signals $y_m$. To make the general form (equation 2.11) more specific, we note that, assuming stationary sources, the marginal densities $p_{i,m}$ are independent of the particular time point $t_m$. Also, we use the same functional form for all whitened sources, parameterized by the vector $\xi_i$; hence,
$$p_{i,m}(u_{i,m}) = p(u_{i,m}, \xi_i). \tag{2.13}$$
The learning rules for the separation parameters G, $g_m$, and $\xi_i$ are derived in appendix A.2 for both ordinary- and relative-gradient descent. The relative-gradient rule for G is more efficient but, unlike in the DCA-IF case, the one for $g_m$ is not. Here are the DCA-IT learning rules:
$$\delta^R G = \epsilon G - \epsilon\, \frac{1}{N} \sum_m D^g_m\, (\psi \times x^T)_{-m}\, G,$$
$$\delta g_{i,m} = -\epsilon\, (\psi_i \times x_i)_{-m},$$
$$\delta \xi_i = \epsilon\, \frac{1}{N} \sum_{m=0}^{N-1} \frac{\partial \log p_{i,m}}{\partial \xi_i}. \tag{2.14}$$
The signals $\psi_{i,m}$, also termed modified whitened sources (but different from the $\phi_{i,m}$ appearing in equation 2.10), are nonlinear functions of the whitened sources $u_{i,m}$ and are defined in equation A.17. $D^g_m$ (see equation 1.4) is a diagonal matrix containing $g_m$. These rules should be iterated keeping $g_{i,0} = 1$, $\delta g_{i,0} = 0$. Like the DCA-IF rules, they can be used for either batch or online learning, and their performance is accelerated by exploiting the FFT in their frequency-domain version (see equation A.18).

Note that when considering single-time-point segments (N = 1) such that $g_{i,m} = \delta_{m,0}$, with fixed whitened-source densities $p_{i,m}$, the rules for $g_m$ and $\xi_i$ in equation 2.14 become irrelevant, and the rule for G reduces to Bell and Sejnowski's (1995) ICA rule.
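To make the structure of these rules concrete, here is a schematic Python implementation of a single batch step of equation 2.14. The modified whitened source $\psi$ is defined in equation A.17 through the source density p, which is not reproduced here; the sketch substitutes the common stand-in $\psi(u) = \tanh(u)$, so it illustrates the update structure rather than the authors' exact rule, and the per-sample normalization of the $g$ update is an implementation choice.

```python
import numpy as np

def dca_it_step(G, g, y, eps=0.05):
    """One relative-gradient step in the spirit of the DCA-IT rules (eq. 2.14).
    G: (L, L) separating matrix; g: (L, M) whitening filters with g[:, 0] = 1;
    y: (L, N) block of sensor signals."""
    L, N = y.shape
    M = g.shape[1]
    x = G @ y                                                      # recovered sources, eq. 2.5
    u = np.array([np.convolve(x[i], g[i])[:N] for i in range(L)])  # u_i = g_i * x_i
    psi = np.tanh(u)                                               # stand-in for eq. A.17

    # delta^R G = eps G - eps (1/N) sum_m D^g_m (psi x x^T)_{-m} G
    dG = np.zeros_like(G)
    for m in range(M):
        C = psi[:, m:] @ x[:, :N - m or None].T    # (psi x x^T)_{-m} at lag m
        dG += np.diag(g[:, m]) @ C
    G = G + eps * G - (eps / N) * dG @ G

    # delta g_{i,m} = -eps (psi_i x x_i)_{-m}, keeping g_{i,0} = 1 fixed
    for i in range(L):
        for m in range(1, M):
            g[i, m] -= eps * np.mean(psi[i, m:] * x[i, :N - m])
    return G, g
```

In batch mode, one would average such increments over many N-point blocks before applying them, as described above.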
2.3 DCA-IFT: Frequency/Time-Domain Generative Model. In the DCA-IF error function, both the sources and the whitening filters appear as frequency-domain quantities $\tilde{u}_k$, $\tilde{g}_k$, whereas in the DCA-IT error, they both appear as time-domain quantities $u_m$, $g_m$. Two hybrid error functions can also be derived. The DCA-IFT error, derived in appendix A.3, includes the sources and whitening filters as time- and frequency-domain quantities, respectively:
$$E_{\text{DCA-IFT}} = -\log|\det G| - \frac{1}{N} \sum_{i=1}^{L} \sum_{k=0}^{N-1} \log|\tilde{g}_{i,k}| - \frac{1}{N} \sum_{i=1}^{L} \sum_{m=0}^{N-1} \log p_{i,m}, \tag{2.15}$$
where the full form of $p_{i,m}$ is specified below equation 2.12. Notice that equation 2.15 is a cross between equations 2.8 and 2.12. The DCA-IFT learning rules are similarly a cross between equations 2.10 and 2.14:
$$\delta^R G = \epsilon G - \epsilon \sum_m D^g_m\, (\psi \times x^T)_{-m}\, G,$$
$$\delta^R g_{i,m} = \epsilon g_{i,m} - \epsilon \sum_n (\psi_i \times u_i)_n\, g_{i,n+m},$$
$$\delta \xi_i = \epsilon\, \frac{1}{N} \sum_{m=0}^{N-1} \frac{\partial \log p_{i,m}}{\partial \xi_i}. \tag{2.16}$$
We present this hybrid error not merely to make a technical point. A comparison among all the DCA-I learning rules, presented in section 2.5, demonstrates that the DCA-IFT rules are the most efficient. A fourth error function, DCA-ITF, which combines frequency-domain sources with time-domain filters, can also be derived, but it results in slow learning.

2.4 Symmetries in Parameter Space. It is well known that the sources can be recovered only to within a scaling factor and an order permutation. Indeed, the observed signals $y_{i,m}$ in equation 2.1 could also have arisen from the scaled and permuted sources $x'_{j,m} = \lambda_j\, x_{\pi(j),m}$ and the mixing matrix $H'_{ij} = H_{i\pi(j)}/\lambda_j$, where $\lambda_j \neq 0$ are arbitrary factors and $\pi(j)$ is an arbitrary permutation of $1, \ldots, L$. This implies that there exists an infinite family of separating matrices $G \in \{G'\}$ that cannot be distinguished from $G = H^{-1}$ on the basis of the observed signals alone, and whose application to the $y_m$ would recover the original sources while modifying their intensities and order. This indistinguishability is manifested in the corresponding error function E as a family of global minima, since all separating matrices $G'$ satisfy
$E(G') = E(H^{-1}) = \min_G E(G)$. Hence, the global minimum of the error is invariant under the family of transformations $H^{-1} \to G'$. In fact, not only the minimum but the error function itself is invariant under this family, as we shall see shortly. In other words, the error possesses symmetries in parameter space.

The symmetries of the error function may be discrete (e.g., permutation) or continuous (e.g., scaling). The presence of symmetries does not cause a problem in principle, since the resulting family of minima includes only separating solutions. In practice, however, a continuous symmetry may lead to slow convergence, since the algorithm may spend long periods of time on equipotential surfaces. It is therefore advantageous for the error to have minimal symmetry. In the following we analyze the symmetries of the DCA-I error functions.

It is easy to see that the errors DCA-IF (equation 2.8), DCA-IT (equation 2.12), and DCA-IFT (equation 2.15) all possess permutation symmetry; that is, they are invariant under the transformation $G_{ij} \to G'_{ij} = G_{\pi(i)j}$, $g_{i,m} \to g'_{i,m} = g_{\pi(i),m}$, $\xi_i \to \xi'_i = \xi_{\pi(i)}$ for an arbitrary permutation $\pi$. Regarding continuous symmetries, we observe from equation 2.6 that transforming to $G'_{ij} = \alpha_i G_{ij}$, $\tilde{g}'_{i,k} = \tilde{z}_{i,k}\, \tilde{g}_{i,k}$ for arbitrary $\alpha_i, \tilde{z}_{i,k} \neq 0$ results in $\tilde{u}'_k$ with spectra $\langle |\tilde{u}'_{i,k}|^2 \rangle = \alpha_i^2\, |\tilde{z}_{i,k}|^2\, \langle |\tilde{u}_{i,k}|^2 \rangle$. However, the form of equation 2.9 (or 2.13) restricts those spectra to be white (i.e., $\omega_k$-independent); thus, only $\tilde{z}_{i,k} = \beta_i e^{i\theta_{i,k}}$ with $\beta_i > 0$ will leave the errors invariant. We are left with the continuous symmetries parameterized by $\alpha_i$, $\beta_i$, and $\theta_{i,k}$.

To fix $\beta_i$, we choose the source parameters $\xi_i$ so as to make the variance $\langle |\tilde{u}_{i,k}|^2 \rangle$ computed from P of equation 2.9 (or, equivalently, $\langle u^2_{i,m} \rangle$ computed from p of equation 2.13) a source-independent constant, such as 1. Then $\beta_i = 1/|\alpha_i|$. Next, to fix $\alpha_i$, we allow only whitening filters that satisfy $g_{i,0} = 1$. Thus, the above transformation $\tilde{g}'_{i,k} = (e^{i\theta_{i,k}}/\alpha_i)\, \tilde{g}_{i,k}$ will leave the error invariant only if $\alpha_i = \sum_k e^{i\theta_{i,k}}\, \tilde{g}_{i,k}/N$. Finally, we note that for general P and p, the errors are not invariant under a change in the phases $\theta_{i,k}$, which therefore do not form a continuous symmetry. However, certain choices of P do leave the error phase invariant, for example, $P \propto e^{-|\tilde{u}_{i,k}|}$. In such cases we can restrict the phases $\theta_{i,k}$ by choosing the lengths $M_i$ of the whitening filters $g_{i,m}$ to be much smaller than N, the number of time points (or frequencies), thus imposing the $L(N - M)$ conditions $g_{i,m} = 0$ for $M \leq m \leq N - 1$ on the $LN/2$ phases. Note that $N \gg M_i$ still does not guarantee a unique solution for $\theta_{i,k}$, although it limits the allowed phases to a small and possibly discrete set.

2.5 Results. We used the DCA-I rules to separate an instantaneous mixture of four 10-second-long signals. Two speech signals $x_{3,m}$, $x_{4,m}$ were obtained from a commercial CD at the original sampling rate of 44.1 kHz and down-sampled to $f_s = 4.41$ kHz. In addition, two colored gaussian signals $x_{1,m}$, $x_{2,m}$ were created by generating temporally independent signals $u_{i,m}$
Figure 1: Probability densities of the colored gaussian signals $x_1$, $x_2$, and the speech signals $x_3$, $x_4$ (solid line), compared with a gaussian (dotted line). All signals are scaled to have unit variance.
by a random number generator at this sampling rate; these were colored by filters $h_{i,m}$ (see equation 2.3), making each signal temporally correlated. The densities of those signals, scaled to have unit variance, are shown in Figure 1. As is typical for natural sounds (Attias & Schreiner, 1997), the speech signals have a sharply peaked density. The signals were mixed by the arbitrary matrix
$$H = \begin{pmatrix} 0.4380 & 0.7884 & -0.3542 & -0.5844 \\ 0.9691 & -1.5841 & -0.8480 & 0.4336 \\ -0.3856 & -0.0050 & 0.5450 & -0.7037 \\ -0.7057 & -0.8339 & -0.1307 & -1.2325 \end{pmatrix}. \tag{2.17}$$
We iterated the learning rules in batch mode, using N = 512-point (116 msec) overlapping blocks to compute the required cross-correlations. We started from a random matrix G and filters $g_m$ with $0 \leq m \leq 4$, kept a constant learning rate $\epsilon = 0.05$, and stopped when the relative increments of the separating matrix, $\frac{1}{\epsilon}\big[\sum_{ij} (\delta G_{ij})^2 / \sum_{ij} G_{ij}^2\big]^{1/2}$, and of the whitening filters, $\frac{1}{\epsilon}\big[\sum_{im} (\delta g_{i,m})^2 / \sum_{im} g_{i,m}^2\big]^{1/2}$, both decreased below $10^{-4}$ (see Figure 3 for the exponential convergence of the error functions). The resulting separating matrices, when operating on H, produced
$$G^{\text{DCA-IF}} H = \begin{pmatrix} 2.5673 & 0.0420 & -0.0261 & -0.0019 \\ -0.0123 & -2.7358 & -0.0295 & -0.0112 \\ 0.0006 & -0.0155 & 7.4880 & -0.0172 \\ -0.0158 & 0.0030 & -0.0041 & -4.5740 \end{pmatrix}, \tag{2.18}$$
$$G^{\text{DCA-IT}} H = \begin{pmatrix} 1.9891 & 0.0309 & -0.0207 & 0.0031 \\ -0.0062 & -2.1202 & -0.0299 & -0.0157 \\ 0.0018 & -0.0191 & 4.7581 & -0.0074 \\ -0.0059 & 0.0101 & -0.0086 & -2.8909 \end{pmatrix}, \tag{2.19}$$
$$G^{\text{DCA-IFT}} H = \begin{pmatrix} 1.9909 & 0.0303 & -0.0196 & 0.0027 \\ -0.0072 & -2.1326 & -0.0293 & -0.0155 \\ -0.0029 & -0.0172 & 5.0227 & -0.0069 \\ -0.0077 & 0.0062 & -0.0070 & -2.8590 \end{pmatrix}. \tag{2.20}$$
Notice that each row in the three matrices above has a single element, on the diagonal, that is significantly different from zero. Thus, the separated signals $G y_m$ would each consist of one of the (scaled) sources, with the rest attenuated by more than 30 dB. A larger attenuation can generally be obtained by increasing the sample size. No attempt was made to optimize the minimization process, for example, by reducing $\epsilon$ with increasing step number.

The two gaussian signals were also separated by DCA-I. To emphasize this point, we applied Bell and Sejnowski's (1995) ICA to the sensors, resulting in
$$G^{\text{ICA}} H = \begin{pmatrix} 0.9559 & 1.4828 & 0.0316 & 0.0175 \\ 1.4662 & -0.9258 & -0.0506 & -0.0380 \\ 0.0078 & -0.0432 & 1.9740 & -0.0056 \\ -0.0317 & 0.0026 & -0.0042 & -1.9677 \end{pmatrix}. \tag{2.21}$$
Whereas each of the speech signals was separated by ICA (see the bottom-right $2 \times 2$ block), the gaussian signals remained mixed, as manifested by the nondiagonal top-left $2 \times 2$ block in equation 2.21. This illustrates the fact, pointed out by Pearlmutter and Parra (1997), that a mixture of gaussian signals can be separated only by exploiting their temporal statistics, which is used by, for example, DCA-IT (equation 2.14) through the cross-correlation $(\psi \times x^T)_m$, but is ignored by ICA.

In Figure 2 we show the power spectra of the recovered sources $x_m$ after processing them by the whitening filters $g_m$ (i.e., $\langle |\tilde{u}_k|^2 \rangle$ for the signals $u_m$ in equation 2.5). Those spectra were computed using a 64-point DFT with overlapping windows. A comparison with the original source spectra shows that the $g_m$ indeed act to equalize the power at all frequencies. Since the learned filters were short (5 time points) compared to the autocorrelation time of the speech sources, the latter have not been completely whitened.

To further demonstrate the effectiveness of DCA-I, we mixed two white signals $x_{1,m}$, $x_{2,m}$ with a uniform density ($p(x_i) = 1$ for $-0.5 \leq x_i \leq 0.5$ and $p(x_i) = 0$ otherwise) by the matrix H (see equation 2.22 below) and
Figure 2: Power spectra of the whitened sources obtained by DCA-IF (solid line) compared with the original sources (dotted line). Five-point whitening filters were used.
Figure 3: Convergence of the DCA-IF (solid line), DCA-IT (dashed-dotted line), and DCA-IFT (dashed line) error functions to their minimum.
applied the DCA-IFT learning rules to the mixtures, resulting in
$$H = \begin{pmatrix} -1.0107 & 0.6912 \\ -0.9605 & -0.7586 \end{pmatrix}, \qquad G^{\text{DCA-IFT}} H = \begin{pmatrix} -1.8699 & 0.0052 \\ -0.0113 & -1.9457 \end{pmatrix}. \tag{2.22}$$
The nearly vanishing nondiagonal elements in the matrix on the right indicate almost complete separation, with more than 30 dB attenuation of the undesired sources. To illustrate the separation performed by DCA-IFT, we display in Figure 4 the joint density of the mixtures, $p_y(y_1, y_2)$, against the joint density of the separated sources, $p_x(x_1, x_2)$. Note that uniform sources have negative kurtosis and therefore cannot be separated by ICA (Bell & Sejnowski, 1995).
Figure 4: Joint density $p_y(y_1, y_2)$ of sensor (left) and $p_x(x_1, x_2)$ of separated source (right) signals, for instantaneous mixing of two uniformly distributed sources.
3 Convolutive Mixing In realistic situations, the signal from a given source arrives at the different sensors at different times due to propagation delays. Denoting by dij the time required for propagation from source j to sensor i, this mixing situation can be described by yi,m =
L X
Hij xj,m−dij .
(3.1)
j=1
More generally, sensor i may receive several progressively delayed and attenuated versions of source signal j, due to the multipath propagation in a reflective environment, creating multiple echoes. Each version may also be distorted by the frequency response of the propagation medium and sensors. This general convolutive mixing situation is described by ym = (H ∗ x)m ,
(3.2)
P P which in component notation reads yi,m = j (Hij ∗ xj )m = jn Hij,n xj,m−n . Convolutive mixing thus involves mixing coupled with filtering. Technically, the simple mixing matrix H of the instantaneous case (see equation 2.1) becomes a matrix of filters Hm , where Hij,m is the impulse response of the filter operating on source signal j on its way to sensor i. Viewed another way, equation 3.2 says that the sensor signals ym at time tm are generated not only by a mixture of the source signals xm originating at the same time, but also by mixtures of the delayed signals xm−n that originated at previous time tm − tn , with a different mixing matrix Hn for each delay tn≥0 . However, the description of equation 3.2 is problematic since it does not determine the mixing filters Hm uniquely. This can be seen when expressing
1390
H. Attias and C. E. Schreiner
the sources as filtered white processes, as in equation 2.3: ym = (H ∗ x)m ,
xi,m = (hi ∗ ui )m .
(3.3)
P The sensor signals yi,m = j (Hij0 ∗uj )m are thus given by mixing and filtering 0 = (Hij ∗ hj )m . the whitened sources ui,m using the mixing filter matrix Hij,m However, there exists a whole family of filters Hij,m , hj,m that produce the 0 upon convolution, only one of which corresponds to the situaactual Hij,m tion at hand. This can be seen clearly in the frequency domain where H˜ 0 = ij,k
H˜ ij,k h˜j,k : for any set of complex z˜j,k 6= 0, one can transform H˜ ij,k → H˜ ij,k z˜j,k and h˜j,k → h˜j,k /˜zj,k while leaving the observed yi,m unchanged. In other words, given the observed signals alone, the spectra and phases of the sources are indistinguishable from those of the mixing filters. Hence, in the absence of any information about the convolutive mixing process and the sources (the blind case), only the whitened sources can be recovered. Section 4 on semiblind separation shows how such information, when available, can be incorporated into the separation method to facilitate recovering the sources unwhitened. In this section we shall be focusing on estimating the separating filter matrix Gij,m , which recovers the whitened sources ui,m through um = (G ∗ y)m ,
(3.4)
where the properties of ui,m are described in equation 2.4. As in the instantaneous case, the problem will be given an optimization formulation by learning a generative model in either the frequency, time, or frequency-time domains, resulting in different error functions and learning rules. 3.1 DCA-CF: Frequency-Domain Generative Model. In the frequency domain, equation 3.4 becomes ˜ k y˜ k . u˜ k = G
(3.5)
A comparison with equation 2.6 shows that the separating matrix and filters ˜ k for G, g˜ k of the instantaneous case are generalized to a matrix of filters G convolutive mixing. To generate a model sensor density py˜ , we start from the factorial whitened source density pu˜ (equation 2.7). In appendix B.1 we derive py˜ (equation B.1) and the resulting DCA-CF (C = convolutive mixing, F = frequency domain) error function, EDCA-CF = −
N/2 L X X X 1 N−1 ˜ k| − 1 log |det G log Pi,k , N k=0 N i=1 k=0
(3.6)
Blind Source Separation and Deconvolution
1391
P where log Pi,k stands for hlog Pi,k (u˜ i,k = j G˜ ij,k y˜j,k )i and the average is taken over the observed y˜ k . The filters Gm are M-point long (i.e., Gm may be nonzero only for m = 0, . . . , M − 1) and are usually much shorter than N. The error in equation 3.6 is now minimized with respect to Gm and ξ i using the gradient-descent method. As in the instantaneous case, we show in appendix B.1 that the resulting learning rule for Gm (see equation B.6) derived from the ordinary gradient of EDCA-CF is quite expensive, requiring ˜ k for each of the N/2 frequencies the inversion of the complex L × L matrix G ωk at each iteration. However, it is shown there that the concept of the relative gradient, introduced for the DCA-I rules, can be extended to the convolutive case and produce efficient rules that avoid matrix inversions. The resulting DCA-CF learning rules are given by δ R Gm = ²Gm − ²
X (φ × uT )n Gn+m , n
δξi = ²
N/2 1 X ∂ log Pi,k , N k=0 ∂ ξi
(3.7)
where φi,m are the modified whitened sources defined in equation A.7 and the parameterized form of equation 2.9 was used. The rule for the adaptive source density parameters ξ i is further specified in appendix P C. In component notation, the increment in Gm is δGij,m = ²Gij,m − ² lnn0 φi,n ul,n+n0 Glj,n0 +m . Note the formal similarity between this rule and the DCA-IF rule for gi,m in equation 2.10, which stems from the fact that ui,m = (gi ∗ xi )m in equation 2.5 is the one-dimensional version of equation 3.4 with ym replaced by xm . The rules in equation 3.7 have the form common to all DCA learning rules (compare with DCA-IF—equation 2.10—and the comments following it). It involves the whitened (recovered) sources um , related to the sensors via equation 3.5 and the modified whitened sources φm . Those signals can be viewed as forming successive output layers of a simple network with inputs ym and weights Gm . The weight increments are computed by crosscorrelating the different outputs across layers and with the weights. Those cross-correlations involve high-order sensor statistics since one output layer is a nonlinear function of the other. Note that the rules in equation 3.7 converge when the cross-correlation (φi × uj )m = δij δm,0 , meaning ui,m are mutually independent and white. 3.2 DCA-CT: Time-Domain Generative Model. To construct a timedomain model sensor density py corresponding to equation 3.4, we borrow the factorial whitened source density (see equation 2.11) from the instantaneous case. In section B.2 we derive py (equation B.10), from which the DCA-CT (C = convolutive mixing, T = time domain) error function is
1392
H. Attias and C. E. Schreiner
obtained: L N−1 X 1 X log pi,m , (3.8) N i=1 m=0 P where log pi,m stands for hlog pi,m (ui,m = jn Gij,n yj,m−n )i and the average is taken over the observed ym . The DCA-CT learning rules for the separation parameters, derived from equation B.10, are given by
EDCA-CT = − log |det G0 | −
δ R G0 = ²G0 − ²(ψ × yT )0 GT0 G0 , δξi = ²
δGm>0 = −²(ψ × yT )−m ,
X ∂ log pi,m 1 N−1 , N m=0 ∂ ξi
(3.9)
where ψi,m are the modified whitened sources defined by equation A.17, and the parameterized P form in equation 2.13 was used. In component notation (ψi × yi )−m = n ψi,n yj,n−m . The rules in equation 3.9 are ordinary-gradient learning rules except for m = 0. As in the case of gm in equation 2.14, the optimization of the timedomain error does not benefit from the relative gradient approach; in fact, as shown in appendix B.2, the relative-gradient rule for Gm (equation B.16) is more complicated and less efficient than equation 3.9. 3.3 DCA-CFT: Frequency/Time-Domain Generative Model. As in the instantaneous case, a hybrid frequency/time error function can also be derived, which includes the separating filters in the frequency and the whitened sources in the time domain. This error, which is a cross between equations 3.6 and 3.8, is given by (see appendix B.3) EDCA-CFT = −
L N−1 X X X 1 N−1 ˜ k| − 1 log |det G log pi,m , N k=0 N i=1 m=0
(3.10)
where the full expression for pi,m0 is given below equation 3.8. The corresponding DCA-CFT learning rule for Gm , obtained from the relative gradient of equation 3.10 is X (ψ × uT )n Gn+m , δ R Gm = ²Gm − ² n
δξi = ²
X ∂ log pi,m 1 N−1 , N m=0 ∂ ξi
(3.11)
and has the same form as the DCA-CF rule (see equation 3.7) but with the modified whitened sources φm (see equation A.7), defined in terms of their frequency-domain density, replaced by ψ m (see equation A.17), which is derived from their time-domain density.
Blind Source Separation and Deconvolution
1393
3.4 Symmetries in Parameter Space. Whereas for instantaneous mixing the sources can be recovered to within an order permutation and scaling, in the convolutive case (as mentioned below equation 3.3) the scaling becomes arbitrary filtering. This is manifested in the error corresponding to P P P ˜ k |/N − ˜ i,k |/N − im log pi,m /N (equation 3.3), E = − k log |det G ik log | g (in the hybrid FT approach; compare to the semiblind error in equation 4.4 and the surrounding discussion), which is invariant under the transformation g˜ i,k → z˜ i,k g˜ i,k , G˜ ij,k → G˜ ij,k /˜zi,k , with the frequency-dependent scaling factors z˜ i,k being the arbitrary filters zi,m , as long as the separating filters are not constrained. Hence, the arguments of the DCA-C error functions do not include the whitening filters gi,m . In section 4 we reintroduce the latter and show how to make inaccessible as many zi,m as possible by constraining the separating filters according to available information. The discussion of the instantaneous case in section 2.4 can be repeated to show that, after fixing the source parameters ξ i (in equations 2.9 and 2.13) to make the variances h|u˜ i,k |2 i and hu2i,m i source independent, we are left with the order permutation G˜ ij,k → G˜ π(i)j,k and the continuous transformation G˜ ij,k → eiθi,k G˜ ij,k . However, the latter does not leave the errors invariant, except for special choices of the whitened source densities. The DCA-CF error (equation 3.6) forms an exception, however, since it appears to be invariant not only under the source ordering π but under an arbitrary permutation πk at each frequency, which may lead to low separation quality. Note that the errors DCA-CT,CFT do not possess this permutation invariance. Nevertheless, this invariance can be restricted by choosing the length M of the separating filters Gij,m to be much smaller than N, thus imposing the L2 (N − M) conditions Gij,m = 0 for M ≤ m ≤ N − 1. If P Gij,m are the desired separating filters, then G0ij,m = k eiωk m G˜ πk (i)j,k /N, albeit formally leaving the error invariant, are generally longer than M for M/N sufficiently small, and thus correspond to inaccessible minima. In practical experiments, we found DCA-CF to be no less effective than CT and CFT. 3.5 Results. To demonstrate the performance of the DCA-C algorithms, we present an experiment in which we applied them to a convolutive mixture of speech signals. We mixed two 10-sec-long signals, obtained from a commercial CD at the original sampling rate of 44.1 kHz and downsampled to fs = 4.41 kHz, by filters Hm , whose impulse response is displayed in Figure 5. We then used the learning rules in equations 3.7, 3.9, and 3.11 to find the separating filters Gm . The signals were processed in N = 512-point overlapping blocks with a constant learning rate of ² = 0.05. The iteration stopped when the relative increment of separating filters, P 1 P 2 ]1/2 , decreased below 10−4 . 2/ [ (δG ) G ij,m ijm ijm ij,m ² It is important to allow the learned separating filters to be sufficiently
1394
H. Attias and C. E. Schreiner
Figure 5: Impulse response of the mixing filters Hij,m .
long to be able to invert the mixing filters. In this example, we chose 60point separating filters. For the frequency-domain whitened source density √employed in DCA-CF, we used the exponential form P(u˜ i,k ) ∝ e−|u˜ i,k |/ N , where the N-scaling arises from the fact that h|u˜ i,k |2 i = Nhu2i,m i. This form approximates well the density of a large class of natural sounds, as we reported in Attias and We also experimented with √ Schreiner (1997). √ P(u˜ i,k ) = P1 (Re(u˜ i,k )/ N)P1 (Im(u˜ i,k )/ N), where P was either the sigmoidderivative form P(v) ∝ e−v /(1 + e−v )2 used by Bell and Sejnowski (1995), or a mixture of gaussians parameterized by ξ i (see Appendix C). However, the simple exponential form was sufficient to achieve separation in all our experiments with DCA-CF. Similarly, for DCA-CT,CFT, we found the simple exponential form p(ui,m ) = e−|ui,m | to be appropriate. To demonstrate that separation has actually been accomplished, we present in Figure 6 the convolution with the mixing filters (G ? H)ij,m = P DCA-CF , GDCA-CT , ln Gil,n Hlj,m−n of all three separating filters G = G
Blind Source Separation and Deconvolution
1395
Figure 6: Convolution of the separating filters Gij,m learned by DCA-CF (solid line), DCA-CT (dashed-dotted line), and DCA-CFT (dashed line) with the mixing filters Hij,m .
GDCA-CFT . The nondiagonal filters (G ? H)i6= j,m are strongly attenuated (≥ 30 dB) compared to the diagonal ones, indicating high separation quality with low cross-talk. Note that the recovered sources have modified power spectra, as is evident from the fact that (G ? H)ii,m 6= 0 for m > 0. The separating filters learned by DCA-CF are shown for illustration in Figure 7 and are similar to those learned by the other two algorithms. An interesting comparison between the different algorithms is presented in Figure 8, which shows that the frequency-domain algorithms achieve separation significantly faster than the time-domain one. In particular, the hybrid approach DCA-CFT was the fastest to converge. Note that this figure shows batch learning; convergence in online mode for DCA-CF,CFT was achieved in 15 to 20 passes through the data. To emphasize the efficiency of the relative-gradient rules, we also plot the minimization of the DCA-CFT error using the ordinary-gradient version of the DCA-CF rule (see equation B.6), which is as slow as the time-domain rule.
1396
H. Attias and C. E. Schreiner
Figure 7: Impulse response of the separating filters learned by DCA-CF.
Figure 8: Convergence of the DCA-CF (solid line), DCA-CT (short-dashed line), and DCA-CFT (dashed line) error functions to their minimum. Compare with convergence using the ordinary-gradient version of the DCA-CF learning rule (dotted line).
Blind Source Separation and Deconvolution
1397
4 DCA-CS: Semiblind separation In the previous section we derived learning rules for the separating filters Gm of convolutive mixtures. Those rules were made efficient by exploiting the relative-gradient concept. However, it is often advantageous to learn the mixing filters Hm rather than Gm , for example, when the latter are much longer. As an example, consider a situation where the mixing includes a single time-point delay, ¶ ¶ µ µ 1 0 a b = I, H1 = , (4.1) H0 = 0 1 b a ˜ k = I + e−iωk H1 . The and Hm = 0 for m ≥ 2. In the frequency domain H −1 ˜ and expanding in powers of ˜k =H separating filters are obtained using G k H: Gm =
∞ ∞ X X X 1 N−1 eiωk m e−iωk n (−H1 )n = (−H1 )m+Nl N k=0 n=0 l=0
= [I − (−H1 )N ]−1 (−H1 )m ,
(4.2) P
P
where the second equality was obtained using k eiωk (m−n) /N = l δn,m+Nl P and the last one using the identity l≥0 Al = (I − A)−1 , valid for any matrix A as long as the sum converges. The latter condition depends on the eigenvalues of H1 , which are λ± = a±|b|, and is satisfied when |λ± | < 1. However, convergence guarantees the existence of causal and finite separating filters, but implies nothing about their length, which is determined by the eigenvalues of H1 . In particular, we notice from equation 4.2 that Gm goes to zero at a rate that depends on the largest eigenvalue: Gm ∼ (max λ± )m , which, if |max λ± | is close to 1, would result in very long separating filters, even though the mixing filters are very short. In this case, learning Hm requires a significantly smaller number of adaptive parameters. Another advantage of learning Hm is that, in some cases, it may be combined with learning the source filters hm in equation 3.3 and facilitate recovering the sources without whitening. In general, the formulation of the blind separation problem assumes that nothing is known about the mixing situation, except for the statistical independence of the sources. Hence we must allow an arbitrary mixing filter matrix Hm in equation 3.2 and, consequently, the distinction between the source filters hm and the mixing filters is lost. However, when some information is available about the mixing process, it can be incorporated into the learned mixing filters Hm , enabling the learning of hm as well. We now derive the learning rules for such semiblind cases. In the frequency domain, the description of equation 3.3 gives ˜ k x˜ k , y˜ k = H
x˜ i,k = h˜ i,k u˜ i,k ,
(4.3)
1398
H. Attias and C. E. Schreiner
producing a relation between the time-domain densities py (y) and pu (u): Q ˜ −1 Q Q ˜ k |−1 py (y) = k |det H ik |hi,k | im pi,m (ui,m ). The resulting DCA-CS (C = convolutive mixing, S = semiblind separation) error function is therefore analogous to the hybrid DCA-CFT error (see equation 3.10) and is given by EDCA−CS =
N−1 X X 1 N−1 ˜ k| + 1 log |det H log |h˜ i,k | N k=0 N k=0
−
L XX 1 N−1 log pi,m , N m=0 i=1
(4.4)
P P where the term pi,m represents pi,m = hp(ui,m = n gi,n jl Gij,l yj,m−n−l )i and the average is taken over the observed ym . We define Gm and gm as the inverses of Hm and hm by ˜ −1 , ˜k =H G k
g˜ i,k =
1 . ˜hi,k
(4.5)
The DCA-CS learning rules are obtained from the ordinary gradient of equation 4.4 and are given here in terms of frequency-domain signals: ? ˜ x˜ † ), ˜ † (I − Dg˜ ψ ˜ k = −² G δH k k k k
δ h˜ i,k = −² g˜ ?i,k (1 − ψ˜ i,k u˜ ?i,k ),
(4.6)
g˜
where Dk is the diagonal matrix containing the whitening filters (see equation 1.4) and the modified whitened sources ψ m are defined by equation A.17. The rule for the source parameters ξ i is the same as the one in equation 3.11. The time-domain increments are obtained by inverse DFT, for example, P ˜ k /N. δHm = k eiωk m δ H To derive the rules in equation 4.4, we used the relations ∂ G˜ ij,k = −G˜ il,k G˜ pj,k , ∂Hlp,k X 1 N−1 ∂ui,m =− eiωk (m−n) g˜ i,k G˜ ij,k x˜ l,k , ∂Hjl,n N k=0 X 1 N−1 ∂ui,m =− eiωk (m−n) g˜ 2i,k x˜ i,k δij , ∂hj,n N k=0 derived from equations 4.3 and 4.5.
(4.7)
Blind Source Separation and Deconvolution
1399
Information available about the mixing process can now be incorporated into equation 4.6. For instance, if the mixing consists of a small number of echoes separated by known intervals, the appropriate elements of Hij,m are set to zero as the learning rules in equation 4.6 are being iterated. As another example, a useful application emerges in situations where a few-parameter description of the mixing process is available, based, for example, on the physical properties of the propagation medium. In this case, the dependence of the mixing filters on the physical parameters, denoted α, is given in a functional form Hm = Hm (α). ThePlearning rules for those parameters are deduced from equation 4.6: δ α = ijm δHij,m ∂Hij,m /∂ α. Of course, the rules can also be used to recover whitened sources, by learning only Hm , setting hi,m = δm,0 and fixing δhi,m = 0. 4.1 Results. Here we present a test of DCA-CS on a convolutive mixture of two speech signals and one colored gaussian signal. The mixing filters had equally spaced taps, with the nonzero elements of Hm being
H0 = I,
0.3 0.3 0.6 0.3 , −0.4 −0.5 −0.4 −0.2 . 0.4
−0.4 H3 = −0.3 0.4
0.2 H6 = −0.3 0.2
−0.3 0.5 −0.2
(4.8)
Thus, each mixture contained two or three delayed versions of each source, creating multiple echoes. We applied the rules in equation 4.6 to the resulting mixtures, simulating the situation in which information about the structure of the mixing filters (i.e., the tap spacing) is available by fixing the appropriate elements of the learned Hm at zero. Source filters hm of length 5 were allowed. We worked in batch mode using N = 512-point time blocks. Figure 9 shows that the algorithm learned the mixing filters successfully. To illustrate the role of the source filters hm , we display their 64-point DFT spectra |h˜ k |2 in Figure 10, together with the source spectra h|˜xk |2 i, computed using 64-point DFT with overlapping windows. Those filters model the source autocorrelations. More precise correspondence than shown in the figure requires longer hm , but the 5-point filters used here were sufficient to achieve separation in this situation. It is interesting to compare DCA-CS with an algorithm that learns the separating filters, such as DCA-CFT, in a situation where the latter are much longer than the mixing filters, as discussed above. We used two-point mixing Hm of the form of equation 4.1 with a = .47, b = .50, to mix two white signals x1,m , x2,m with an exponential density p(xi ) ∝ e−|xi | , sampled at each time point independently. The required length of Gm is determined by the
1400
H. Attias and C. E. Schreiner
Figure 9: Impulse response of the filters learned by DCA-CS (solid line) compared with the original mixing filters (dotted line).
Figure 10: Spectra of the DCA-CS source filters (solid line) compared with the original source spectra (dotted line).
Blind Source Separation and Deconvolution
1401
Figure 11: Left: Separating filter G21,m learned by DCA-CFT for two-point mixing. Right: Convergence of the DCA-CS (solid line) and DCA-CFT (dashed line) error functions to their minimum.
largest eigenvalue of H1 , λ+ = .97, to be ∼ 100; for illustration, on the left of Figure 11 we present G21,m learned by DCA-CFT (equation 3.11). As shown on the right, DCA-CFT required more than three times as many iterations to learn Gm as DCA-CS (used with hi,m = δm,0 ) required to learn Hm . However, the superiority of DCA-CS in this situation notwithstanding, this result also underscores the efficiency of relative-gradient algorithms. 5 DCA-CR: Separation of Rational Mixtures The algorithms presented in section 3 for blind separation of convolutive separating filters, mixtures are designed to learn the coefficients Gij,m of theP which then produce the (whitened) sources via ui,m = j (Gij ? yj )m . This approach becomes problematic in situations where the required separating filters are very long, increasing the number of adaptive parameters and possibly of undesired local minima of the error. One approach to overcome this problem, described in section 4, is to learn the coefficients of the mixing filters Hij,m rather than Gij,m . This approach is advantageous when the mixing filters are significantly shorter and has the additional benefit that information on the mixing process can be taken into account, facilitating the recovery of the sources unwhitened. Here we present a different approach to learning long separating filters, which is not restricted to short mixing situations. The key idea is to use a more general model for the separating filters Gij,m . Its form is given in the frequency domain in terms of two filter matrices Am , Bm by G˜ ij,k =
P −iω m B˜ ij,k e k Bij,m = P m −iω m , k A A˜ ij,k ij,m me
(5.1)
1402
H. Attias and C. E. Schreiner
and is termed rational system function (Oppenheim & Schafer, 1989). The task is to learn Am and Bm from the observed sensor signals ym . Notice that when Am = Iδm,0 , learning Bm reduces to the learning of Gm in the models of section 3. However, Am 6= 0 can generate arbitrarily long filters for even a small number of m ≥ 1. Q ˜ k |pu (u), the DCA-CR (C = convolutive Since we have py (y) = k |det G mixing, R = rational system function) error has the same form of the DCACFT error (equation 3.10), is now a function of Am and Bm through equation 5.1, EDCA−CR = −
L N−1 X XX 1 N−1 ˜ k| − 1 log |det G log pi,m , N k=0 N m=0 i=1
(5.2)
where the full expression for pi,m is given below equation 3.8. The DCA-CR learning rules are obtained from the ordinary gradient of equation 5.2. Defining the frequency-domain L × L matrices, ˜ u˜ † )H ˜ †, ˜ k = (I − ψ C k k k
(5.3)
˜ −1 , we have ˜k =G with H k δ A˜ ij,k = −² G˜ ?ij,k
C˜ ij,k , A˜ ?
δ B˜ ij,k = ²
ij,k
C˜ ij,k , A˜ ?
(5.4)
ij,k
where δAm and δBm follow by inverse DFT. The rule for the source parameters ξ i is the same as the one in equation 3.11. The rules in equation 5.4 learn the separating filters Gm (see equation 5.1). It is straightforward to derive analogous rules for the mixing Hm , which may be convenient to use in some situations. With a rational system function parameterization, H˜ ij,k =
B˜ 0ij,k A˜ 0ij,k
P = P
−iωk m B0 me ij,m , −iωk m A0 me ij,m
(5.5)
we have the learning rules δ A˜ 0ij,k = −² G˜ ?ij,k
C˜ 0ij,k A˜ 0? ij,k
,
δ B˜ 0ij,k = ²
C˜ 0ij,k A˜ 0? ij,k
,
(5.6)
˜ 0 in equation 5.6 where δA0m and δB0m follow by inverse DFT. The matrices C k are defined by ˜ u˜ † ), ˜ † (I − ψ ˜0 = G C k k k k
(5.7)
Blind Source Separation and Deconvolution
1403
˜k =H ˜ k in equation 5.5 and should not be confused with equa˜ −1 for H with G k tion 5.1. Note the symmetry between equations 5.3, 5.4 and equations 5.6, 5.7. Whereas we saw before (section 3) that the relative-gradient idea produces more efficient learning rules for the separating than the mixing filters, this is not the case when using the rational system function parameterization, hence the rules in equation 5.6 are comparable to equation 5.4 in efficiency.
6 The Information Connection Bell and Sejnowski’s (1995) ICA algorithm was derived by maximizing the mutual information between the inputs and outputs of a linear network with sigmoidal output units. Specifically, the sensor signals ym are fed as inputs to a single-layer network with an L × L weight matrix G that produces outputs xm = Gym . Those are passed through a sigmoidal function, usually chosen to be f (x) = 1/(1+e−x ), giving zi,m = f (xi,m )+ηi,m , where η m are independent noises. The network weights are then optimized to maximize the mutual information between zm and xm in the zero-noise limit hη 2m i → 0 (assuming the noise has zero mean), resulting in independent signals xm , which are the (scaled and permuted) original sources. In this limit, the mutual information becomes the output entropy. Pearlmutter and Parra (1997) pointed out that maximizing the output entropy is equivalent to minimizing the Kullback-Leibler distance between the observed sensor density and a model sensor density, with the model parameter being the Bell and Sejnowski (1995) weight matrix G and the corresponding source density being the derivative of their sigmoid f (x). In this section we show that the DCA algorithm is also equivalent to an information-maximization formulation involving a linear network with sigmoidal output units. However, here the network weights are dynamic, in the sense that they connect outputs at time tm to inputs as the same but also previous times tn ≤ tm , and the relevant quantity to maximize is the output entropy rate Hpz , which is a spatiotemporal (rather than spatial) quantity. Assume we have a linear spatiotemporal network with weights Gm , which receives the sensor signals ym as inputs and produces outputs ui,m = P jn Gij,n yi,m−n (see equation 3.4). Let um be fed into a sigmoidal function f , producing zi,m = fi,m (ui,m ). We shall now consider the entropy of the outputs zm . However, unlike ICA, which maximizes the joint entropy of all outputs at equal times, here we consider the joint entropy Hpz of all outputs at all time points z0 , . . . , zN−1 . As N → ∞, this quantity (divided by N) approaches the output entropy rate Hpz = limN→∞ Hpz /N (Cover & Thomas 1991).
1404
H. Attias and C. E. Schreiner
Using the relation in equation B.17, we have poy (y) =
N−1 Y
|det Gk |
k=0
L N−1 Y Y
0 | fi,m (ui,m )|pz (z),
(6.1)
i=1 m=0
where y stands for {y0 , . . . , yN−1 } and poy (y) is the joint density of these N × L observed variables; the same holds for z. The entropy rate is then given by Z 1 dz pz (z) log pz (z) Hpz = − N " Z N−1 X 1 o dy py (y) log poy (y) − log |det Gk | =− N k=0 −
L N−1 X X
#
0 log | fi,m (ui,m )|
i=1 m=0
= Hpoy + +
X 1 N−1 log |det Gk | N k=0
L N−1 X 1 X 0 hlog | fi,m (ui,m )i, N i=1 m=0
(6.2)
where um are related by Gm to the observed ym and the average is taken over the latter. We now point out that Hpz = Hpoy − EDCA−CFT (see equation 3.10) if we 0 | with the whitened source density p . Since H o is parameidentify | fi,m i,m py ter independent, minimizing the DCA-CFT error function is equivalent to maximizing the output entropy rate of our linear-sigmoidal network. Similar proofs can be provided for all the other DCA algorithms, for both the instantaneous and convolutive cases, presented in this article. 7 More Sensors Than Sources Blind separation algorithms, including those presented here so far, address the square problem where the number of sensor signals L0 equals the number of sources L. Unlike the nonsquare L0 < L problem, which requires a conceptually different approach, the nonsquare L0 > L problem ought to be solvable using the optimization method underlying DCA. However, our derivation of the DCA learning rules was given for square L × L matrices and cannot formally be extended to the nonsquare case. In principle, one can apply them to the observed L0 sensors and seek an L0 × L0 separating matrix. However, the algorithms may attempt to produce L0 recovered sources by splitting the L original sources apart.
Blind Source Separation and Deconvolution
1405
There is a simple resolution to this problem. Focusing first on the instantaneous case, it relies of the observation that L0 linear mixtures of L < L0 signals cannot be linearly independent and that it is possible to extract from them L mixtures that are linearly independent. Of course, this assumes that the mixing matrix is of rank r = L; in general r ≤ L, but the case r < L corresponds to a situation with effectively fewer mixtures than sources and lies outside the scope of this article. To see how this can be done, assume ym = Hxm for an L0 × L matrix H, and consider the sensor correlation matrix C = hym yTm i. This L0 × L0 matrix has L positive eigenvalues, and the rest vanish. Specifically, we can write PT CP = Λ,
(7.1)
where P is a real orthogonal matrix (P−1 = PT ) containing the eigenvectors of C, and Λ is a diagonal matrix containing its eigenvalues. We now order the columns of P such that the L positive eigenvalues are 3ii for i = 1, . . . , L. Next, we consider the principal components (PCs) of the sensor signals, y0m = Λ−1/2 PT ym ,
(7.2)
where we define (Λ−1/2 )ii = 0 if 3ii = 0, and observe that only the first L of them are nonzero, since hy2i,m i = 0 for i = L + 1, . . . , L0 . Those first L PCs satisfy hyi,m yj,m i = δij , hence are linearly independent and constitute appropriate inputs for instantaneous DCA algorithms. This method is advantageous to simply picking L sensor signals as inputs, since those may not be linearly independent, resulting in a situation with effectively fewer mixtures than sources. The convolutive mixing case can be treated similarly. However, here the use of PCA is not sufficient, since the sensor equal-time correlation matrix C may be of rank r > L even if L0 > L, due to the contribution of delayed versions of the source signals to theP mixtures. Hence, one should consider the cross-correlation matrix Cm = h n yn yTn−m i. Equivalently, we consider ˜ k = hy˜ k y˜ † i, which measures the corits DFT, the cross-spectrum matrix C k relation between the sensors in the frequency domain. For each frequency ωk , this matrix has at most L positive eigenvalues and the rest vanish. In analogy with the instantaneous case above, we can write ˜ k, ˜ k P˜ k = Λ P˜ k† C
(7.3)
˜† where P˜ k is a complex unitary matrix (P˜ −1 k = Pk ) containing the eigenvectors ˜ k is a diagonal matrix containing its eigenvalues. Next, we ˜ k , and Λ of C consider the PCs of the frequency-domain sensor signals at ωk , −1/2
˜ y˜ 0k = Λ k
P˜ k† y˜ k ,
(7.4)
˜ −1/2 )ii = 0 if 3 ˜ ii,k = 0, and observe, as above, that where again we define (Λ k
1406
H. Attias and C. E. Schreiner
only the first L of them P are nonzero. Going back to the time domain, those L new signals y0i,m = j (Pji0 × yj )m , i = 1, . . . , L (where P0m is the inverse DFT of
˜ −1/2 P˜ k ) are convolutive mixtures of the sensor signals and form suitable Λ k inputs to DCA algorithms. Note that not only the equal-time correlations of y0m vanish, but also their 0 i = δ δ . Hence, the filter matrix P0 performs cross-correlations: hy0i,m yj,n ij mn m decorrelation in space and time simultaneously, generalizing the ordinary PCA transformation (see equation 7.1), which performs only spatial decorrelation. In practice, the eigenvalues of the correlation and cross-spectrum matrices seldom actually vanish, due to the presence of noise and finite machine precision. Therefore, as is the case when using SVD, a cut-off has to be determined, based, for example, on the known noise level, and eigenvalues below it are taken to be zero. Finally, we point out that the spatiotemporal dimensional reduction procedure, described above in the frequency domain, can also be performed in the time domain. To do this we return to the cross-correlations and define the L0 N × L0 N matrix C¯ (im)(jn) = hyi,m yj,n i. This matrix has a Toeplitz structure since C¯ (im)(jn) = Cij,m−n /N, with Cm being related to the above crossspectrum matrix via DFT. It can be diagonalized by an orthogonal L0 N ×L0 N ¯ , ordering the columns of P¯ such that all ¯ P¯ = Λ matrix P¯ (im)(jn) to give P¯ T C ¯. ¯ the positive eigenvalues 3(im)(im) are contained in the first LN elements of Λ 0 The spatiotemporally decorrelating filter Pm above is then obtained by considering only the first LN columns of P¯ and exploiting its Toeplitz structure ¯ −1/2 )(im)(im) . P¯(im)(jn) = Pij,m−n /N, as well as normalizing by (Λ 8 Discussion and Conclusion We have presented the DCA family of algorithms for separating instantaneous and convolutive mixtures of independent sources by learning a separating transformation in an unsupervised manner from the sensor highorder spatiotemporal statistics. The DCA approach is based on a generative model of the sensor density in either the time or frequency domains, with the model parameters describing the separating filter matrix, as well as the source densities and autocorrelations. The DCA-I algorithms for instantaneous mixing (section 2) are more powerful than ICA methods, as has been demonstrated by two examples. In both cases, the use of spatiotemporal statistics facilitated by introducing the whitening filters gm gave DCA-I an advantage. In the case of uniformly distributed sources, ICA could have achieved separation by carefully taking into account the source distribution, whereas for colored gaussian sources, information on the source autocorrelations obtained by learning gm is crucial.
Blind Source Separation and Deconvolution
1407
The DCA-C algorithms for convolutive mixing (section 3) are a direct generalization of DCA-I, obtained by replacing the spatiotemporal generative model parameterized by filters gm and matrix G by a matrix of filters Gm . Time-, frequency-, and hybrid frequency/time-domain error functions were derived for both DCA-I and DCA-C, resulting in different learning rules. The time-domain rules were the slowest learners, and the other two were almost comparable, with the hybrid rules being the fastest in both the instantaneous and convolutive cases. The relative-gradient concept, extended from the spatial to the spatiotemporal case and incorporated into DCA in this article (see appendixes A.1 and B.1), is credited with this performance. Note that the frequency-domain DCA-IF,CF and hybrid DCA-IFT,CFT errors include the whitening-separating filters in the frequency domain; it is due to this feature that they benefit from the convolutive relative gradient approach. This improvement is seen particularly clearly by comparing the performance of the relative-gradient DCA-CF rule in equation 3.7 to its ordinary-gradient version (equation B.6) in Figure 8. No optimization of the error minimization process was attempted; the utility of standard methods like conjugate gradients and Newton’s method will be assessed in a subsequent study. Whereas exploiting the relative gradient accelerates convergence of some DCA rules by minimizing both the number of iterations and of floatingpoint operations, the use of FFT accelerates all rules equally by minimizing the number of floating-point operations required to compute the crosscorrelations, which form a necessary ingredient of all DCA algorithms. The rules in the body of the article are usually given in terms of time-domain quantities; their FFT-computable versions appear in the appendixes. Of course, any learning rule derived from either F-, T-, or FT-type error function can be written using either time- or frequency-domain quantities. In the frequency domain the convolutive mixing problem factorizes equation 3.5, and it appears that we are faced with a separate instantaneousmixing problem at each frequency with a complex mixing matrix. This may lead one to expect two difficulties. First, frequency-domain signals are obtained by filtering time-domain ones; based on the central limit theorem, this summation over M time points could produce gaussian frequency-domain densities. However, most naturally occurring signals (e.g., speech, music) are nongaussian and have long-range temporal correlations (see Attias & Schreiner, 1997), resulting in their frequency components also being nongaussian; in fact, their density is well approximated by the exponential family. This does not contradict the central limit theorem since the latter assumes that we sum over temporally independent variables. Second, since the mixing problem is defined only to within a source permutation, we could be facing an arbitrary permutation at each frequency, necessitating the design of clever schemes for putting the separated signals together by, for example, exploiting the correlation between frequency components of a given source. However, DCA avoids this difficulty since the mixing problems at different
1408
H. Attias and C. E. Schreiner
frequencies are not mutually independent, but are solved simultaneously by minimizing an error that couples all of them. This is manifested by the noninvariance of the DCA-C errors under such permutations, as discussed in section 3.4. Since working in the frequency domain is a crucial feature of the DCA approach, it is important to make the following comment. A cross-correlation, P ? u ˜ j,k /N such as (φi × uj )m in equation 3.7, equals its DFT version k eiωk m φ˜ i,k only for periodic signals. Thus, computing the time-domain increments δGm using the DFT version in equation B.9 is an approximation, as can be seen from (φi × uj )m =
N−1−m X n=0
=
φi,n uj,n+m =
X X 1 N−1−m 1 N−1 ? u˜ j,l eiωl m φ˜ i,k ei(ωk +ωl )n N kl=0 N n=0
· ¸ N−1 X X 1 1 − ei(ωk +ωl )m 1 N−1 ? ? u˜ j,k − u˜ j,l , eiωl m φ˜ i,k φ˜ i,k N k=0 N 1 − ei(ωk +ωl ) kl=0
(8.1)
P obtained using equation 1.3 and n ei(ωk +ωl )n /N = δk,−l . The second term on the last line of equation 8.1 is a correction to the DFT version (first term there). However, since the term in the square brackets is of the order of m/N, this correction is negligible for m ¿ N. Next, note from equation B.9 that m is limited by the length M of Gm ; thus, our approximation is valid if we keep M ¿ N so the separating filters are much shorter than the time blocks. In fact, the root of this approximation is in the transition from the time(see equations 2.5 and 3.4) to frequency-domain (see equations 2.6 and 3.5) formulations. The other reason to keep M ¿ N is our very use of N-block processing, rather than processing the full signals, which can be arbitrarily long. The P G y filtering xm = n m−n (see equation 3.4) for 0 ≤ m < M − 1 thus n approximates the ym preceding the current block as zero. Again, for crosscorrelations computed using the resulting xm , this approximation becomes valid when M ¿ N. In practice we found that M/N ∼ 1/6 was usually sufficiently small. Note that M ¿ N is actually a condition on the time-block length, since we must choose the separating filters to be sufficiently long to invert the mixing; hence, M is determined by the problem at hand. This condition can be justified intuitively by observing that in order to recognize ym as a delayed (shifted) version of xm , the shifting involved must be short compared to the signal lengths. We point out that the stability of DCA is ensured by its formulation as an optimization problem using the KL distance, which is bounded from below (Cover & Thomas, 1991), as an error function. Previously proposed separation methods, such as the original H-J network (Jutten & H´erault, 1991) and its extension to convolutive mixing by Platt and Faggin (1992),
Blind Source Separation and Deconvolution
1409
are not derived from an error function, except for special choices of the nonlinearities (Comon, Jutten, & H´erault, 1991; Sorouchyari, 1991; Comon, 1994), and indeed exhibit occasional unstable behavior. Although the formulation of DCA in terms of learning a generative model assumes that the model approximates well the situation at hand, we found that in practice, separation can be achieved even when this condition is relaxed. In particular, the use of whitening filters too short to produce perfect whitening (see, e.g., Figure 2) and nonadaptive model source densities that differ from the actual ones (e.g., using the sigmoid-derivative form for separating speech signals whose density is nearly exponential) did not lead to noticeable degradation in separation quality. Nevertheless, we do expect performance to degrade when the approximation provided by the model is sufficiently far from the actual situation; it is well known, for example, that the use of sigmoid-derivative form for the model source density fails when the sources have negative kurtosis (Bell & Sejnowski, 1995). The analysis needed to determine the necessary and sufficient conditions for the model to achieve separation is quite difficult and lies beyond our scope here. A brief sketch of such an analysis for ICA was outlined by Cardoso (1997). Separation of instantaneous mixtures using spatiotemporal statistics of the second order was suggested by Molgedey and Schuster (1994) and Belouchrani, Abed-Meraim, Cardoso, and Moulines (1997), and of higher orders by Pearlmutter and Parra (1997). None of these methods has a natural generalization to the convolutive case. Note that DCA-I can be made to exploit only second-order statistics by using gaussian model sources: 2 2 P ∝ e−|u˜ i,k | in equation 2.9 and p ∝ e−ui,m in equation 2.13. In this way it essentially uses the full autocorrelations of the sources rather than their values only at a given time lag as do Molgedey and Schuster (1994). However, any algorithm that uses only second-order statistics will be completely unable to separate convolutive mixtures, unless strong constraints are imposed on the mixing filters. For instantaneous mixtures, it may generally be effective, but will fail to separate sources whose autocorrelations are identical; thus, for example, mixtures of white sources can be separated only by exploiting higher orders as in DCA-I. Torkkola (1996) proposed the ordinary-gradient rule corresponding to the frequency-domain version of equation 3.11, and its relative-gradient form was described by Lee, Bell, and Lambert (1997) (see also Lambert, 1996); in the absence of a spatiotemporal/spectral error function, both relied on information-maximization considerations in the frequency domain. A rule similar to equation 3.11 appeared in Cichocki, Amari, and Cao (1996). Methods that use cumulant information in the frequency domain (i.e., polyspectra) were also suggested (Thi & Jutten, 1995; Yellin & Weinstein, 1995) but were restricted to L = 2 sources. Comon (1996) suggested a polyspectrabased optimization formulation but did not derive a separation algorithm. DCA is sufficiently flexible to allow a formulation in which information
1410
H. Attias and C. E. Schreiner
on the mixing process can be exploited when available. As discussed in sections 3.4 and 4, in the absence of such information, the sources can be recovered only with their spectra modified, whereas incorporating such information when available (the semiblind case) facilitates recovering the sources unwhitened. As discussed in section 4, this requires learning the mixing, rather than the separating, filters, using ordinary-gradient rules. Quite general forms of filters, useful since they can model very long filters using relatively few parameters, can also be learned by DCA (section 5). All the DCA algorithms can be viewed as simple networks with two output layers (the whitened sources and modified whitened sources in DCAC; three layers in other versions), sensor signals as inputs, and separating (or mixing) filters as weights. The weight increments are determined by cross-correlating the different outputs across layers and with the weights. Convergence is achieved when the cross-correlations between different outputs vanish, producing separation. Note that the cross-correlated signals are whitened sources and a nonlinear modification thereof; hence, high-order sensor statistics are used to achieve separation. This network was shown in section 6 to maximize the information rate between its inputs and outputs. Algorithms that solve the problem of blind source separation address, in fact, the more general need for an efficient tool for statistical analysis of spatiotemporal data sets, e.g., biomedical multisensor recordings such as EEG (Makeig, Bell, Jung, & Sejnowski, 1996; Makeig, Bell, Jung, Ghahremani, & Sejnowski, 1997) and MEG (Poeppel, Attias, Rowley, & Schreiner, 1997). The DCA separating filters produce simultaneous spatial and temporal redundancy reduction. Combined with dimensionality reduction as described in section 7, the resulting dynamic components of a given data set may have a natural interpretation in terms of its generating mechanisms. These algorithms may also shed light on the methods employed by the nervous system to process data from its various receptor arrays, in accord with Barlow’s (1989) suggestion of redundancy reduction as an important goal of sensory processing. This idea has been formulated as a quantitative theory for the subcortical visual system in Atick and Redlich (1990), Atick (1992), and Dong and Atick (1995). Bell and Sejnowski (1996) construct a computational model of primary visual cortex cells based on ICA. Hopfield (1991) applied Jutten and H´erault’s (1991) ideas to explain odor discrimination. Furthermore, the equivalent information-rate maximization formulation of DCA given in section 6 connects this approach to a recent line of physiological experiments (see, e.g., Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Attias & Schreiner, 1998) suggesting that the nervous system is designed to maximize the rate at which spike trains carry information about the stimuli. Our results here motivate us to hypothesize that the nervous system performs dynamic component analysis of its inputs from various sensory modalities. Future work will explore the consequences of this hypothesis for spa-
Blind Source Separation and Deconvolution
1411
tiotemporal characteristics of neural filter properties. On the computational front, progress in designing efficient methods to separate noisy, time-dependent, and nonsquare mixtures will be necessary in order to mount a fresh attack on the notorious cocktail party separation problem (Bregman 1990), which is routinely and successfully confronted by the human brain. Appendix A Rules
Instantaneous Mixing: Error Functions and Learning
A.1 DCA-IF. Here we provide a somewhat detailed derivation of the DCA-IF error function (equation 2.8) and learning rules (equation 2.10). The relation between the whitened sources u˜ k , sources x˜ k , and sensors y˜ k , given P by u˜ i,k = g˜ i,k x˜ i,k = g˜ i,k j Gij y˜j,k (see equation 2.6), leads to the following relation between their model densities: ˜ = |det G|N px˜ (x) ˜ = |det G|N py˜ (y)
L N−1 Y Y i=1 k=0
| g˜ i,k |
N/2 Y
Pi,k (u˜ i,k ),
(A.1)
k=0
where equation 2.7 has been used. The symbol y˜ stands for {y˜ 0 , . . . , y˜ N/2 }, ˜ is the joint density of these N × L variables (note from the DFT and py˜ (y) ˜ The subdefinition in equation 1.3 that Im y˜ 0 = Im y˜ N/2 = 0); similarly for x. script k in the last product runs from 0 only to N/2 since, from equation 1.3, the frequency components satisfy u˜ N−k = u˜ ?k . Note from equation A.1 that the model sensor signals ym are generally not white and their density is not factorial. The relation in equation A.1 is a corollary of the transformation rule of a probability density under a linear transformation of variables, with some care taken due to both the transformation and the variables’ being complex. The first equality is obtained by noting that for each k 6= 0, N/2, G operates on the real and imaginary parts of y˜ k = y˜ 0k +iy˜ 00k separately, and thus x˜ 0k = Gy˜ 0k and x˜ 00k = Gy˜ 00k each contribute |det G| to the transformation py˜ → px˜ ; for k = 0, N/2 only the real parts contribute. The second equality follows from writing u˜ i,k = g˜ i,k x˜ i,k in a matrix form, µ
u˜ 0i,k u˜ 00i,k
¶
µ =
g˜ 0i,k g˜ 00i,k
− g˜ 00i,k g˜ 0i,k
¶µ
x˜ 0i,k x˜ 00i,k
¶ ,
(A.2)
which shows that for each i and k 6= 0, N/2, this change of variables contributes | g˜ i,k |2 , the determinant of the matrix in equation A.2, to the transformation px˜ → pu˜ ; for k = 0, N/2 the contribution is just | g˜ i,k |. Finally, to extend the sum up to k = N − 1 we use g˜ i,N−k = g˜ ?i,k . The separation parameters, which consist of the separating matrix G, whitening filters gm , and source density parameters ξ i , should now be optimized to minimize the distance between the model py˜ and the observed poy˜
1412
H. Attias and C. E. Schreiner
sensor densities. To do this we choose our error function to be the KullbackLeibler distance, which is bounded from below (Cover & Thomas, 1991) and is given by E(poy˜ , py˜ ) = =
1 N 1 N
Z ˜ log dy˜ poy˜ (y) Z
==−
˜ poy˜ (y) ˜ py˜ (y)
˜ log poy˜ (y) ˜ − dy˜ poy˜ (y)
1 N
1 1 ˜ po . Hpo − hlog py˜ (y)i y˜ N y˜ N
Z ˜ log py˜ (y) ˜ dy˜ poy˜ (y) (A.3)
We divide by N in equation A.3 to prevent divergence as N → ∞. Both terms on the last line have simple interpretations. Hpoy˜ is the entropy of the observed frequency-domain sensor signals {y˜ 0 , . . . , y˜ N/2 }. For stationary signals, the limit limN→∞ Hpoy˜ /N = Hpoy˜ exists and is termed the entropy rate of the sensors. The second term is the log-likelihood of the data given our model. Since the sensor entropy rate is independent of the separation parameters, minimizing the KL distance is equivalent to maximizing the log-likelihood of the data with respect to those parameters. Substituting equation A.1 in A.3 and dropping terms independent of the separation parameters, we obtain the error function in equation 2.8. Of course, in the time domain we can similarly obtain E(poy , py ) = −
1 1 Hpo − hlog py (y)ipoy , N y N
(A.4)
where y = {y0 , . . . , yN−1 } and py is the joint spatiotemporal density of the sensors. A.1.1 Ordinary Gradient Descent. In order to derive the learning rules, we must differentiate E with respect to the separation parameters. This is facilitated by the following identities: ∂ log |det G| = (G−1 )ji , ∂Gij ∂ u˜ i,k = g˜ j,k y˜ l,k δij , ∂Gjl ∂ u˜ i,k = e−iωk m x˜j,k δij , ∂gj,m à ! eiωk m ∂ log | g˜ i,k | = Re δij , ? ∂gj,m g˜ j,k
(A.5)
Blind Source Separation and Deconvolution
1413
where ? denotes complex conjugation. The first identity can be derived using log |det G| = 12 log det G2 = 12 Tr log G2 and the eigenvalue decomposition of G. The second and third identities result from equation 2.6 and the last one from the DFT definition in equation 1.3. The gradient-descent learning rules can now be obtained using the chain rule together with equation A.5: δGij = −²
δgi,m
X ∂E 1 N−1 ? = ²(G−1 )Tij − ² , g˜ ? φ˜ i,k y˜j,k ∂Gij N k=0 i,k
X ∂E 1 N−1 = −² =² eiωk m ∂gi,m N k=0
δ ξ i = −²
Ã
! 1 ? ˜ − φi,k x˜ i,k , g˜ ?i,k
N/2 ∂E 1 X ∂ log Pi,k =² . ∂ ξi N k=0 ∂ ξi
(A.6)
The first two rules in equation A.6 make use of the modified whitened ˜ , which are defined in the frequency domain in terms of u˜ k and sources φ k their density Pi,k (u˜ i,k ) (see equation 2.9) by φ˜ i,k = −αk
∂Pi,k ∂Pi,k − iαk 00 , u˜ 0i,k u˜ i,k
αk =
1 (1 + δk,0 + δk,N/2 ), 2
(A.7)
where u˜ i,k = u˜ 0i,k + iu˜ 00i,k . The αk appear essentially to compensate for u˜ 00i,k = 0 at k = 0, N/2. The learning rules in equation A.6 can also be given in terms of timedomain signals, by exploiting their inverse DFT form (i.e., a time-domain quantity given by a sum over all frequency components k = 0, . . . , N − 1; see equation 1.3): δG = ²(GT )−1 − ²
X
g
Dm (φ × yT )−m ,
m
δgi,m =
²g0i,m
− ²(φi × xi )−m ,
(A.8)
where the signals φi,m and filters g0i,m are the time-domain counterparts of φi,k g (see equation A.7) and g˜ 0i,k = 1/ g˜ ?i,k , and Dm is a diagonal matrix containing gm , as defined in equation P 1.4. In component notation, the rule for G is given by δGij = ²(G−1 )ji − ² mn gi,m φi,n yj,n−m . A.1.2 Relative Gradient Descent. The learning rules for G in equations A.6 and A.8 are expensive since they require matrix inversion at each iteration. This can be avoided by using rules based on the relative gradient of the error function rather than the ordinary gradient. The concept of the relative
1414
H. Attias and C. E. Schreiner
gradient was first introduced in Cardoso and Laheld (1996) and Amari et al. (1996) in the context of algorithms for separating instantaneous mixtures using equal-time statistics and is incorporated into DCA in the following. We first define the relative gradient of E with respect to G, denoted ∇G E, in terms of the ordinary gradient ∂E/∂G, by ∇G E =
∂E T G G. ∂G
(A.9)
In the previous section we used the well-known fact that incrementing G in each iteration by δG = −²∂E/∂G produces a nonpositive change in E, as long as ² is sufficiently small, to derive ordinary-gradient learning rules for G. Next we show that incrementing G by δ R G = −²∇G E has precisely the same effect. Indeed, ∂E R T ∂E δ G = −² Tr (∇G E)T ∂G ∂G µ ¶µ ¶ ∂E T ∂E T T G G = −² Tr ≤ 0. ∂G ∂G
δE = E(G + δ R G) − E(G) = Tr
(A.10)
To prove the second equality, we use component notation to write E(G + P δ R G) = E(G) + ij ∂E/∂Gij δ R Gij , which is true to order ² 2 . The fourth equality is obtained using equation A.9. The lastP inequality can be verified by noting that any matrix A satisfies Tr AAT = ij A2ij ≥ 0. The advantage of the relative gradient goes beyond the fact that the resulting learning rules avoid matrix inversion. These algorithms were shown by Cardoso and Laheld (1996), in a proof that is readily extendable to the spatiotemporal formulation employed here, to have the property of equivariance: the performance of the source estimator (i.e., the learned separating matrix) is independent of the actual mixing matrix (in fact, their definition is different from equation A.9 but the resulting increments are identical). The relative gradient can be extended to the convolutive mixing case where G becomes a matrix of filters Gm . This is done in Appendix B. Applying this extension to the instantaneous case, we now define the relative gradient of our error function with respect to the separating filters gi,m . The relation between δgi,m = −²∂E/∂gi,m and δ R gi,m = −²∇gi,m E, the ordinaryand relative-gradient increments, is defined by an excursion to the frequency domain by δ R g˜ i,k = δ g˜ i,k | g˜ i,k |2 , where δ R g˜ i,k and δ g˜ i,k are the DFT of δ R gi,m and δgi,m . It follows that ¶ X µ ∂E × gi gi,n+m . (A.11) ∇gi,m E = ∂gi n n That incrementing gi,m using δ R gi,m produces a nonpositive change in E is a corollary of equation B.8, a convolutive-mixing analog of the proof in equation A.10.
Blind Source Separation and Deconvolution
1415
Using equations A.9 and A.11, the relative-gradient learning rules for G and gi,m can easily be obtained from the rules in equation A.6:
δ R G = −²∇G E = ²G − ²
δ R gi,m = −²∇gi,m E = ²
X g˜ ? 1 N−1 ˜ x˜ † G, D φ N k=0 k k k
X 1 N−1 eiωk m ( g˜ i,k − φ˜ i,k u˜ ?i,k g˜ i,k ), N k=0
(A.12)
where equation 2.6 has been used. The time-domain version of these rules (see equation 2.10) follows from DFT relation (equation 1.3). A.2 DCA-IT. In this section we derive the DCA-IT error (equation 2.12) and learning rules (equation 2.14). As in the frequency-domain case, the relation of equation 2.2 between the sensors ym , sources xm , and whitened sources um leads to a relation between their model densities:
py (y) = |det G|N px (x) = |det G|N
L Y i=1
|gi,0 |N
N−1 Y
pi,m (ui,m ),
(A.13)
m=0
where the factorial form for pu (equation 2.11) has been used. y stands for {y0 , . . . , yN−1 }, and py (y) is the joint density of these N × L variables; the same holds for x. The first equality in equation A.13 follows from the fact that for each time point m, the change of variables xm = Gym contributes |det G| to the transformation P py → px . To derive the second equality, we write ui,m = (gi ∗ xi )m = n gi,n xi,m−n in a matrix form,
xi,0 xi,1 . xi,N−1
=
gi,0 gi,1 .
0 gi,0
. .
gi,N−1
gi,N−2
.
0 yi,0 yi,1 0 . gi,0 yi,N−1
,
(A.14)
and notice that the determinant of the N × N matrix is (gi,0 )N . The error function in equation 2.12 is now obtained by substituting py (equation A.3) in the KL distance (equation A.4) and omitting terms independent of the separation parameters. A.2.1 Ordinary Gradient Descent. The gradient of E with respect to the separation parameters can be computed with the help of the first identity
1416
H. Attias and C. E. Schreiner
in equation A.5 and the following two identities, m X ∂ui,m = δij gj,n yl,m−n , ∂Gjl n=0
∂ui,m = δij xj,m−n , ∂gj,n
(A.15)
obtained from equation 2.5. We then have X g Dm (ψ × yT )−m , δG = ²(GT )−1 − ² m
δgi,m = ² δξi = ²
1 δm,0 − ²(ψi × xi )−m , gi,0 X ∂ log pi,m 1 N−1 , N m=0 ∂ ξi
(A.16)
where the modified whitened sources ψ m are defined by ψi,m = −
1 ∂ log pi,m , N ∂ui,m
(A.17)
g
and the diagonal matrix Dm is given in equation 1.4. Note that (ψ × yT )−m is a L×L matrix whose ij element is (ψi ×yj )−m (see equation 1.6). In component P notation, Gij should be incremented by δGij = ²(G)ji−1 − ² mn gi,m ψi,n yj,n−m . A.2.2 Relative Gradient Descent. The relative-gradient rules for G and gm are obtained by considering the frequency-domain version of the rules in equation A.16 and using equations A.9 and A.11. This yields δ R G = ²G − ²
R
δ gi,m
X g? 1 N−1 ˜ x˜ † G, D ψ N k=0 k k k
µ ¶ X ˜ i,k |2 1 N−1 iωk m | g ? ˜ ˜ ˜ =² e − ψi,k ui,k gi,k . N k=0 gi,0
(A.18)
The time-domain version of the rule for G is included in the DCA-IT rules (equation 2.14). As for gm , however, the relative-gradient concept produces a more complicated rule, having the time-domain form δ R gi,m = ²
X 1 (gi × gi )m − ² (ψi × ui )n gi,n+m , gi,0 n
while offering no advantage in this case.
(A.19)
Blind Source Separation and Deconvolution
1417
A.3 DCA-IFT. The frequency-domain relation in equation A.1 between py˜ and pu˜ yields a similar time-domain relation, py (y) = |det G|N
L N−1 Y Y
| g˜ i,k |
i=1 k=0
N−1 Y
pi,m (ui,m ),
(A.20)
m=0
˜ (and similarly for pu (u)) obby exploiting the identity py (y) = NN/2 py˜ (y) tained from the DFT definition in equation 1.3. Using equation A.20 in the KL distance (equation A.4) produces the error function in equation 2.15. The resulting ordinary-gradient learning rules consist of the rule for G in equation A.16 and the rule for gm in equation A.6, with φm replaced by ψ m in equation A.17. The relative-gradient rules in equation 2.16 are then obtained along the lines described above. Appendix B
Convolutive Mixing: Error Functions and Learning Rules
B.1 DCA-CF. The derivation of the model sensor density py˜ for convolutive mixing is analogous to the instantaneous case (section A.1). Starting from the factorial whitened source density in equation 2.7, the linear relation P u˜ i,k = j G˜ ij,k y˜j,k in equation 3.5 leads to ˜ = py˜ (y)
N−1 Y
˜ k| |det G
N/2 L YY
Pi,k (u˜ i,k ),
(B.1)
k=0 i=1
k=0
˜ is the joint density of these variables. where y˜ = {y˜ 0 , . . . , y˜ N/2 } and py˜ (y) P To derive equation B.1, we write u˜ i,k = j G˜ ij,k y˜j,k in a matrix form, µ
u˜ 0k u˜ 00k
¶
µ =
˜0 G k ˜ G00k
˜ 00 −G k ˜0 G k
¶µ
y˜ 0k y˜ 00k
¶ (B.2)
¯ k, with u˜ k = u˜ 0k + iu˜ 00k , and observe that, denoting the 2L × 2L matrix by G ¯ each ωk for k 6= 0, N/2 contributes a factor of |det Gk | to the transformation ¯ k = |det G ˜ k |2 , assume G ˜ k has an pu˜ → py˜ in equation B.1. To see that det G ˜ † has an eigenvalue eigenvalue λ associated with an eigenvector v. Then G k
¯ k has two λ? , associated with a (left) eigenvector v† . Correspondingly, G eigenvalues λ, λ? , associated with the eigenvectors (ivT , vT ) and (−iv† , v† ), Q 2 ˜ k det G ˜† = ¯ k = det G respectively. Thus, we have det G λ |λ| . Finally, for k k = 0, N/2 the imaginary parts in equation B.2 vanish and the relevant ˜ k |. contribution is |det G The error function in equation 3.6 follows by substituting py˜ (in equation B.1) in the general expression for the KL distance E(poy˜ , py˜ ) in equation A.3. To obtain the learning rules for the separating filters Gm and the
1418
H. Attias and C. E. Schreiner
source parameters ξ i , we use the following identities: h i ˜ k| ∂ log |det G ˜ −1 )† , = Re eiωk m (G ij k Gij,m ∂ u˜ i,k = e−iωk m y˜ l,k δij . Gjl,m
(B.3)
The resulting ordinary-gradient learning rules in the frequency domain are ˜ y˜ † , ˜ † )−1 − ² φ ˜ k = ²(G δG k k k
(B.4)
˜ k is the DFT of the ordinary-gradient increment δGm , where δ G δGm = −²
X ∂E 1 N−1 ˜ k. = eiωk m δ G ∂Gm N k=0
(B.5)
˜ k since E is ˜ k = −²∂E/∂ G Note that it cannot be obtained directly by δ G ˜ ˜ not analytic in Gk . The modified whitened sources φk are defined in equation A.7. The time-domain version of the rule in equation B.4 is given by δGTm = ²G0m − ²(y × φT )−m ,
(B.6)
where φi,m are the time-domain counterparts of φ˜ i,k in equation B.4 obtained by inverse DFT, and G0m is the impulse response of the filter matrix whose ˜ † )−1 (compare to g0i,m in equation A.8). frequency response is (G k The rule in equation B.6 is inefficient in requiring the inversion of the ˜ k for all frequencies ωk , k = 0, . . . , N/2 at each iteracomplex L × L matrix G tion. This can be avoided by extending the concept of the relative gradient (Cardoso & Laheld, 1996; Amari et al., 1996), introduced for instantaneous mixtures in appendix A (see equation A.9), to the convolutive mixing case. We denote the relative gradient of E with respect to Gm by ∇Gm E, and the relative-gradient increment of Gm by δ R Gm = −²∇Gm E. Next, we define it in terms of the ordinary-gradient increment δGm = −²∂E/∂Gm by the ˜ kG ˜ k = δG ˜ †G ˜ frequency-domain relation δ R G k k , resulting in ∇Gm E =
X µ ∂E n
∂G
¶ T
×G
Gn+m , n
where T denotes transposition (compare with equation A.9).
(B.7)
Blind Source Separation and Deconvolution
1419
It is left to show that incrementing Gm by δ R Gm produces a nonpositive change in E, generalizing the proof (see equation A.10) from the instantaneous case. Indeed, for sufficiently small ², δE = E(Gm + δ R Gm ) − E(Gm ) =
X m
= −²
X
Tr
∂E (∇Gm E)T ∂Gm
Tr
∂E T G Gn+l ∂Gm l+m
m
= −²
X mnl
= −²
X l
µ Tr
∂E × GT ∂G
¶ µ l
µ
∂E ∂Gn
Tr
∂E R (δ Gm )T ∂Gm
¶T
∂E × GT ∂G
¶T ≤ 0,
(B.8)
l
P where we used δ R Gm = nl δGn GTn+l Gl+m (see equation B.7) to prove the third equality in equation B.8. The relative-gradient learning rule is now obtained from equation B.4 by ˜ ˜ †G multiplying both sides by G k k: ˜ u˜ † G ˜ k = ²G ˜ ˜ k − ²φ δRG k k k,
(B.9)
where equation 3.5 was used. As in the instantaneous case (see section A.1.2), the relative gradient benefits us beyond avoiding matrix inversion. In fact, the proof of Cardoso and Laheld (1996) can be extended to the convolutive case to show that relativegradient algorithms have the equivariance property, that is, the quality of the obtained source estimates xm = (G ? y)m is independent of the actual mixing process. This is a desired feature since it implies that the performance of our separation method is uniform across the space of invertible mixing processes and is not affected by, say, the closeness of the mixing to being singular. Of course, this holds only in the absence of noise. B.2 DCA-CT. Using the factorial whitened source density model (equaP tion 2.11) and the linear relation ui,m = jn Gij,n yj,m−n (equation 3.4), we get the model sensor density
py (y) = |det G0 |N
L N−1 Y Y i=1 m=0
pi,m (ui,m ),
(B.10)
1420
H. Attias and C. E. Schreiner
where y = {y0 , . . . , yN−1 }. To derive equation B.10, we write equation 3.4 in a matrix form, 0 . 0 G0 y0 u0 u 1 G1 G0 . 0 y1 , (B.11) . . = . uN−1 yN−1 GN−1 GN−2 . G0 and observe that the determinant of this LN × LN matrix is (det G0 )N . The error function corresponding to equation B.10 is equation 3.8, obtained from the KL distance in equation A.4 after terms independent of the separation parameters are dropped. To derive the learning rule for the separation filters Gm , we use the first identity in equation A.5, together with ∂ui,m = δij yl,m−n . ∂Gjl,n
(B.12)
The resulting rule is δGm = −²
∂E T T = ²(G−1 0 ) δm,0 − ²(ψ × y )−m , ∂Gm
(B.13)
where ψ m is defined by equation A.17. Note that (ψ × yT )−m P is an L × L matrix whose ij element is the input-output cross-correlation n ψi,n yj,n−m . The inversion of G0 at each iteration as required by equation B.13 can be avoided by resorting once again to the relative-gradient idea, this time only with respect to G0 . Replacing ∂E/∂G0 by ∇G0 E, defined by equation A.9, we obtain the first rule in equation 3.9. Note that we have not used the convolutive relative gradient (equation B.7), which produced an efficient learning rule in the frequency-domain case. In the time-domain case, in contrast, this approach produces a rule that is more complicated than equation B.13, while not avoiding the matrix inversion. To see that, consider the frequency-domain version of equation B.13, ˜k = δG
N−1 X
e−iωk m δGm = ²(G0 )−1 − ²
m=0
1 ˜ † ψ y˜ . N k k
(B.14)
The corresponding relative-gradient rule is obtained using equation B.7, ˜ ˜ †G ˜ k = ²(G0 )−1 G ˜ †G ˜ ˜ δRG k k − ² ψk u k k, leading to the time-domain form δ R Gm = ²(G0 )−1 (GT × G)m − ²
X (ψ × uT )n Gn+m n
(compare to equations 3.9 and B.13).
(B.15)
(B.16)
Blind Source Separation and Deconvolution
1421
B.3 DCA-CFT. The relation in equation B.1 between the frequency-domain ˜ and pu˜ (u) ˜ leads to a similar relation in the time domain: densities py˜ (y) py (y) =
N−1 Y
˜ k| |det G
k=0
L N−1 Y Y
pi,m (ui,m ),
(B.17)
i=1 m=0
˜ and its analog for pu (u). Substituting equation using py (y) = NN/2 py˜ (y) B.17 in the KL distance (equation A.4) results in the error function of equation 3.10. The corresponding learning rules for Gm have the same forms as the ordinary- of equation B.4 and relative-gradient (equation B.9) frequencydomain rules, but with φm (equation A.7) replaced by ψ m (equation A.17). The relative-gradient rule is given in equation 3.11. Appendix C
Learning Rules for the Whitened Source Densities
For the time-domain whitened source density pi,m (equation 2.13), we use a 2 : mixture of K gaussians with weights wi,α , means µi,α , and variances σi,α pi,m = p(ui,m , ξ i ) =
K X
wi,α gi,α (ui,m ) =
α=1
K X α=1
q
wi,α 2 2π σi,α
−
e
(ui,m −µi,α )2 2 2σi,α
.
(C.1)
P P To satisfy α wi,α = 1 we write wi,α = eγi,α / β γi,β . The parameter vector is then ξ i = {γi,α , µi,α , σi,α }. The learning rules for ξ i use ∂ log pi,m /∂ ξ i , given by ∂ log pi,m = pi,α − wi,α , ∂γi,α ui,m − µi,α ∂ log pi,m = pi,α , 2 ∂µi,α σi,α " # 1 (ui,m − µi,α )2 ∂ log pi,m = pi,α − , 2 ∂σi,α σi,α σi,α
(C.2)
where pi,α =
wi,α gi,α pi,m
(C.3)
(see equation C.1). For the frequency-domain whitened source density Pi,k (equation 2.9), we use a product of gaussian mixtures for the real and imaginary parts, P(u˜ i,k ) = p(u˜ 0i,k )p(u˜ 00i,k ), using p of equation C.1.
1422
H. Attias and C. E. Schreiner
Acknowledgments We thank A. Bell, B. Bonham, J.-F. Cardoso, M. Kvale, K. Miller, S. Nagarajan, and V. de Sa for helpful discussions during the course of this work and useful comments on a previous version of this article. We also thank the anonymous referees for numerous suggestions that greatly improved the quality and clarity of the manuscript. This work was supported by the Office of Naval Research (N00014-94-1-0547), the Sloan Foundation, and NIDCD 2260. References Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8, 757–763. Cambridge, MA: MIT Press. Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251. Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Computation, 2, 308–320. Attias, H., & Schreiner, C. E. (1997). Low-order temporal statistics of natural sounds. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9, 27–33. Cambridge, MA: MIT Press. Attias, H., & Schreiner, C. E. (1998). Coding of naturalistic stimuli by auditory midbrain neurons. In M. I. Jordan, M. J. Kearns, S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press. Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311. Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159. Bell, A. J., & Sejnowski, T. J. (1996). Edges are the independent components of natural scenes. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9, 831–837. Cambridge, MA: MIT Press. Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing, 45, 434–444. Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading the neural code. Science, 252, 1854–1857. Bregman, A. S. (1990). Auditory scene analysis. Cambridge, MA: MIT Press. Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112–114. Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44, 3017–3030. Cichocki, A., Amari, S.-I., & Cao, J. (1996). Blind separation of delayed and convolved signals with self-adaptive learning rate. In Proceedings of the International Symposium on Non-linear Theory and Applications. Kochi, Japan. Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.
Blind Source Separation and Deconvolution
1423
Comon, P. (1996). Contrasts for multichannel blind deconvolution. IEEE Signal Processing Letters, 3, 209–211. Comon, P., Jutten, C., & H´erault, J. (1991). Blind separation of sources, Part II: Problem statement. Signal Processing, 24, 11–20. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley. Dong, D. W., & Atick, J. J. (1995). Temporal decorrelation: A theory of lagged and nonlagged responses in the lateral geniculate nucleus. Network, 6, 159–178. Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman and Hall. Hopfield, J. J. (1991). Olfactory computation and object perception. Proceedings of the National Academy of Sciences, 88, 6462–6466. Hyv¨arinen, A. & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492. Jutten, C., & H´erault, J. (1991). Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10. Lambert, R. (1996). Multichannel blind deconvolution: FIR matrix algebra and separation of multipath mixtures. Unpublished doctoral dissertation, University of Southern California. Lee, T.-W., Bell, A. J., & Lambert, R. (1997). Blind separation of delayed and convolved sources. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9, 758–764. Cambridge, MA: MIT Press. Makeig, S., Bell, A. J., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8, 145–151. Cambridge, MA: MIT Press. Makeig, S., Bell, A. J., Jung, T.-P., Ghahremani, D., & Sejnowski, T. J. (1997). Blind separation of auditory event-related brain responses into independent components. Proceedings of the National Academy of Sciences USA, 94, 10979–10984. Molgedey, L., & Schuster, H. J. (1994). Separation of independent signals using time-delayed correlations. Physical Review Letters, 72, 3634–3637. Oppenheim, A. V., & Schafer, R. W. (1989). Discrete-time signal processing. Englewood Cliffs, NJ: Prentice-Hall. Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9, 613–619. Cambridge, MA: MIT Press. Pham, D. T. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Transactions on Signal Processing, 44, 2768–2779. Platt, J. C., & Faggin, F. (1992). Networks for the separation of sources that are superimposed and delayed. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4, 730–737. San Mateo, CA: Morgan-Kaufmann. Poeppel, D., Attias, H., Rowley, H. A., & Schreiner, C. E. (1997). Dynamic component analysis of auditory evoked neuromagnetic fields. Society for Neuroscience Abstracts, 23.
1424
H. Attias and C. E. Schreiner
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: exploring the neural code. Cambridge, MA: MIT Press. Sorouchyari, E. (1991). Blind separation of sources, Part III: Stability analysis. Signal Processing, 24, 21–30. Thi, H.-L. N., & Jutten, C. (1995). Blind source separation for convolutive mixtures. Signal Processing, 45, 209–229. Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural networks for signal processing VI. New York: IEEE. Yellin, D., & Weinstein, E. (1995). Criteria for multichannel signal separation. IEEE Transactions on Signal Processing, 42, 2158–2168. Received January 7, 1997; accepted January 5, 1998.
NOTE
Communicated by David Wolpert
Bias/Variance Decompositions for Likelihood-Based Estimators Tom Heskes Foundation for Neural Networks, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
The bias/variance decomposition of mean-squared error is well understood and relatively straightforward. In this note, a similar simple decomposition is derived, valid for any kind of error measure that, when using the appropriate probability model, can be derived from a KullbackLeibler divergence or log-likelihood.
1 Introduction Finding bias/variance decompositions for all kinds of error measures or loss functions is an active area of research. The decomposition for meansquared error is well known and easily derived (see, e.g., Geman, Bienenstock, & Doursat, 1992). Recently several suggestions have been made for other loss functions, such as zero-one loss (see Breiman, 1996; Dietterich & Bakiri, 1995; Friedman, 1996; James & Hastie, 1997; Kohavi & Wolpert, 1996; Tibshirani, 1996; Wolpert, 1997 and references therein). The generalization of the decomposition for mean-squared error to a decomposition for zeroone loss depends on one’s definition of desirable properties for the bias and the variance term. In this note, we will follow the requirements and definitions stated in James and Hastie (1997). Applying these definitions to the Kullback-Leibler divergence, we will arrive at a simple generalization of the decomposition for mean-squared error.
2 Theory Let Y be a random variable, which may be either discrete or continuous. We will proceed as if Y is continuous, where the discrete case follows immediately if one replaces integrals by summations and probability densities by probability distributions. q(y) is defined as the target probability denˆ sity function that Y = y; p(y) is an estimator of this density. For example, ˆ may correspond to a probability statement derived from the output of p(y) a neural network (see the examples below). We have a (possibly infinite) ensemble of such estimators. Expectation with respect to this ensemble is c 1998 Massachusetts Institute of Technology Neural Computation 10, 1425–1433 (1998) °
1426
Tom Heskes
indicated by the operator E. We use the Kullback-Leibler divergence, ·
Z ˆ ≡ K(q, p)
dy q(y) log
¸ q(y) , ˆ p(y)
(2.1)
ˆ and q(y). The goal is to find to measure the distance between densities p(y) ˆ in a bias and variance term. a decomposition of the error EK(q, p) In the usual setting, the ensemble consists of models obtained through application of a learning algorithm on different training sets, generated from the same problem domain. In a decomposition of the average error of these models, the bias is supposed to measure how closely the learning algorithm’s average guess matches the target and the variance how much the learning algorithm’s guess bounces around for different training sets (Kohavi & Wolpert, 1996). Modifications on a learning algorithm tend to have an opposite effect on the bias and the variance: an increase in the number of degrees of freedom usually leads to a smaller bias and a higher variance. Note that contrary to most other articles on bias/variance decompositions, we do not write the loss function as a direct measure of the distance between a model’s output and a target. Instead, we first translate the model’s output to a probability statement and then define the loss function as the Kullback-Leibler divergence between this probability statement and a target probability (see below for the straightforward generalization to the case where the target probability q(y) is unknown and only a realization Y = t is provided). A similar approach is pursued in Wolpert (1997). As suggested in James and Hastie (1997), we start the decomposition by defining the variance as the smallest average distance—in this case, the ˆ smallest average Kullback-Leibler divergence—between an estimator p(y) ¯ and some average model p(y). The asymmetry of the Kullback-Leibler diˆ in the role vergence forces us to be more precise. We keep the densities p(y) of estimators and define the average model as the target density that leads to the smallest possible Kullback-Leibler divergence between the target and the estimators: Variance = R min a:
ˆ = EK(p, ¯ p) ˆ . EK(a, p)
(2.2)
dy a(y)=1
R Introducing a Lagrange multiplier for the constraint dya(y) = 1 and taking the functional derivative to a(y), we easily obtain for the average model ¯ p(y) =
1 ˆ exp[E log p(y)] , Z
(2.3)
with Z a normalization constant independent of y. In other words, the av¯ ˆ erage model p(y) is a (normalized) geometric mean of the densities p(y) rather than an arithmetic mean, as, for example, proposed in Hall (1987)
Bias/Variance Decompositions
1427
and Wolpert (1997). In the literature on combining experts’ probability statements, equation 2.3 is called a logarithmic opinion pool (see, e.g., Bordley, 1982; Genest & Zidek, 1986; Heskes, 1998; or Jacobs, 1995, for the similar but somewhat more involved supra-Bayesian techniques). A disadvantage of the logarithmic opinion pool is that if any of the experts assigns probability zero to a particular outcome, the average model assigns probability zero, independent of what the other experts claim. This property of the logarithmic opinion pool, however, is perfectly consistent with a Bayesian point of ˆ are not carefully estimated view1 and is a drawback only if the densities p(y) (Bordley, 1982). ¯ between the average model The bias is defined as the distance K(q, p) and the target distribution. Substituting equation 2.3 into 2.1, we obtain ¯ = EK(q, p) ˆ + log Z. Bias = K(q, p) Using equation 2.3, the second term on the right-hand side can be transformed into ¸ ˆ exp[E log p(y)] ∀y:p(y)>0 log Z = log ¯ ¯ p(y) ¶¸ ·Z µ ˆ p(y) ¯ p) ˆ = − variance , ¯ log = −EK(p, =E dy p(y) ¯ p(y) ·
with the variance defined in equation 2.2. Rearrangement of terms then gives the desired decomposition: ˆ = K(q, p) ¯ + EK(p, ¯ p) ˆ = bias + variance . Error = EK(q, p)
(2.4)
Other bias/variance decompositions often include a term measuring the intrinsic noise, which is a lower bound on the error that can be obtained by any learning algorithm. A learning algorithm that reproduces the probability distribution q(y) has Kullback-Leibler divergence equal to zero. This explains why there is no intrinsic noise term in equation 2.4. Equation 2.4 gives a decomposition for the Kullback-Leibler divergence between probability densities.2 Now suppose that we do not know the complete target distribution q(y), but have only a particular observation Y = t. 1 Whenever a Bayesian assigns probability zero or one, all further discussion is closed. No amount of new information can ever change his mind (Bordley, 1982). 2 If we define the error between the estimated probability p(y) ˆ and target q(y) the other R ˆ ≡ dy p(y) ˆ log[p(y)/q(y)]—we ˆ way around—K(q, p) obtain exactly the same decomposi¯ ˆ (see, e.g., Genest tion, but with as the average model the linear opinion pool p(y) = Ep(y) & Zidek, 1986; Jacobs, 1995, for a discussion of linear opinion pools). This error measure, however, is much less in use since it cannot be transformed into a log-likelihood for a (finite set of) observation(s) instead of a target probability.
1428
Tom Heskes
ˆ In that case, it is more appropriate to consider the log-likelihoods log p(t) for which we, following the same lines, obtain ˆ = − log p(t) ¯ + EK(p, ¯ p) ˆ . −E log p(t)
(2.5)
The first term on the right-hand side is the error of the average model, the second term the variance of the models in the ensemble. For a further decomposition of the error of the average model into an intrinsic noise term and a bias term, we again have to integrate over the probability density generating the targets.3 This then yields ·Z ¸ ˆ Error = −E dt q(t) log p(t) Z ¯ + EK(p, ¯ p) ˆ = − dt q(t) log q(t) + K(q, p) = intrinsic noise + bias + variance .
(2.6)
The decompositions (see equations 2.4 and 2.6) differ only in their definition of the error function. With the error definition in equation 2.6, the intrinsic noise term is equal to the Shannon entropy of the density q(y). In the following examples, we will illustrate the decomposition in equation 2.5 for a single observation t. 3 Examples The mean-squared error is a special case of the Kullback-Leibler divergence ˆ as estimates of the mean of a normal disif we interpret model outputs m tribution with some fixed variance σ 2 : r · ¸ ˆ 2 −(y − m) 1 ˆ exp . p(y) = 2πσ 2 2σ 2 The logarithmic opinion pool (see equation 2.3) yields as the average model ¯ ¯ = Em, ˆ as p(y) a gaussian with the same standard deviation and mean m expected. The decomposition (see equation 2.5) is, up to an irrelevant proportionality constant, equivalent to the usual one as, for example, in Geman et al. (1992): h i i h ˆ − m) ¯ 2 . ¯ − t)2 + E (m ˆ − t)2 = (m E (m ˆ and As a generalization, we consider the case where we have estimates m σˆ 2 for both the mean and the variance (see, e.g., Bishop & Qazaz, 1997; 3 This can be easily illustrated on the mean-squared error. Suppose (see the first exam¯ when the target is t. Without knowing the ple below) that the average model predicts m distribution from which the targets t are drawn, it is impossible to decompose the error ¯ − t)2 into a separate noise and bias term. (m
Bias/Variance Decompositions
1429
¯ Williams, 1996, and references therein). The average model p(y) is still a ¯ and variance σ¯ 2 obeying gaussian with mean m µ ¶ µ ¶ ˆ ¯ 1 m m 1 = E = E and , σ¯ 2 σˆ 2 σ¯ 2 σˆ 2 that is, the logarithmic opinion pool (see equation 2.3) leads to an averaging of reciprocal variances and a weighted averaging of the estimated means. The decomposition (see equation 2.5) yields ¸ · ¸ · ¯ − t)2 ˆ − t)2 (m (m 2 2 + log σ ˆ + log σ ¯ = E σˆ 2 σ¯ 2 µ 2 ¶¸ · ˆ − m) ¯ 2 σˆ (m + log . +E 2 σˆ σ¯ 2 The first term between brackets on the right-hand side is the error of the average model; the second term measures the variance of the different estimators. A new decomposition is obtained for the cross-entropy or logarithmic scoring function that can be used for classification purposes. We consider the binary case with Y a binary random variable, for example, Y ∈ {0, 1}. In the ˆ shorthand notation pˆ ≡ p(1), the logarithmic opinion pool (see equation 2.3) yields ¶ · µ ¶¸ µ pˆ p¯ = E log , log 1 − p¯ 1 − pˆ that is, the average model can be found by averaging the logits (log-odds) of the estimated probabilities. Given an observed target t, the decomposition (see equation 2.5) can be written as £ ¤ ˆ = t log p¯ + (1 − t) log(1 − p) ¯ E t log pˆ + (1 − t) log(1 − p) ¶¸ µ · µ ¶ 1 − p¯ p¯ ¯ log . (3.1) + (1 − p) − E p¯ log 1 − pˆ pˆ This decomposition can be contrasted with the one proposed in Wolpert (1997), which for the binary case in our notation reads £ ¤ ˆ = t log p¯ + (1 − t) log(1 − p) ¯ E t log pˆ + (1 − t) log(1 − p) ¶¸ µ · µ ¶ 1 − p¯ p¯ , (3.2) + (1 − t) log − E t log 1 − pˆ pˆ ˆ The where the average model is the linear opinion pool,4 that is, p¯ = Ep. main disadvantage of this decomposition is that the variance term still de4 Equation 3.2 is, of course, true for any definition of the average model p. ¯ Only by averaging the logits can one make the variance independent of the target t and arrive at equation 3.1.
1430
Tom Heskes
pends directly on the target t. Whenever the expectation E is defined by averaging over models optimized on training sets generated from the target distribution, the variance term in equation 3.1 also depends on the target distribution (see Wolpert, 1997, for a full exposition of this point). However, keeping the operation E fixed, for example, by keeping the distribution over training sets the same, the variance in equation 3.1 is independent of the (distribution of the) target t, whereas the variance in equation 3.2 does depend on the target t. Most recent articles on bias/variance decompositions focus on zero-one loss for classification tasks. Given the target class label t, the loss is 0 if the model’s estimate yˆ equals t and 1 otherwise. As we will see, we can try to interpret zero-one loss as a limit case of a log-likelihood-type error. Suppose that we transfer the classification yˆ into a probability statement that assigns ˆ probability 1 to class yˆ and probability ² ¿ 1 to all other class labels y 6= y: ½ pˆ² (y) =
1 ²¿1
ˆ if y = y, ˆ if y 6= y.
In principle we should normalize this distribution, but it is easy to show that for small ², this normalization constant can be set to one. We call f (y) the fraction of models that assigns the class label y, that is, f (y) = Eδy,y ˆ . Application of equation 2.3 then yields in leading order of ², p¯² (y) = ² maxy0
f (y0 )− f (y)
,
(3.3)
that is, in the limit ² → 0, the average model is nothing but the majority vote y¯ = argmax f (y). Decomposition (see equation 2.5) is still valid: y
−E log pˆ² (t) = − log p¯² (t) + EK(p¯² , pˆ² ). If we divide by − log ² and take the limit ² → 0 on both sides, we arrive at the decomposition £ ¤ £ ¤ ¯ − f (t) + 1 − f (y) ¯ . 1 − f (t) = f (y)
(3.4)
Considering the way in which we have arrived at this decomposition for zero-one loss, we are tempted to call the second term between brackets the variance. However, in taking the limit ² → 0, we have lost the interpretation of the first term as the error of the average model. The crux is that the average model (see equation 3.3) in leading order of ² still depends on the classification frequencies f (y), which for that reason also appear in equation 3.4. The average model for ² = 0, on the other hand, depends on only the majority vote and is independent of the exact frequencies f (y). For
Bias/Variance Decompositions
1431
a further decomposition of the first term into a bias term and an inherent noise term, we have to sum over the distribution q(t) that generated the class labels t. Most authors define the inherent noise term to be the error of the Bayes classifier and ascribe the remaining term to the bias. The exact decomposition seems to be somewhat arbitrary, since in practice one is interested only in changes in the bias and variance terms rather than in their absolute values. Our definition of variance is equivalent to those given in Tibshirani (1996) and James and Hastie (1997). 4 Discussion We slightly reformulate what in James and Hastie (1997) are called obvious requirements for a bias/variance decomposition. These requirements are similar in spirit to the desiderata stated by Wolpert (1997): 1. The decomposition for the mean-squared error is a special case. 2. The variance does not depend on the target distribution directly. Furthermore, it is nonnegative and zero iff all estimators are equivalent. 3. The bias depends on only the target distribution and the “average model,” which is defined as the model minimizing the variance. The main result of this note is that for any likelihood-based estimator, it is indeed possible to find a decomposition fulfilling these requirements. To see that this is nontrivial, we will sketch how many decompositions are derived (see, e.g., Dietterich & Bakiri, 1995; Kohavi & Wolpert, 1996; Wolpert, 1997). For convenience, we will stick to the probabilistic notation. One starts by ˆ ¯ translating the models p(y) into some average model p(y) and defines the bias to be the error between this model and the target, minus the lowest error that can be obtained by any learning algorithm. In our notation we have ¯ − K(q, q). Bias = K(q, p) The variance is defined as the part of the error that cannot be attributed to the noise and the bias: ˆ − K(q, p). ¯ Variance = EK(q, p) In principle, there is no need for this variance to fulfill the second requirement. In fact, this is where previously proposed bias/variance decompositions of Kullback-Leibler divergence (see, e.g., Hall, 1987; Wolpert, 1997)
1432
Tom Heskes
have to give in. However, we have shown that for any likelihood-based estimator: ¯ 1. There is an average model p(y) such that the variance no longer directly depends on the target density q(y). 2. This variance is the average error to this average model. 3. This average model is the model that yields the lowest variance. The mean-squared error, for which these nice properties have been known for long, appears to be nothing but a special case. Only in some limit case, zero-one loss can be interpreted as a kind of Kullback-Leibler divergence. The resulting decomposition still obeys the first and second requirement, but the limiting operation “destroys” the third requirement: the bias is no longer just a function of the average model. None of the bias/variance decompositions for zero-one loss suggested in the literature (see Breiman, 1996; Dietterich & Bakiri, 1995; Friedman, 1996; Kohavi & Wolpert, 1996; Tibshirani, 1996; Wolpert, 1997, and see James & Hastie, 1997, for a discussion of most of them) satisfies all three requirements.5 Most of them either define the bias and take for granted that the variance depends on the distribution of targets (the approach sketched in the beginning of this discussion), or start by defining the variance and are left with the difficult task to interpret the bias. The natural decomposition for likelihood-based estimators obtained in this note may be seen as an argument in favor of the latter approach. Acknowledgments I thank Bert Kappen and the anonymous referees for useful comments on drafts of this article and Padhraic Smyth for pointing me to the bias/variance decomposition in Hall (1987). Special thanks to Wim Wiegerinck for showing me the connection between Kullback-Leibler divergence and zero-one loss. This research was supported by the Dutch Technology Foundation, applied science division of the Dutch Organization for Scientific Research, and the technology program of the Ministry of Economic Affairs. References Bishop, C., & Qazaz, C. (1997). Regression with input-dependent noise: A Bayesian treatment. In M. Mozer, M. Jordan & T. Petsche (Eds.), Advances in neural information processing systems 9 (pp. 347–353). Cambridge, MA: MIT Press. 5 It is possible simply to define the bias and the variance term such that they fulfill all three requirements, but then they do not add up to give the loss function (see James & Hastie, 1997).
Bias/Variance Decompositions
1433
Bordley, R. (1982). A multiplicative formula for aggregating probability assessments. Management Science, 28, 1137–1148. Breiman, L. (1996). Bias, variance, and arcing classifiers (Tech. Rep.). Berkeley, CA: University of California. Available from: http://www.stat. berkely.edu/users/breiman. Dietterich, T., & Bakiri, G. (1995). Solving multiclass learning problems via errorcorrecting output codes. Journal of Artificial Intelligence Research, 2, 263–286. Friedman, J. (1996). On bias, variance, 0/1-loss, and the curse of dimensionality (Tech. Rep.). Stanford: Department of Statistics, Stanford University. Available from: http://stat.stanford.edu/people/faculty/friedman.html. Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58. Genest, C., & Zidek, J. (1986). Combining probability distributions: A critique and an annotated bibliography. Statistical Science, 1, 114–148. Hall, P. (1987). On Kullback-Leibler loss and density estimation. Annals of Statistics, 15, 1491–1519. Heskes, T. (1998). Selecting weighting factors in logarithmic opinion pools. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press. Jacobs, R. (1995). Methods for combining experts’ probability assessments. Neural Computation, 7, 867–888. James, G., & Hastie, T. (1997). Generalizations of the bias/variance decomposition for prediction error (Tech. Rep.). Stanford: Department of Statistics, Stanford University. Available from: http://playfair.Stanford.EDU/˜gareth/. Kohavi, R., & Wolpert, D. (1996). Bias plus variance decomposition for zero-one loss functions. In L. Saitta (Ed.), Proceedings of the 13th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann. Tibshirani, R. (1996). Bias, variance and prediction error for classification rules (Tech. Rep.). Toronto: University of Toronto. Available from: http:// utstat.toronto.edu/tibs/research.html. Williams, P. (1996). Using neural networks to model conditional multivariate densities. Neural Computation, 8, 843–854. Wolpert, D. (1997). On bias plus variance. Neural Computation, 9, 1211–1243. Received June 3, 1997; accepted January 12, 1998.
NOTE
Communicated by Erkki Oja
The Influence Function of Principal Component Analysis by Self-Organizing Rule Isao Higuchi Graduate University for Advanced Studies, Tokyo, Japan
Shinto Eguchi Institute of Statistical Mathematics, Tokyo, Japan
This article is concerned with a neural network approach to principal component analysis (PCA). An algorithm for PCA by the self-organizing rule has been proposed and its robustness observed through the simulation study by Xu and Yuille (1995). In this article, the robustness of the algorithm against outliers is investigated by using the theory of influence function. The influence function of the principal component vector is given in an explicit form. Through this expression, the method is shown to be robust against any directions orthogonal to the principal component vector. In addition, a statistic generated by the self-organizing rule is proposed to assess the influence of data in PCA. 1 Introduction Recently a number of new methods comparable to classical methods in statistical inference have been proposed from the neural network perspective. This approach is considered to be promising and is heavily oriented to computer implementation. Theoretically, however, performances of these new method have not been completely elucidated. One of these new methods addresses itself to principal component analysis (PCA) and a method for PCA by the self-organizing rule was proposed by Xu and Yuille (1995). But the robustness of their method was shown only through simulation study. The objective of this article is to give theoretical support for PCA by the self-organizing rule. The PCA is a method to reduce multidimensional data to lowerdimensional data while minimizing the amount of information loss. In classical PCA, the eigenvector corresponding to the largest eigenvalue of the sample covariance matrix, called the principal component vector, extracts the features of the data by projecting them onto the vector. The classical PCA has been reported to suffer from the severe defect of nonrobustness when data contain outliers. There are two different approaches to tackling this problem: one is to detect the outliers in the influential analysis (see Critchley, 1985), and the other is to propose an alternative that makes the classical c 1998 Massachusetts Institute of Technology Neural Computation 10, 1435–1444 (1998) °
1436
Isao Higuchi and Shinto Eguchi
method more robust. These two approaches should be chosen in accordance with the purpose of the analysis and the properties of the observed data. The robustness is more useful if the data are automatically available online and the result of the PCA is simultaneously requested along a data generation process. The approach by outlier detection gives global information on the hull of the data and would be helpful for the investigation of reproducibility and consistency with the result. We will focus on the robustness aspect, while incorporating the approach by outlier detection. Let {xi ; i = 1, · · · , N} be a set of n-dimensional observed vectors with the sample covariance matrix, S=
N 1 X ¯ )(xi − x ¯ )T , (xi − x N i=1
(1.1)
P ¯ = N1 ¯ is the sample mean, x xi . The principal components are where x given through this inner products of data points xi , i = 1, · · · , N, and the principal component vector. One implements the classical method by solving the eigenvalue problem associated with S. The PCA is not robust because the sample covariance matrix is sensitive to outliers. In the neural network approach, the rule of modifying the connection weights for online data has been studied using learning theory (see Amari, 1977, and Oja, 1982). When a simple linear neuron is considered, the output is given as the inner product of the input and the vector of connection weights. The vector of connection weights of the neuron attained by the learning rule corresponds to the principal component vector in the PCA. Thus, the learning rule of the neuron yields the principal component vector. The learning rule is usually organized by the algorithm for minimizing the energy function on the analogy of statistical physics. The energy function with discrimination of outliers has been discussed to ensure the robustness of self-organizing (see Xu & Yuille, 1995). Thus, the energy function is designed for detecting outliers in the search for the principal component vector. The PCA by the self-organizing rule aims at stable behavior only against outliers orthogonal to the principal component direction. In general, to assess the behavior under small contamination in data, the influence function has been defined (see Hampel, 1974, and Huber, 1981). In this article, we give a closed expression of the influence function of the principal component vector by the self-organizing rule and show that the boundedness property of the PCA by self-organizing rule depends on the direction of a possible outlier. It is remarkable that the influence function smoothly vanishes when the variable is remote from the line associated with the principal component vector. In section 2, we review the PCA and neural networks and introduce the method of Xu and Yuille (1995) for the PCA by the self-organizing rule. We
Influence Function of Principal Component Analysis
1437
propose a modification of the algorithm. Then, in section 3, we calculate the influence function for our method and discuss its use. We give an example in section 4 and present the conclusion in Section 5. 2 PCA and Neural Networks The principal component vector is defined as the vector of coefficients of the linear combination of the observed vector x such as the maximum variance among all the linear combinations of x. The principal component vector ˆ 1 of the sample covariance matrix S, defined in equais the eigenvector µ tion 1.1, corresponding to the largest eigenvalue: P T ˆ 1 (xi − x ¯ ))2 (µ ˆ 1. ˆ1 = µ (2.1) Sµ 2 ˆ 1k kµ ˆ 1 in the classical method is the minimizer The principal component vector µ of the energy function, the sum of squared lengths of the residual vectors ¯ when they are projected onto a direction m, of xi − x N X
¯ , m), z(xi − x
i=1
with respect to m, where z(x, m) = kxk2 −
(mT x)2 . kmk2
A linear neuron is a unit of information process from the input vector x to output of the form mT x, where m is called the connection weight vector and controls the input-output relation of the neuron. The learning process of a neuron leads to a modification of m in accordance with the data set {xi }. Thus, the self-organizing rule is formed by the learning based on each pair of input xi and output mT xi . Xu and Yuille (1995) proposed the modified algorithm defined by
m(t+1) = m(t) + α(t)
N X
¯ , m(t)) γ (t, xi − x
i=1
∂ ¯ , m(t)), (2.2) z(xi − x ∂m
where α(t) is the step size and γ (t, x, m(t)) =
1 . 1 + exp[β(t){z(x, m(t)) − η(t)}]
(2.3)
Here β(t) in equation 2.3 corresponds to the inverse of temperature in the context of statistical physics, and η(t) is the threshold value associated with the neuron. We will control them so that lim β(t) = β0
t→∞
and
lim η(t) = η0 .
t→∞
(2.4)
1438
Isao Higuchi and Shinto Eguchi
If γ is a constant, then the algorithm is reduced to that discussed by Amari (1977) and Oja (1982), which is equivalent to the classical PCA. Algorithm 2.2 searches for the maximizer of the marginal likelihood, N X Y
¯ m) + η(1 − Vi )}] exp[−β{Vi z(xi − x,
i=1 Vi =0,1
where V = {Vi } is a binary field, that is, Vi = 0 or 1. Each Vi plays the role of a decision indicator for deciding whether xi is an outlier. If xi is judged as an outlier, then Vi = 0. In Xu and Yuille (1995) the simulation result proved this method is robust. In the subsequent discussion, we will show that the PCA method is obtained through the eigenvalue problem associated with the adaptively weighted covariance matrix, Sγ =
N X
¯ )(xi − x ¯ )T , γ0 (xi )(xi − x
i=1
where ¯ , m(t)). γ0 (x) = lim γ (t, x − x t→∞
(2.5)
Thus, it suffices to investigate the behaviors of S and Sγ in order to compare the classical PCA and the PCA discussed here. In the following section we give a more tractable expression for γ0 (x). 3 Influence Function of Self-Organizing PCA Hampel (1974) presented the notion of influence function for the assessment of local behaviors of a target statistic, say TN , based on the data {xi : i = 1, . . . , N}. We assume that TN has an expression as a functional, say T(F¯N ). Here F¯N denotes the empirical distribution function on the histogram of data, N 1 X δx (x), F¯N (x) = N i=1 i
where δy (x) = 1 at x = y and zero otherwise. In practice almost all statistics are explicitly or implicitly of the form of such functionals if the data come ¯ can be written as from random sampling. For example, the sample mean x a functional of F¯N ,
x¯ ≡
Z N 1 X xi = xdF¯N (x), N i=1
Influence Function of Principal Component Analysis
1439
and so can the sample covariance matrix Z N 1 X T ¯ )(xi − x ¯ ) = (x − x ¯ )(x − x ¯ )T dF¯N (x). ( xi − x S≡ N i=1 ˆ 1 can be viewed as a functional In this way, the principal component vector µ ˆ 1 (F¯N ) is implicitly defined by equation 2.1. The ˆ 1 (F¯N ) of F¯N , though µ µ influence function of T(F¯N ) against y is defined by T[(1 − ε)F¯N + εδy ] − T(F¯N ) . ε→0 ε
IF(y ; F¯N , T) = lim
Thus the influence function gives information on local sensitivity against gross error. Let N be sufficiently large. Then we have IF(y ; F¯N , T) ≈ N(T(F¯N+1 ) − T(F¯N )), where F¯N+1 =
N ¯ 1 FN + δy . N+1 N+1
Hence the influence function gives approximately the difference between the actual statistic T(F¯N ) and the statistic T(F¯N+1 ) with an additional outlier y . If the influence function is bounded, then the influence of the statistic by any y should have a mild effect. On the other hand, unboundedness of the influence function indicates potential fatal sensitivity of the statistic to some y , even for a large sample size. For the classical PCA, the influence function of the principal component ˆ 1 is vector µ ˆ 1 ) = −ˆa1 (y ) IFclassic = IF(y ; F¯N , µ
n X
ˆ j, aˆj (y )(λˆ j − λˆ 1 )−1 µ
(3.1)
j=2
P
ˆ jT (y − x ¯ ), and µ ˆ j , λˆ j or equivalently aˆj (y ) = µ are the normalized eigenvectors and eigenvalues λˆ 1 > λˆ 2 > · · · > λˆ n of S ˆ 1 ) = 0, then y − x ¯ must be in the line spanned (Critchley, 1985). If IF(y ; F¯N , µ ˆ 1 or orthogonal to the line. Further, we observe that by µ ¯ = where y − x
ˆj (y )µ ˆ j, ja
¯ ; F¯N , µ ˆ 1 ) = λ2 IF(y ; F¯N , µ ˆ 1 ), IF(λy + (1 − λ)x for any λ. This shows the sensitivity of the classical PCA, because ¯ ; F¯N , µ ˆ 1 )k = O(λ2 ), kIF(λy + (1 − λ)x
(3.2)
ˆ 1 ) 6= 0. In this sense the classical principal for any y such that IF(y ; F¯N , µ component vector is clearly nonrobust. The approach by the sensitivity analysis is taken to measure the influence of each observation xi on the classical PCA.
1440
Isao Higuchi and Shinto Eguchi
We take another approach to PCA by the neural network method, as introduced in section 2. We will show in the appendix that m(t) defined in ˜ 1 of Sγ which corresponds to equation 2.2 converges to the eigenvector µ the maximum eigenvalue λ˜ 1 . It follows from assumption 2.4 that the limit γ0 (y ) defined in equation 2.5 is expressed as −1 n X a˜j2 (y ) − η0 , γ0 (y ) = 1 + exp β0 j=2
P
˜j (y )µ ˜ j , or equivalently a˜j (y ) ja
˜ jT (y −x ¯ ). Here µ ˜ 1, . . . , µ ˜n =µ are the unit eigenvectors of Sγ , corresponding to eigenvalues λ˜ 1 > λ˜ 1 > · · · > λ˜ n , respectively. We can similarly calculate the influence function of ˜ 1 from the influence function of Sγ as the principal component vector µ follows: ¯ = where y −x
˜ 1 ) = −γ0 (y )˜a1 (y ) IFneural = IF(y ; FN , µ
n X j=2
a˜j (y)λj ˜ j. µ λ˜ j (λj − λ1 )
(3.3)
It follows from equations 3.1 and 3.3 that the difference between the influence functions IFclassic and IFneural is only whether the scalar function γ0 (y ) is multiplied. A similar argument to equation 3.2 leads to ¯ ; F¯N , µ ˜ 1) = IF(λy + (1 − λ)x where C1 = β0
Pn
˜j2 (y ). j=2 a
˜ 1) IF(y ; F¯N , µ λ2 , 1 + exp(C1 λ2 − β0 η0 ) γ0 (y )
˜ 1 against λy + Thus the influence function of µ
¯ is a bounded vector, and as λ2 → ∞, the limit becomes a zero (1 − λ)x vector. The scalar γ0 (y ) quickly vanishes as y increases with any direction ˜ 1 direction. We furthermore pursue the behavior of γ0 (y ) by except for the µ Pn 2 a˜j (y ), that is, viewing it as a function of z = j=2 γ0 (y ) =
1 . 1 + exp{β0 (z − η0 )}
We observe that the graph γ0 (y ) against z varies in accordance with a few values of β0 and η0 (see Figure 1). Here, β0 and η0 are the limits of the inverse temparature β(t) and the threshold value η(t), respectively, as given by equation 2.4. In accordance with this, we observe that the temperature and threshold values jointly play the direct role on the robustness of the neural PCA against outliers. In the following section, we suggest another use of γ0 (y ) in an influential analysis against outliers. 4 Example In this section, we apply our method to soil composition data (Kendall, 1975, Tables 2.1, 2.4). There are 20 observations with 4 variables. The principal
Influence Function of Principal Component Analysis
1441
Figure 1: Graphs of γ0 (y).
component vector is PC = (0.955785, 0.293681, 0.014972, 0.001317). In this case, we choose α(t) = 0.001/t, η(t) = 10 (constant), and β(t) = 10 (constant). The result of our method is
m0 = (0.953688, 0.300199, 0.012510, 0.014235). The cosine of the angle between PC and m0 is 0.99989008. Thus, the two vectors are nearly equal. The statistics |IFclassic | for soil numbers 4, 13, and 17 are large, as seen in Table 1, showing that these data are highly influential on the principal
1442
Isao Higuchi and Shinto Eguchi
Table 1: Comparison Between Influence Measures |IFclassic | and γ0 for Soil Composition Data. Soil
|IFclassic |
γ0
Soil
|IFclassic |
γ0
16 20 12 3 5 6 10 14 19 15
0.00162 0.02565 0.03021 0.04696 0.05993 0.06288 0.07236 0.08564 0.08949 0.15425
1 1 1 1 4.97 × 10−26 1 1 1.49 × 10−39 1 1
18 9 11 2 8 7 1 17 13 4
0.16165 0.18078 0.19051 0.19452 0.20902 0.23871 0.27304 0.33459 0.45593 0.80959
1 1 0.98581 1 1 1 1 1.02 × 10−7 1.04 × 10−60 1.72 × 10−44
Table 2: Projected Values of Soil Composition Data by the Classical PCA. Soil
1st PC
2d PC
3d PC
1 2 3 4 5 6 7 8 9 10
−9.557 −13.070 −1.458 13.073 −1.060 −13.294 −10.997 14.680 14.839 3.337
−2.203 −0.987 −2.536 −4.803 −4.233 −0.122 1.708 −0.981 0.484 −1.564
0.587 −0.154 0.707 −0.034 −0.226 0.329 −0.692 −0.962 −0.068 0.253 0.067 0.381 −0.276 0.023 0.231 0.570 0.661 0.626 0.356 0.704
4th PC
Soil 11 12 13 14 15 16 17 18 19 20
1st PC
3d PC
4th PC
4.887 −3.090 −0.275 −1.012 2.090 −1.209 6.799 5.283 0.788 1.457 4.401 −1.658 8.410 1.382 0.497 −0.142 0.490 −0.632 7.822 3.271 0.849 −10.269 0.861 0.630 −12.242 0.195 0.104 −2.203 0.355 −0.442
2d PC
0.012 0.183 −0.162 0.416 −0.096 −0.543 −0.555 −0.759 0.586 −0.816
component vector. This observation is nearly equivalent to that for the principal value given in Critchley (1985). Let us compare the performance of the statistic γ0 with that of |IFclassic |. This statistic γ0 also detects the three soil numbers 4, 13, and 17 as influential data. Further, two soil numbers, 5 and 14, have relatively small values of γ0 . This would suggest that γ0 detects a masking effect beyond the one-delete detection. As shown in Table 2, the two vectors of principal components corresponding to soil numbers 5 and 14 are almost symmetric with regard to the origin. 5 Conclusions Our method can remove the bad influence by outliers, which fails in the classical PCA. This advantageous method is characterized by the region
Influence Function of Principal Component Analysis
1443
of tube with centerline µ1 . The usual robustification of PCA, the method using a robust estimator of the sample covariance matrix (Maronna, 1976), is characterized by the region of ellipse with the mean as its center. By this usual method, the data outside the ellipse are truncated. By our method, in contrast, data points outside the ellipse are not discarded as long as they lie in the direction of the principal component vector. This suggests that our method has an advantage in achieving asymptotic efficiency. 1/2 The tube characterizing the influence depends on the radius η0 . If η0 is small, then most of the data are regarded as outliers, and thus the result of our algorithm has little information. On the other hand, when η0 is large, this algorithm is not robust. Therefore, the choice of η0 is an important factor. Appendix ˜ 1 . For conFirst, we show that m(t) defined in equation 2.2 converges to µ venience, we rewrite the algorithm in terms of the empirical distribution of the data. By using the empirical distribution F¯N , equation 2.2 is rewritten as Z m(t + 1) = m(t) + α(t) γ (t, x∗ , m(t)) ¸ · {m(t)T x∗ }2 m (t) dF¯N (x), × {m(t)T x∗ }x∗ − km(t)k2
(A.1)
¯ . By taking the limit as t → ∞, we get where x∗ = x − x Sγ µ1 −
µT1 Sγ µ1 µ1 = 0, µT1 µ1
˜ of where µ1 = limt→∞ m(t). Thus, µ1 is the principal component vector µ Sγ . Second, we give a derivation of equation 3.3 for the influence function of the robust PCA by the self-organizing rule. We write F¯N,ε = (1 − ε)F¯N + εδy . Based on F¯N,ε , equation A.1 is changed to
mε (t + 1) = mε (t) · Z ∂ z(x∗ε , mε (t))dF¯N (x) +α(t) (1 − ε) γ (t, x∗ε , m(t)) ∂m ¸ ∂ z(y ∗ε , mε (t)) , + εγ (t, y ∗ε , m(t)) ∂m ¯ ε , y ∗ε = y − x ¯ ε , and x ¯ ε = (1 − ε)x ¯ + εy . We write µ1,ε = where x∗ε = x − x limt→∞ mε (t); hence µ1,ε is the principal component vector of Sγ ,ε obtained
1444
Isao Higuchi and Shinto Eguchi
by replacing by F¯N (y ) with F¯N,ε in the definition of Sγ . By definition, µ ¶ Z ∂γ (t, x∗ε , mε (t)) ∗ ∗T ¯ xε xε dFN (x) Sγ ,ε = (1 − ε) Sγ + ∂ε + ε lim γ (t, y ∗ , mε (t))y ∗ y ∗ T + O(ε2 ), t→∞
¯ . Differentiating Sγ ,ε with regard to ε, we get the influence where y ∗ = y − x function of Sγ as IF(y ; F¯N , Sγ ) = −Sγ − 1∗ + γ0 (y )y ∗ y ∗ T , R ˜ T1 x∗ )x∗ x∗T dF¯N (x)}M, with M = (µ˜ 1 , . . . , µ˜ n ) and where 1∗ = MT { ∂γ ∂z (µ β(z−η) ˜ 1 by ). Finally, we get the influence function 3.3 of µ γ (z) = 1/(1 + e Sibson’s lemma (1979). References Amari, S.-I. (1977). Neural theory of association and concept formation. Biol. Cybernetics, 26, 175–185. Critchley, F. (1985). Influence in principal components analysis. Biometrika, 72, 627–636. Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. Asso., 69, 383–393. Huber, P. J. (1981). Robust statistics. New York: Wiley. Kendall, M. G. (1975). Multivariate analysis. London: Griffin. Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Annals of Statistics, 4, 51–67. Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273. Sibson, R. (1979). Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. J. R. Statist. Soc. B, 41, 217–229. Xu, L., & Yuille, A. (1995). Robust principal component analysis by selforganizing rules based on statistical physics approach. IEEE Trans. on Neural Networks, 6, 131–143. Received June 30, 1997; accepted December 10, 1997.
NOTE
Communicated by Bruno Olshausen
A Sparse Representation for Function Approximation

Tomaso Poggio and Federico Girosi
Center for Biological and Computational Learning and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
We derive a new general representation for a function as a linear combination of local correlation kernels at optimal sparse locations (and scales) and characterize its relation to principal component analysis, regularization, sparsity principles, and support vector machines.

1 Introduction

Consider the classical regression problem of approximating a multivariate function from a finite set of data. The data are given as a set $D_l \equiv \{(x_i, z_i) \in X \times Z\}_{i=1}^{N}$, obtained by sampling N times the set X × Z according to P(x, z). The goal is to estimate a deterministic function, f(x), that models the relationship between X and Z and thereby solves the associated regression problem. In the form of the problem that we consider in this article, we also assume that probabilistic information about the underlying function f is available in terms of the associated correlation function: f belongs to a set of functions $\{f_\alpha\}$ over which a probability distribution P(α) is defined. Then the correlation function of the random signal f, of which the fα are realizations, is

$$R(x, y) = E[f_\alpha(x) f_\alpha(y)], \qquad (1.1)$$
where E[·] denotes expectation with respect to P(α). We will first assume that we have a dense set of data, possibly on a regular grid. Later we derive our main result in the case in which we use as few of the data points as possible, at sparse optimal locations. Our standard example throughout the article is the regression problem of reconstructing a specific image f given its pixel values at discrete locations in the image plane. Thus, x is a vector in the image plane, and the fα are images of the same type as f, all aligned and registered (for instance, images of faces). This example is a problem of signal reconstruction (or image compression) when prior information about the type of images is available. We rely heavily on the formulation and the results of Girosi (1998). Part of the motivation for our work comes from the attempt to develop a rigorous framework for the ideas in Penev and Atick (1996) of using a correlation kernel in image reconstruction.
2 The PCA Representation and the Dual (Regularization) Representation

Since the correlation function R defined in equation 1.1 is positive definite, it can be used to define a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950; Wahba, 1990; Girosi, 1998). It is known that R can be represented as

$$R(x, y) = \sum_{n=1}^{M}\lambda_n\phi_n(x)\phi_n(y), \qquad (2.1)$$

where M ≤ ∞ and the eigenfunctions φn are the solutions of

$$\int dy\, R(x, y)\,\phi_n(y) = \lambda_n\phi_n(x).$$

The system of orthogonal functions φn can be made orthonormal and ordered with decreasing positive eigenvalues λn. The basis functions φn provide a natural representation for solving our regression problem since they span the RKHS. Any function f in the RKHS can be written as

$$f(x) = \sum_{n=1}^{M} c_n\phi_n(x). \qquad (2.2)$$
As we mentioned, M ≤ ∞, but in this article we will assume that M is finite (which is always the case if the signals are discretized). The classical principal component analysis (PCA) approach is to use an approximation of equation 2.2 in terms of a finite number of principal components φn and to estimate the parameters cn from the N data points by minimizing the approximation error.

The framework of regularization theory described in Wahba (1990) and Girosi (1998) suggests that our regression problem can also be solved more directly by minimizing the natural regularization functional induced by R. Thus, we propose to find the f that minimizes

$$H[f] = \sum_{i=1}^{N}(z_i - f(x_i))^2 + \gamma\|f\|_R^2, \qquad (2.3)$$

where

$$\|f\|_R^2 = \sum_n \frac{\langle f, \phi_n\rangle^2}{\lambda_n}$$

is the norm in the RKHS. Minimization of equation 2.3 yields the dual regularization representation:

$$f(x) = \sum_{i=1}^{N} c_i R(x, x_i). \qquad (2.4)$$
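For the quadratic loss in equation 2.3, the coefficients ci of equation 2.4 can be obtained by solving a linear system, a fact derived explicitly in the companion article (Girosi, 1998, appendix B). The following minimal numpy sketch is our own illustration, not part of the original article: the ensemble of sample signals and the sampling grid are hypothetical stand-ins for the image ensembles discussed below.

    import numpy as np

    # Sketch: estimate the correlation kernel R from an ensemble of sample
    # signals, then solve the regularized regression of equations 2.3-2.4.
    rng = np.random.default_rng(0)
    grid = np.linspace(0, 1, 100)                    # sample locations x
    F = np.array([np.sin(2 * np.pi * f * grid + p)   # hypothetical signals f_alpha
                  for f, p in zip(rng.uniform(1, 3, 50),
                                  rng.uniform(0, 2 * np.pi, 50))])
    R = F.T @ F / len(F)                             # R(x, y) = E[f_a(x) f_a(y)]

    # Observe one realization z_i = f(x_i) at a sparse subset of locations.
    f_true = F[0]
    idx = rng.choice(len(grid), size=20, replace=False)
    z = f_true[idx]

    # Minimizing H[f] (equation 2.3) over the RKHS of R gives the dual
    # representation f(x) = sum_i c_i R(x, x_i) with (K + gamma I) c = z.
    gamma = 1e-3
    K = R[np.ix_(idx, idx)]                          # K_ij = R(x_i, x_j)
    c = np.linalg.solve(K + gamma * np.eye(len(idx)), z)
    f_hat = R[:, idx] @ c                            # reconstruction on the full grid
    print("max reconstruction error:", np.abs(f_hat - f_true).max())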
The two representations are equivalent (under the same error criteria; see Poggio & Girosi, 1998) when the number of principal components is chosen equal to M, in the limit γ = 0 (in both cases, even for γ different from 0, the coefficients are linear functionals of the data). Notice that unlike the global φn, the basis functions R(x, xi) are usually quite local. Consider, for instance, the translation-invariant case of natural images, where the φn are Fourier components, while the correlation has a localized peak.

3 A Sparse Representation

Our regularization representation for the regression function is in terms of a weighted sum of correlation kernels, each centered at a data point xi. This dual representation contains a large number of terms if the number of data points is large (for instance, all pixels in an image). Given the results of Girosi (1998), it is natural to sparsify it by using support vector machine (SVM) regression (Vapnik, 1995; Vapnik, Golowich, & Smola, 1997) with the kernel R (which corresponds to the PCA features). Thus, instead of minimizing the regularization functional (see equation 2.3), we minimize the functional

$$H[f] = \sum_{i=1}^{N} |z_i - f(x_i)|_\varepsilon + \gamma\|f\|_R^2, \qquad (3.1)$$

where the robust ε-insensitive error function of Vapnik (1995) is used instead of the usual L2 norm on the data term:

$$|z_i - f(x_i, a)|_\varepsilon = \begin{cases} 0 & \text{if } |z_i - f(x_i, a)| < \varepsilon \\ |z_i - f(x_i, a)| - \varepsilon & \text{otherwise.} \end{cases} \qquad (3.2)$$
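As an illustration of the sparsification this loss induces, the following sketch is ours, not the authors': scikit-learn's SVR postdates the article and stands in for the QP solver, and a Gaussian kernel plays the role of the correlation kernel R.

    import numpy as np
    from sklearn.svm import SVR

    # Sketch: minimizing a functional of the form 3.1 with the
    # epsilon-insensitive loss yields sparse coefficients.
    rng = np.random.default_rng(1)
    x = np.sort(rng.uniform(0, 1, 60))
    z = np.sin(2 * np.pi * x)                          # noiseless samples z_i = f(x_i)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.02) # Gram matrix standing in for R(x_i, x_j)

    svr = SVR(kernel="precomputed", epsilon=0.05, C=1e4)  # large C: near-noiseless limit
    svr.fit(K, z)
    # Only the support vectors carry nonzero coefficients a_i in equation 3.3.
    print("nonzero coefficients:", len(svr.support_), "of", len(x))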
It has been shown (Girosi, Poggio, & Caprile, 1991; Girosi, 1998; Smola & Schölkopf, 1998) that the function that minimizes a regularization functional with a non-L2 norm on the data term has the same form as equation 2.4. In addition, as a result of using the particular error function (see equation 3.2), only a "small" number of coefficients ai will be different from zero: the data points associated with them are called support vectors. Thus, the function that minimizes equation 3.1 is

$$f(x) = \sum_{i=1}^{N'} a_i R(x, x_i), \qquad (3.3)$$
where the coefficients ai are obtained by solving the quadratic programming problem corresponding to equation 3.1. Since the total number of nonzero coefficients is usually many fewer than N, equation 3.3 is a sparse representation for f (x) in terms of local features, because the kernel R is usually
local. In addition to solving our original regression problem, it represents a general alternative to PCA. The representation (see equation 3.3) is characterized by the following properties:

1. It is a well-founded solution in terms of the generalization error of the regression problem because of the properties of SVM and its connections to Vapnik's theory of structural risk minimization (Vapnik, 1982, 1995).

2. It is a principled sparsification of the overcomplete dictionary of R(x, xi) because minimizing equation 3.1 is closely related to using the basis pursuit denoising (BPD) principle (Chen, 1995; Chen, Donoho, & Saunders, 1995; Girosi, 1998). Minimizing equation 3.1 in the limit of γ = 0 is equivalent, under the assumption of noiseless data $z_i = f(x_i)$, to minimizing the following functional with respect to the coefficients $a_i$:

$$E[a] = \frac{1}{2}\left\|f(x) - \sum_{i=1}^{N} a_i R(x; x_i)\right\|_R^2 + \varepsilon\|a\|_{L_1}. \qquad (3.4)$$
Equation 3.4 represents f as a superposition of local features R(x, xi) at sparse locations. The properties of R suggest an interesting, though partial, extension to include multiple scales. Equation 2.1 implies that for finite M, several different kernels can be defined, all corresponding to the same feature space of the principal components φn (notice that M is always finite when R is estimated from a finite number of observations fα). For instance, we may define a family of kernel functions associated with the correlation kernel R and parameterized by d as

$$R_d(x, y) = \sum_{n=1}^{M} (\lambda_n)^d\, \phi_n(x)\phi_n(y), \qquad (3.5)$$
of which R = R1 is a special case. It is not completely trivial to notice (following Penev & Atick, 1996; Poggio & Girosi, 1998) that d controls the locality of the kernel (see Figures 1 and 2). In the shift-invariant case, for which the φn are Fourier basis functions, d acts as a filter: low pass for increasing d and high pass for decreasing d. Thus, locality increases for decreasing d. For instance, when Rd is a gaussian function, d controls directly the effective variance of the gaussian. The most interesting values of d range between 0 and 1: R0, which is less smooth than R1, plays the role of the δ function in the space spanned by the φn, while Rd with negative d are similar to "derivatives" of the delta function.¹

¹ Other parameterizations of R are possible, for instance, in terms of the number M of φ.
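A minimal sketch of this construction, under our own assumptions (random vectors stand in for the aligned images, and numpy's eigendecomposition supplies the φn and λn):

    import numpy as np

    # Sketch: build the kernel family R_d of equation 3.5 from the
    # eigendecomposition of an empirical correlation matrix.
    rng = np.random.default_rng(2)
    signals = rng.standard_normal((512, 64))       # 512 hypothetical signals f_alpha
    R = signals.T @ signals / len(signals)         # empirical correlation matrix

    lam, Phi = np.linalg.eigh(R)                   # eigenvalues lambda_n, eigenvectors phi_n
    lam, Phi = lam[::-1], Phi[:, ::-1]             # sort by decreasing eigenvalue
    lam = np.clip(lam, 1e-12, None)                # guard against round-off

    def R_d(d):
        """R_d = sum_n lambda_n**d * phi_n phi_n^T; R_1 is R itself."""
        return (Phi * lam ** d) @ Phi.T

    assert np.allclose(R_d(1.0), R, atol=1e-8)             # d = 1 recovers R
    assert np.allclose(R_d(0.0), Phi @ Phi.T, atol=1e-8)   # R_0: identity on span of phi_n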
Figure 1: Pseudocolor plots (blue represents values near the zero mean; red represents high values) of R0, R0.5, and R1 (panels d = 0, d = 0.5, d = 1) for different xi as a function of x. Rd is not position invariant and is more "local" as d decreases. A set of 512 roughly aligned face images was used in this example.
Figure 2: Same as in Figure 1 (panels d = 0, d = 0.5, d = 1), but the set of images consisted of 924 images of pedestrians.
Rd can replace R in equations 2.3, 2.4, 3.1, 3.3, and 3.4. The set of Rd(x, xi), with i and d ranging over a "large" set of locations and scales, can be considered to be an overcomplete dictionary. We suggest sparsifying it in a similar way as R(x, xi), by using the BPD multiscale functional (see Poggio & Girosi, 1998), yielding a sparse, multiscale representation,

$$f(x) = \sum_{i,d}^{N', D'} a_{i,d}\, R_d(x, x_i). \qquad (3.6)$$
One possible way to obtain equation 3.6 from the SVM technique is the following (Poggio & Girosi, 1998). We assume that f(x) = f1(x) + f2(x), where f1 and f2 represent the components of f at two different scales (the generalization to the case of more than two scales is immediate). The functional to be minimized is

$$H[f] = \sum_{i=1}^{N} |z_i - f_1(x_i)|_{\varepsilon_1} + \eta\sum_{i=1}^{N} |z_i - f_1(x_i) - f_2(x_i)|_{\varepsilon_2} + \gamma_1\|f_1\|_{R_1}^2 + \gamma_2\|f_2\|_{R_2}^2,$$

where ε1 > ε2, and η is a positive number. The underlying idea is that f1 is a coarse approximation to the data at one scale (ε1 should be "large"), while f2 is a refinement at a finer scale (f2 approximates the residuals of f1, and ε2 should be "small"). Notice that the same formulation can be used with the corresponding standard regularization functional, whose minimization will yield a multiscale but not sparse solution.

4 Discussion

Let us summarize our main results. We have shown that:

1. A function can be represented either by the principal components induced by an associated correlation function or, in a dual and equivalent way, by the regularization solution: a weighted sum of correlation kernels evaluated at N data points.

2. The regularization solution can be sparsified using the SVM technique on the dictionary of basis functions R(x, xi), yielding a representation of f in terms of sparse, local features. Given the relation between SVM and BPD, this representation is not only almost optimally compact but also seems to have good generalization properties (Mukherjee, Osuna, & Girosi, 1997).

3. The new representation can be extended to include basis functions that are generalized correlations at different locations and scales.

Notice that the choice of the kernel R in regularization and in SVM can be justified in terms of the usual Bayes interpretation of the regularization
functional (see Poggio & Girosi, 1997, and Girosi, Jones, & Poggio, 1995). The representation of f as a superposition of correlation-like kernels at sparse locations is in general different from the PCA representation, even if the latter is used in conjunction with the sparsification cost functional 3.4 (Poggio & Girosi, 1998, appendix B), since it optimizes a different cost functional. Besides being the optimal solution of our original regression problem, it may also represent a general alternative to PCA in many signal processing tasks, such as reconstruction, compression, and even dimensionality reduction for classification. We expect different trade-offs. For instance, PCA may be more natural if the feature space has low dimensionality compared with the number of examples. This means that our technique may be better than PCA in situations in which only parts of an image are available and full reconstruction is desired. This may be the case when occlusions or nonuniform noise are present.

The regression problem we consider here is a problem of signal reconstruction. It is very different from the problem of pattern classification (Poggio & Girosi, 1998; Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1995). Following the spirit of Penev and Atick (1996), the same sparsified kernels computed for regression may be used after preprocessing in classification, in the same way in which principal components are often used, effectively representing a choice of sparse features from an appropriate large dictionary of basis functions provided by the Rd(x, xi) (Poggio & Girosi, 1998).

Much of the power of PCA (and of our representation) depends on using correlation functions that are not necessarily shift invariant (see Figures 1 and 2). Shift-invariant correlation functions are not very interesting from the point of view of the representations discussed here; they all correspond to the same set of Fourier features. Of course, the correlation function corresponding to a large set of images of different scenes and objects will be translation and scale invariant (see Penev & Atick, 1996, and the references therein). Properly aligned images of objects of the same type (such as faces or people; see Oren, Papageorgiou, Sinha, Osuna, & Poggio, 1997) yield correlation functions that are not shift invariant (see Sirovich & Kirby, 1987; see also Turk & Pentland, 1991). The associated φn features may be used to capture information about the specific class of images. They are, however, global. The correlation kernels Rd(x, xi) instead yield local multiscale "features," which can be sparsified and thereby simultaneously optimized for generalization.

We can speculate that the cortex may use special machinery to align and normalize visual inputs so that dictionaries of object-specific features may be learned without being affected by arbitrary translations and scalings. At earlier stages of the visual system, however, one may expect from our results that translation-invariant correlation functions associated with nonaligned images of different types will determine basis functions similar to local Fourier components.
Acknowledgments

We thank Mike Oren, Amnon Shashua, Bruno Olshausen, and Alessandro Verri for their useful comments. We thank Constantine Papageorgiou for providing us with the figures.
References

Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 68, 337–404.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992, July). A training algorithm for optimal margin classifiers. In Proc. 5th ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA.
Chen, S. (1995). Basis pursuit. Unpublished doctoral dissertation, Stanford University.
Chen, S., Donoho, D., & Saunders, M. (1995, May). Atomic decomposition by basis pursuit (Tech. Rep. No. 479). Stanford, CA: Department of Statistics, Stanford University.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 1–25.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219–269.
Girosi, F., Poggio, T., & Caprile, B. (1991). Extensions of a theory of networks for approximation and learning: Outliers and negative examples. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems 3. San Mateo, CA: Morgan Kaufmann.
Mukherjee, S., Osuna, E., & Girosi, F. (1997, September). Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, & E. Wilson (Eds.), IEEE Workshop on Neural Networks and Signal Processing, Amelia Island, FL.
Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., & Poggio, T. (1997, June). Pedestrian detection using wavelet templates. In Proc. Computer Vision and Pattern Recognition (pp. 193–199). Puerto Rico.
Penev, P. S., & Atick, J. J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computation in Neural Systems, 7, 477–500.
Poggio, T., & Girosi, F. (1998). Notes on PCA, regularization, support vector machines and sparsity (A.I. Memo). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Sirovich, L., & Kirby, M. (1987). A low dimensional procedure for identifying human faces. Journal of the Optical Society of America A, 4, 519–524.
Smola, A., & Schölkopf, B. (1998). From regularization operators to support vector kernels. In Advances in neural information processing systems 10. Cambridge, MA: MIT Press.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9 (pp. 281–287). Cambridge, MA: MIT Press.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Received September 11, 1997; accepted December 16, 1997.
LETTER
Communicated by Bruno Olshausen
An Equivalence Between Sparse Approximation and Support Vector Machines

Federico Girosi
Center for Biological and Computational Learning and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
This article shows a relationship between two different approximation techniques: the support vector machines (SVM), proposed by V. Vapnik (1995), and a sparse approximation scheme that resembles the basis pursuit denoising algorithm (Chen, 1995; Chen, Donoho, & Saunders, 1995). SVM is a technique that can be derived from the structural risk minimization principle (Vapnik, 1982) and can be used to estimate the parameters of several different approximation schemes, including radial basis functions, algebraic and trigonometric polynomials, B-splines, and some forms of multilayer perceptrons. Basis pursuit denoising is a sparse approximation technique in which a function is reconstructed by using a small number of basis functions chosen from a large set (the dictionary). We show that if the data are noiseless, the modified version of basis pursuit denoising proposed in this article is equivalent to SVM in the following sense: if applied to the same data set, the two techniques give the same solution, which is obtained by solving the same quadratic programming problem. In the appendix, we present a derivation of the SVM technique in the framework of regularization theory, rather than statistical learning theory, establishing a connection between SVM, sparse approximation, and regularization theory.

1 Introduction

There has been an increasing interest in approximation techniques that use the concept of sparsity to perform some form of model selection. By sparsity we mean, in very general terms, a constraint that forces the number of building blocks of the model to be small. Sparse approximation often appears in conjunction with the use of overcomplete or redundant representations, in which a signal is approximated as a linear superposition of basis functions taken from a large dictionary (Chen, 1995; Chen et al., 1995; Olshausen & Field, 1996; Daubechies, 1992; Mallat & Zhang, 1993; Coifman & Wickerhauser, 1992). In this case, sparsity is used as a criterion to choose between different approximating functions with the same reconstruction error, favoring the one with the least number of coefficients. The concept of sparsity has also been used in linear regression, as an alternative to subset
selection, in order to produce linear models that use a small number of variables and therefore have greater interpretability (Tibshirani, 1994; Breiman, 1993).

In this article, we discuss the relationship between an approximation technique based on the principle of sparsity and the support vector machines (SVM) technique recently proposed by Vapnik (Vapnik, 1995; Vapnik, Golowich, & Smola, 1997; Cortes & Vapnik, 1995; Boser, Guyon, & Vapnik, 1992). SVM is a classification and approximation technique derived by V. Vapnik in the framework of structural risk minimization, which aims at building parsimonious models, in the sense of VC dimension. Sparse approximation techniques are also parsimonious, in the sense that they try to minimize the number of parameters of the model, so it is not surprising that some connections between SVM and sparse approximation exist. What is more surprising, and less obvious, is that SVM and a specific model of sparse approximation, which is a modified version of the basis pursuit denoising algorithm (Chen, 1995; Chen et al., 1995), are actually equivalent in the case of noiseless data. By equivalent, we mean the following: if applied to the same data set, they give the same solution, which is obtained by solving the same quadratic programming problem.

The equivalence between sparse approximation and SVM for noiseless data is the main point of this article, but we also include a derivation of SVM that differs from the one given by V. Vapnik and fits very well in the framework of regularization theory, the same framework used to derive techniques like splines or radial basis functions.

In section 2 of this article, we introduce the technique of SVM in the framework of regularization theory (the mathematical details are set out in appendix B). Section 3 introduces the notion of sparsity and presents an exact and an approximate formulation of the problem. In section 4 we present a sparse approximation model, which is similar in spirit to the basis pursuit denoising technique of Chen et al. (1995), and show that, in the case of noiseless data, it is equivalent to SVM. Section 5 concludes the article and contains a series of remarks and observations. Appendix A contains some background material on reproducing kernel Hilbert spaces, which are heavily used in this article. Appendix B contains an explicit derivation of the SVM technique in the framework of regularization theory, and appendix C addresses the case in which the data are noisy.
gence in probability developed by Vapnik (1982, 1995). A similar approach is described in Smola and Schölkopf (1998), although with a different formalism. In this section and in the following ones, we will need some basic notions about reproducing kernel Hilbert spaces (RKHS). For simplicity of exposition we put all the technical material about RKHS in appendix A. Since the RKHS theory is very well developed, we do not include many important mathematical technicalities (like the convergence of certain series, or the issue of semi-RKHS), because the goal here is just to set out a basic explanation of an existing technique. The rigorous mathematical apparatus that we use can be found in Wahba (1990).

2.1 Support Vector Machines. The problem we want to solve is the following: we are given a data set $D = \{(x_i, y_i)\}_{i=1}^{l}$, obtained by sampling, with noise, some unknown function f(x), and we are asked to recover the function f, or an approximation of it, from the data D. We assume that the function f underlying the data can be represented as

$$f(x) = \sum_{n=1}^{\infty} c_n\phi_n(x) + b, \qquad (2.1)$$
where $\{\phi_n(x)\}_{n=1}^{\infty}$ is a set of given, linearly independent basis functions, and cn and b are parameters to be estimated from the data. Notice that if one of the basis functions φn is constant, the term b is not necessary. The problem of recovering the coefficients cn and b from the data set D is clearly ill posed, since it has an infinite number of solutions. In order to make this problem well posed, we follow the approach of regularization theory (Tikhonov & Arsenin, 1977; Morozov, 1984; Bertero, 1986; Wahba, 1975, 1990) and impose an additional smoothness constraint on the solution of the approximation problem. Therefore, we choose as a solution the function that solves the following variational problem:

$$\min_{f\in H} H[f] = C\sum_{i=1}^{l} V(y_i - f(x_i)) + \frac{1}{2}\Phi[f], \qquad (2.2)$$

where V(x) is some error cost function used to measure the interpolation error (for example, V(x) = x²), C is a positive number, Φ[f] is a smoothness functional, and H is the set of functions over which the smoothness functional Φ[f] is well defined. The first term enforces closeness to the data, and the second smoothness, while C controls the trade-off between these two terms. A large class of smoothness functionals, defined over elements of the form in equation 2.1, can be defined as follows:

$$\Phi[f] = \sum_{n=1}^{\infty}\frac{c_n^2}{\lambda_n}, \qquad (2.3)$$
where $\{\lambda_n\}_{n=1}^{\infty}$ is a decreasing, positive sequence. That equation 2.3 actually defines a smoothness functional can be seen in the following example.

Example. Let us consider a one-dimensional case in which x ∈ [0, 2π], and let us choose $\phi_n(x) = e^{inx}$, so that the cn are the Fourier coefficients of the function f. Since the sequence $\{\lambda_n\}_{n=1}^{\infty}$ is decreasing, the constraint that Φ[f] < ∞ is a constraint on the rate of convergence to zero of the Fourier coefficients cn, which, as is well known, can control the differentiability properties of f. Functions for which Φ[f] is small have limited high-frequency content and therefore do not oscillate much, so that Φ[f] is a measure of smoothness. More examples can be found in appendix A.

When the smoothness functional has the form of equation 2.3, it is easy to prove (see appendix B) that, independently of the form of the error function V, the solution of the variational problem (see equation 2.2) always has the form

$$f(x) = \sum_{i=1}^{l} a_i K(x, x_i) + b, \qquad (2.4)$$
where we have defined the (symmetric) kernel function K as

$$K(x; y) = \sum_{n=1}^{\infty}\lambda_n\phi_n(x)\phi_n(y). \qquad (2.5)$$
The kernel K can be seen as the kernel of an RKHS, a concept that will be used in section 4. Details about RKHS and examples of kernels can be found in appendix A and in Girosi (1997). If the cost function V is quadratic, the unknown coefficients in equation 2.4 can be found by solving a linear system. When the kernel K is a radially symmetric function, equation 2.4 describes a radial basis functions approximation scheme, which is closely related to smoothing splines, and when K is of the form K(x − y), equation 2.4 is a regularization network (Girosi, Jones, & Poggio, 1995). When the cost function V is no longer quadratic, the solution of the variational problem (see equation 2.2) still has the form of equation 2.4 (Smola & Schölkopf, 1998; Girosi, Poggio, & Caprile, 1991), but the coefficients ai can no longer be found by solving a linear system. V. Vapnik (1995) proposed using a particularly interesting form for the function V, which he calls the ε-insensitive cost function, which we plot in Figure 1:

$$V(x) = |x|_\varepsilon \equiv \begin{cases} 0 & \text{if } |x| < \varepsilon \\ |x| - \varepsilon & \text{otherwise.} \end{cases} \qquad (2.6)$$
Figure 1: Vapnik's ε-insensitive cost function V(x) = |x|_ε, plotted as V(x) against x; the cost is zero on the interval (−ε, ε) and grows linearly outside it.
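A one-line numpy rendering of this cost function (our illustration, not part of the article) makes the flat region explicit:

    import numpy as np

    # Sketch of the epsilon-insensitive cost of equation 2.6:
    # zero inside the tube |x| < epsilon, linear outside it.
    def eps_insensitive(x, eps):
        return np.maximum(np.abs(x) - eps, 0.0)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(eps_insensitive(x, eps=1.0))   # -> [1. 0. 0. 0. 1.]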
The ε-insensitive cost function is similar to some of the functions used in robust statistics (Huber, 1981), which are known to provide robustness against outliers. However, the function in equation 2.6 is not only a robust cost function but also assigns zero cost to errors smaller than ε. In other words, according to the cost function |x|_ε, any function that comes closer than ε to the data points is a perfect interpolant. In a sense, the parameter ε represents the resolution at which we want to look at the data. When the ε-insensitive cost function is used in conjunction with the variational approach of equation 2.2, one obtains the approximation scheme known as SVM, which has the form

$$f(x, \alpha, \alpha^*) = \sum_{i=1}^{l}(\alpha_i^* - \alpha_i)K(x; x_i) + b, \qquad (2.7)$$

where $\alpha_i^*$ and $\alpha_i$ are some positive coefficients that solve the following quadratic programming (QP) problem:

$$\min_{\alpha,\alpha^*} R(\alpha^*, \alpha) = \varepsilon\sum_{i=1}^{l}(\alpha_i^* + \alpha_i) - \sum_{i=1}^{l} y_i(\alpha_i^* - \alpha_i) + \frac{1}{2}\sum_{i,j=1}^{l}(\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j)K(x_i; x_j), \qquad (2.8)$$

subject to the constraints

$$0 \le \alpha^*, \alpha \le C, \qquad \sum_{i=1}^{l}(\alpha_i^* - \alpha_i) = 0, \qquad \alpha_i\alpha_i^* = 0 \quad \forall i = 1, \ldots, l. \qquad (2.9)$$
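This QP problem can be handed to any off-the-shelf quadratic programming solver. The sketch below is our own construction, assuming the cvxopt package (not mentioned in the article); it stacks the variables as u = (α*, α) and omits the constraint αiαi* = 0, which, as the text notes, is automatically satisfied by the solution.

    import numpy as np
    from cvxopt import matrix, solvers

    # Sketch: solve the dual problem of equations 2.8-2.9.
    def svm_regression_dual(K, y, eps, C):
        l = len(y)
        P = np.block([[K, -K], [-K, K]]) + 1e-8 * np.eye(2 * l)  # PSD term (tiny ridge added)
        q = np.concatenate([eps - y, eps + y])      # eps*sum(a*+a) - sum_i y_i (a*_i - a_i)
        G = np.vstack([-np.eye(2 * l), np.eye(2 * l)])
        h = np.concatenate([np.zeros(2 * l), C * np.ones(2 * l)])  # box 0 <= u <= C
        A = np.concatenate([np.ones(l), -np.ones(l)])[None, :]     # sum_i (a*_i - a_i) = 0
        solvers.options["show_progress"] = False
        sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h),
                         matrix(A), matrix(0.0))
        u = np.array(sol["x"]).ravel()
        return u[:l] - u[l:]                        # coefficients alpha*_i - alpha_i of eq. 2.7

    x = np.linspace(0, 1, 30)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.02)   # hypothetical Gaussian kernel
    coef = svm_regression_dual(K, np.sin(2 * np.pi * x), eps=0.1, C=10.0)
    print("support vectors:", int(np.sum(np.abs(coef) > 1e-6)), "of", len(x))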
Table 1: Some Possible Kernel Functions and the Type of Decision Surface They Define.

Kernel Function                                          Approximation Scheme
K(x; y) = exp(−‖x − y‖²)                                 Gaussian RBF
K(x; y) = (1 + x · y)^d                                  Polynomial of degree d
K(x; y) = tanh(x · y − θ)                                Multilayer perceptron (only for some values of θ)
K(x; y) = B_{2n+1}(x − y)                                B-splines
K(x; y) = sin((d + 1/2)(x − y)) / sin((x − y)/2)         Trigonometric polynomial of degree d

Note: The last two kernels are one-dimensional; multidimensional kernels can be built by tensor products of one-dimensional ones. The functions Bn are piecewise polynomials of degree n, whose exact definition can be found in Schumaker (1981).
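For concreteness, the first three kernels of Table 1 can be written down directly (our sketch; the choices of degree d and threshold θ are arbitrary):

    import numpy as np

    def gaussian(x, y):
        return np.exp(-np.sum((x - y) ** 2))       # Gaussian RBF

    def polynomial(x, y, d=3):
        return (1 + np.dot(x, y)) ** d             # polynomial of degree d

    def mlp(x, y, theta=1.0):
        return np.tanh(np.dot(x, y) - theta)       # MLP kernel; positive definite
                                                   # only for some values of theta
    x, y = np.array([0.2, 0.4]), np.array([0.1, 0.9])
    print(gaussian(x, y), polynomial(x, y), mlp(x, y))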
Notice that the parameter b does not appear in the QP problem, and we show in appendix B that it is determined from the knowledge of α and α*. It is important to notice that it is possible to prove that the last of the constraints above (αiαi* = 0) is automatically satisfied by the solution, and it could be dropped from the formulation. We include this constraint because it will be useful in section 4. Due to the nature of this quadratic programming problem, only a number of coefficients αi* − αi will be different from zero, and the input data points xi associated with them are called support vectors. The number of support vectors depends on both C and ε. The parameter C weighs the data term in functional 2.2 with respect to the smoothness term, and in regularization theory it is known to be related to the amount of noise in the data. If there is no noise in the data, the optimal value for C is infinity, which forces the data term to be zero. In this case, SVM will find, among all the functions that have interpolation errors smaller than ε, the one that minimizes the smoothness functional Φ[f]. The parameters C and ε are two free parameters of the theory, and their choice is left to the user, as is the choice of the kernel K, which determines the smoothness properties of the solution and should reflect prior knowledge of the data. For certain choices of K, some well-known approximation schemes are recovered, as shown in Table 1. (See Vapnik, 1995, for more details about SVM and for the original derivation of the technique.)

3 Sparse Approximation

There has been a growing interest in approximating functions using linear superpositions of basis functions selected from a large, redundant set of basis functions, called a dictionary. It is not our purpose to discuss the motivations that lead to this approach; we refer readers to Chen (1995), Chen et al.
(1995), Olshausen and Field (1996), Harpur and Prager (1996), Daubechies (1992), Mallat and Zhang (1993), and Coifman and Wickerhauser (1992) for further details. A common aspect of these techniques is that one seeks an approximating function of the form

$$f(x; a) = \sum_{i=1}^{n} a_i\varphi_i(x), \qquad (3.1)$$
where $\varphi \equiv \{\varphi_i(x)\}_{i=1}^{n}$ is a fixed set of basis functions that we will call the dictionary. If n is very large (possibly infinite) and ϕ is not an orthonormal basis (for example, it could be a frame or just a redundant, finite set of basis functions), it is possible that many different sets of coefficients will achieve the same error on a given data set. Among all the approximating functions that achieve the same error, a sparse approximation scheme looks for the one with the smallest number of nonzero coefficients. The sparsity of an approximation scheme can also be invoked whenever the number of basis functions initially available is considered too large, for whatever reason (this situation arises often when radial basis functions are applied to a very large data set). More formally, we say that an approximating function of the form of equation 3.1 is sparse if the coefficients have been chosen so that they minimize the following cost function:

$$E[a, \xi] = \left\|f(x) - \sum_{i=1}^{n}\xi_i a_i\varphi_i(x)\right\|_{L_2}^2 + \lambda\left(\sum_{i=1}^{n}\xi_i\right)^p, \qquad (3.2)$$
where $\{\xi_i\}_{i=1}^{n}$ is a set of binary variables with values in {0, 1}, ‖·‖_{L_2} is the usual L2 norm, and p is a positive number that we set to one unless otherwise stated. It is clear that since the L0 norm of a vector counts the number of elements of that vector different from zero, the cost function above can be replaced by the cost function

$$E[a] = \left\|f(x) - \sum_{i=1}^{n} a_i\varphi_i(x)\right\|_{L_2}^2 + \lambda\|a\|_{L_0}^p. \qquad (3.3)$$
Table 2: Some Choices for the Penalty Function S in Equation 3.4 Considered by Olshausen and Field (1996).

S(x)
|x|
−exp(−x²)
log(1 + x²)
In related work, Olshausen and Field (1996) enforce sparsity by considering the following cost function,

$$E[a] = \left\|f(x) - \sum_{i=1}^{n} a_i\varphi_i(x)\right\|_{L_2}^2 + \varepsilon\sum_{j=1}^{n} S(a_j), \qquad (3.4)$$
where the function S was chosen in such a way as to approximately penalize the number of nonzero coefficients. Examples of some of the choices considered by Olshausen and Field (1996) are reported in Table 2. In the case in which S(x) = |x|, that is, the basis pursuit denoising case, it is simple to see how the cost function (see equation 3.4) is an approximated version of the one in equation 3.2. In order to see this, let us allow the variables ξi to assume values in {−1, 0, 1} so that equation 3.2 can be rewritten as

$$E[a, \xi] = \left\|f(x) - \sum_{i=1}^{n}\xi_i a_i\varphi_i(x)\right\|_{L_2}^2 + \lambda\sum_{i=1}^{n}|\xi_i|. \qquad (3.5)$$
Letting the variables ξi be real-valued and assuming that the coefficients ai are bounded by one, it is clear that the coefficients ai are redundant and can be dropped from the cost function. Renaming the variables ξi as ai, we then have that the approximated cost function is

$$E[a] = \left\|f(x) - \sum_{i=1}^{n} a_i\varphi_i(x)\right\|_{L_2}^2 + \lambda\|a\|_{L_1}, \qquad (3.6)$$

which is the one proposed in the basis pursuit denoising method of Chen et al. (1995).
4 An Equivalence Between Support Vector Machines and Sparse Coding

The approximation scheme proposed by Chen et al. (1995) has the form described by equation 3.1, where the coefficients are found by minimizing the cost function (see equation 3.6). We now make the following choice for the basis functions ϕi:

$$\varphi_i(x) = K(x; x_i) \quad \forall i = 1, \ldots, l,$$
where K(x; y) is the reproducing kernel of an RKHS H (see appendix A) and $\{(x_i, y_i)\}_{i=1}^{l}$ is a data set that has been obtained by sampling, in the absence of noise, the target function f. We make the explicit assumption that the target function f belongs to the RKHS H. Readers unfamiliar with RKHS can think of H as a space of smooth functions, for example, functions that are square integrable and whose derivatives up to a certain order are also square integrable. The norm $\|f\|_H^2$ in this Hilbert space can be thought of as a linear combination of the L2 norm of the function and the L2 norm of its derivatives (the specific degree of smoothness and the linear combination depend on the specific kernel K). It follows from equation 3.1 that our approximating function is

$$f^*(x) \equiv f(x; a) = \sum_{i=1}^{l} a_i K(x; x_i). \qquad (4.1)$$
This model is similar to that of SVM (see equation 2.7) except for the constant b, and if K(x; y) = G(‖x − y‖), where G is a positive definite function, it corresponds to a classical radial basis functions approximation scheme (Micchelli, 1986; Moody & Darken, 1989; Powell, 1992). While Chen et al., in their basis pursuit denoising method, measure the reconstruction error with an L2 criterion, we measure it by the true distance, in the H norm, between the target function f and the approximating function f*. This measure of distance, common in approximation theory, is better motivated than the L2 norm; it enforces closeness not only between the target and the model but also between their derivatives, since ‖·‖_H is a measure of smoothness. We therefore look for the set of coefficients a that minimize the following cost function,

$$E[a] = \frac{1}{2}\left\|f(x) - \sum_{i=1}^{l} a_i K(x; x_i)\right\|_{H}^2 + \varepsilon\|a\|_{L_1}, \qquad (4.2)$$
where ‖·‖_H is the standard norm in H. We consider this to be a modified version of the basis pursuit denoising technique of Chen (1995) and Chen et al. (1995). Notice that it appears from equation 4.2 that the cost function E cannot be computed, because it requires knowledge of f (in the first term). This would be true if we had ‖·‖_{L_2} instead of ‖·‖_H in equation 4.2, and it would force us to consider the approximation

$$\|f(x) - f^*(x)\|_{L_2}^2 \approx \frac{1}{l}\sum_{i=1}^{l}(y_i - f^*(x_i))^2. \qquad (4.3)$$
However, because we used the norm ‖·‖_H, we will see that (surprisingly) no approximation is required, and equation 4.2 can be computed exactly, up to a constant (which is obviously irrelevant for the minimization process).
For simplicity we assume that the target function f has zero mean in H, which means that its projection on the constant function g(x) = 1 is zero: ⟨f, 1⟩_H = 0. Notice that we are not assuming that the function g(x) = 1 belongs to H, but simply that the functions that we consider, including the reproducing kernel K, have a finite projection on it. In particular, we normalize K in such a way that ⟨1, K(x; y)⟩_H = 1. We impose one additional constraint on this problem: we want to guarantee that the approximating function f* also has zero mean in H:

$$\langle f^*, 1\rangle_H = 0. \qquad (4.4)$$

Substituting equation 4.1 in equation 4.4, and using the fact that K has mean equal to 1, we see that this constraint implies that

$$\sum_{i=1}^{l} a_i = 0. \qquad (4.5)$$
We can now expand the cost function E of equation 4.2 as

$$E[a] = \frac{1}{2}\|f\|_H^2 - \sum_{i=1}^{l} a_i\langle f(x), K(x; x_i)\rangle_H + \frac{1}{2}\sum_{i,j=1}^{l} a_i a_j\langle K(x; x_i), K(x; x_j)\rangle_H + \varepsilon\sum_{i=1}^{l}|a_i|.$$
Using the reproducing property of the kernel K we have:

$$\langle f(x), K(x; x_i)\rangle_H = f(x_i) \equiv y_i \qquad (4.6)$$

$$\langle K(x; x_i), K(x; x_j)\rangle_H = K(x_i; x_j). \qquad (4.7)$$
Notice that in equation 4.6, we explicitly used the assumption that the data are noiseless, so that we know the value yi of the target function f at the data points xi. We can rewrite the cost function as

$$E[a] = \frac{1}{2}\|f\|_H^2 - \sum_{i=1}^{l} a_i y_i + \frac{1}{2}\sum_{i,j=1}^{l} a_i a_j K(x_i; x_j) + \varepsilon\sum_{i=1}^{l}|a_i|. \qquad (4.8)$$
We now notice that the L1 norm of a (the term with the absolute value in the previous equation) can be rewritten more easily by decomposing the vector a into its "positive" and "negative" parts, as follows:

$$a = a^+ - a^-, \qquad a^+, a^- \ge 0, \qquad a_i^+ a_i^- = 0 \quad \forall i = 1, \ldots, l.$$
Using this decomposition, we have

$$\|a\|_{L_1} = \sum_{i=1}^{l}(a_i^+ + a_i^-). \qquad (4.9)$$
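This standard decomposition is what turns the nondifferentiable L1 term into a linear expression a QP solver can handle; a tiny numpy check (our illustration) makes the identities explicit:

    import numpy as np

    a = np.array([1.5, -0.3, 0.0, 2.0])
    a_plus, a_minus = np.maximum(a, 0), np.maximum(-a, 0)
    assert np.all(a_plus * a_minus == 0)             # complementarity a_i+ a_i- = 0
    assert np.allclose(a_plus - a_minus, a)          # recovers a
    assert np.allclose(a_plus + a_minus, np.abs(a))  # equation 4.9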
Disregarding the constant term in $\|f\|_H^2$ and taking into account the constraint in equation 4.5, we conclude that the minimization problem we are trying to solve is equivalent to the following quadratic programming (QP) minimization problem:

Problem 1. Solve:

$$\min_{a_i^+, a_i^-}\left[-\sum_{i=1}^{l}(a_i^+ - a_i^-)y_i + \varepsilon\sum_{i=1}^{l}(a_i^+ + a_i^-) + \frac{1}{2}\sum_{i,j=1}^{l}(a_i^+ - a_i^-)(a_j^+ - a_j^-)K(x_i; x_j)\right] \qquad (4.10)$$

subject to the constraints:

$$a^+, a^- \ge 0, \qquad \sum_{i=1}^{l}(a_i^+ - a_i^-) = 0, \qquad a_i^+ a_i^- = 0 \quad \forall i = 1, \ldots, l. \qquad (4.11)$$
If we now rename the coefficients as follows:

$$a_i^+ \Rightarrow \alpha_i^*, \qquad a_i^- \Rightarrow \alpha_i,$$
we notice that the QP problem defined by equations 4.10 and 4.11 is the same QP problem that we need to solve for training an SVM with kernel K (see equations 2.8 and 2.9) in the case in which the data are noiseless. In fact, as we argued in section 2.1, the parameter C of an SVM should be set to infinity when the data are noiseless. Since problem 1 is the same QP problem as that of SVM, we can use the fact that the constraint αiαi* = 0 is automatically satisfied by the SVM solution (see appendix B) to infer that the constraint ai+ai− = 0 is also automatically satisfied in problem 1, so that it does not have to be included in the QP problem. Notice also that the constant term b, which appears in equation 2.7, does not appear in our solution. We argue in appendix B that for most commonly used kernels K, this term is not needed because it is already implicitly included in the model. We can now make the following statement:
Statement. When the data are noiseless, the modified version of basis pursuit denoising of equation 4.2, with the additional constraint of equation 4.4, gives the same solution as SVM, and the solution is obtained by solving the same QP problem as SVM.

As expected, the solution of the basis pursuit denoising is such that only a subset of the data points in equation 4.1 has nonzero coefficients, the so-called support vectors. The number of support vectors, that is, the degree of sparsity, is controlled by the parameter ε, which is the only free parameter of this theory.

5 Conclusions and Remarks

We have shown that in the case of noiseless data, SVM can be derived without using any result from VC theory, simply by enforcing a sparsity constraint in an approximation scheme of the form

$$f(x; a) = \sum_{i=1}^{l} a_i K(x; x_i),$$
together with the constraint that, under the assumption that the target function has zero mean, the approximating function should also have zero mean. This makes a connection between a technique such as SVM, derived in the framework of structural risk minimization, and basis pursuit denoising, a technique that was proposed starting from the principle of sparsity. Some observations are in order:

• This result shows that SVM provides an interesting solution to an old problem: the choice of the centers for radial basis functions. If the number of data points is very large, we do not want to place one basis function at every data point, but rather at a (small) number of other locations, called centers. The choice of the centers is often done by randomly choosing a subset of the data points. SVM provides a subset of the data points (the support vectors) that is optimal in the sense of the trade-off between interpolation error and number of basis functions (measured in the L1 norm). SVM can therefore be seen as a "sparse" radial basis function technique in the case in which the kernel is radially symmetric.

• This result is an additional reason to consider sparsity as an interesting constraint. In fact, we have shown that under certain conditions, sparsity leads to SVM, which is related to the structural risk minimization principle and is extremely well motivated in the theory of uniform convergence in probability.

• The result holds because in both this and Vapnik's formulation, the cost function contains both an L2-type and an L1-type norm. However,
the support vector method has an L1-type norm in the error term and an L2 norm in the regularization term, while the cost function (see equation 4.2) we consider has an L2-type norm in the error term and an L1 norm in the regularization term.

• This result holds due to the existence of the reproducing property of the RKHS. If the norm ‖·‖_H were replaced by the standard L2 norm, the cost function would contain the scalar product in L2 between the unknown function f and the kernel K(x; xi), and the cost function could not be computed. If we replace the RKHS norm with the training error on a data set $\{(x_i, y_i)\}_{i=1}^{l}$ (as in basis pursuit denoising), the cost function could be computed, but it would lead to a different QP problem. Notice that the cost function contains the actual distance between the approximating and the unknown function, which is exactly the quantity that we want to minimize.

• This article also provides a derivation of the SVM algorithm in the framework of regularization theory (see appendix B). This formulation is particularly simple to state, and it is easily related to other well-known techniques, such as smoothing splines and radial basis functions. The disadvantage is that it hides the connection between SVM and the theory of VC bounds and does not make clear what induction principle is being used. When the output of the target function is restricted to be 1 or −1, that is, when we consider a classification problem, Vapnik shows that SVM minimize an upper bound on the generalization error rather than minimizing the training error within a fixed architecture. Although this is rigorously proved only in the classification case, this is a very important property, which makes SVM extremely well founded from the mathematical point of view. This motivation, however, is missing when the regularization theory approach is used to derive SVM.

• The equivalence between SVM and sparsity has been shown only in the case of noiseless data. In order to maintain the equivalence in the case of noisy data, one should prove that the presence of noise in the problem (see equation 4.2) leads to the additional constraint α*, α ≤ C as in SVM, where C is some parameter inversely related to the amount of noise. In appendix C we sketch a tentative solution to this problem. This solution, however, is not very satisfactory, because it is purely formal and does not explain what assumptions are made on the noise in order to maintain the equivalence.

Appendix A: Reproducing Kernel Hilbert Spaces

In this article, an RKHS (Aronszajn, 1950) is defined as a Hilbert space of functions defined over some domain Ω ⊂ R^d with the property that, for
each x ∈ Ω, the evaluation functionals F_x defined as

$$F_x[f] = f(x) \quad \forall f \in H$$

are linear, bounded functionals. It can be proved that we can associate to every RKHS H a positive definite function K(x, y), which is called the reproducing kernel of H. The kernel of H has the following reproducing property:

$$f(x) = \langle f(y), K(y; x)\rangle_H \quad \forall f \in H, \qquad (A.1)$$

where ⟨·, ·⟩_H denotes the scalar product in H. The function K acts in a similar way to the delta function in L2, although L2 is not an RKHS (its elements are not necessarily defined pointwise). Here we sketch a way to construct an RKHS, which is relevant to our article. The mathematical details (such as the convergence of certain series) can be found in the theory of integral equations (Hochstadt, 1973; Cochran, 1972; Courant & Hilbert, 1962), which is very well established, so we do not discuss them here.

In the following, we assume that Ω = [0, 1]^d for simplicity. The main ideas will carry over to the case Ω = R^d, although with some modifications, as we will see in section A.2. Let us assume that we find a sequence of positive numbers λn and linearly independent functions φn(x) such that they define a function K(x; y) in the following way,¹

$$K(x; y) \equiv \sum_{n=1}^{\infty}\lambda_n\phi_n(x)\phi_n(y), \qquad (A.2)$$
where the series is well defined (for example, it converges uniformly). A simple calculation shows that the function K defined in equation A.2 is positive semidefinite. Let us now take as Hilbert space the set of functions of the form

$$f(x) = \sum_{n=1}^{\infty} c_n\phi_n(x), \qquad (A.3)$$
in which the scalar product is defined as

$$\left\langle \sum_{n=1}^{\infty} c_n\phi_n(x),\ \sum_{n=1}^{\infty} d_n\phi_n(x)\right\rangle_H \equiv \sum_{n=1}^{\infty}\frac{c_n d_n}{\lambda_n}. \qquad (A.4)$$
¹ When working with complex functions φn(x), this formula should be replaced with $K(x; y) \equiv \sum_{n=1}^{\infty}\lambda_n\phi_n(x)\phi_n^*(y)$.
Assuming that all the evaluation functionals are bounded, it is now easy to check that such a Hilbert space is an RKHS with reproducing kernel given by K(x; y). In fact, we have

$$\langle f(x), K(x; y)\rangle_H = \sum_{n=1}^{\infty}\frac{c_n\lambda_n\phi_n(y)}{\lambda_n} = \sum_{n=1}^{\infty} c_n\phi_n(y) = f(y).$$

We conclude that it is possible to construct an RKHS whenever a function K of the form in equation A.2 is available. The norm in this RKHS has the form

$$\|f\|_H^2 = \sum_{n=1}^{\infty}\frac{c_n^2}{\lambda_n}. \qquad (A.5)$$
It is well known that expressions of the form in equation A.2 actually abound. In fact, it follows from Mercer's theorem (Hochstadt, 1973) that any function K(x; y) that is the kernel of a positive operator² in L2(Ω) has an expansion of the form in equation A.2, in which the φi and the λi are, respectively, the orthogonal eigenfunctions and the positive eigenvalues of the operator corresponding to K. Stewart (1976) reports that the positivity of the operator associated with K is equivalent to the statement that the kernel K is positive definite, that is, the matrix Kij = K(xi; xj) is positive definite for all choices of distinct points xi. Notice that a kernel K could have an expansion of the form in equation A.2 in which the φn are not necessarily its eigenfunctions. The case in which Ω = R^d is similar, with the difference that the eigenvalues may assume any positive value, so that there will be a noncountable set of orthogonal eigenfunctions. In the following section, we provide a number of examples of these different situations, which also show why the norm $\|f\|_H^2$ can be seen as a smoothness functional.

A.1 Examples: RKHS over [0, 2π]. Here we present a simple way to construct meaningful RKHS of functions of one variable over [0, 2π]. In the following, all the normalization factors are set to 1 for simplicity. Let us consider any function K(x) that is continuous, symmetric, and periodic, and whose Fourier coefficients λn are positive. Such a function can be expanded in a uniformly convergent Fourier series:

$$K(x) = \sum_{n=0}^{\infty}\lambda_n\cos(nx). \qquad (A.6)$$

² Positive operators in L2 are self-adjoint operators such that ⟨Kf, f⟩ ≥ 0 for all f ∈ L2.
An example of such a function is

$$K(x) = 1 + \sum_{n=1}^{\infty} h^n\cos(nx) = \frac{1}{2\pi}\,\frac{1 - h^2}{1 - 2h\cos(x) + h^2},$$
where h ∈ (0, 1). It is easy to check that if equation A.6 holds, then we have

$$K(x - y) = 1 + \sum_{n=1}^{\infty}\lambda_n\sin(nx)\sin(ny) + \sum_{n=1}^{\infty}\lambda_n\cos(nx)\cos(ny), \qquad (A.7)$$
which is of the form of equation A.2, in which the set of orthogonal functions φn has the form $\{\phi_i(x)\}_{i=0}^{\infty} \equiv (1, \sin(x), \cos(x), \sin(2x), \cos(2x), \ldots, \sin(nx), \cos(nx), \ldots)$. Therefore, given any function K that is continuous, periodic, and symmetric, we can then define an RKHS H over [0, 2π] by defining a scalar product of the form

$$\langle f, g\rangle_H \equiv \sum_{n=0}^{\infty}\frac{f_n^c g_n^c + f_n^s g_n^s}{\lambda_n},$$
where we use the following symbols for the Fourier coefficients of a function f:

$$f_n^c \equiv \langle f, \cos(nx)\rangle, \qquad f_n^s \equiv \langle f, \sin(nx)\rangle.$$
The functions in H are therefore functions in L2([0, 2π]) whose Fourier coefficients satisfy the following constraint:

$$\|f\|_H^2 = \sum_{n=0}^{\infty}\frac{(f_n^c)^2 + (f_n^s)^2}{\lambda_n} < +\infty. \qquad (A.8)$$
Since the sequence λn is decreasing, the constraint that the norm in equation A.8 be finite can be seen as a constraint on the rate of decrease to zero of the Fourier coefficients of the function f, which is known to be related to the smoothness properties of f. Therefore, choosing different kernels K is equivalent to choosing RKHS of functions with different smoothness properties, and the norm can be used as the smoothness functional Φ[f] in the regularization approach sketched in section 2. The relationship between the kernel K and the smoothness properties of the functions in the corresponding RKHS will become clearer in the next section, where we discuss the extension of this approach to the infinite domain Ω = R^d.
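A small numerical check (our illustration, with the hypothetical choice λn = h^n from the example above) shows how the norm A.8 penalizes high-frequency content:

    import numpy as np

    # Sketch: RKHS norm A.8 from Fourier coefficients (normalization set to 1).
    def rkhs_norm_sq(f_vals, h=0.5, n_max=32):
        x = np.linspace(0, 2 * np.pi, len(f_vals), endpoint=False)
        norm_sq = 0.0
        for n in range(n_max):
            lam = h ** n                          # decreasing eigenvalues lambda_n
            fc = np.mean(f_vals * np.cos(n * x))  # Fourier coefficients f_n^c, f_n^s
            fs = np.mean(f_vals * np.sin(n * x))
            norm_sq += (fc ** 2 + fs ** 2) / lam
        return norm_sq

    x = np.linspace(0, 2 * np.pi, 256, endpoint=False)
    # cos(8x) is "rougher" than cos(x) and pays a far larger norm:
    print(rkhs_norm_sq(np.cos(x)) < rkhs_norm_sq(np.cos(8 * x)))   # True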
A.2 Examples: RKHS over R^d. When the domain Ω over which we wish to define an RKHS becomes the whole space R^d, most of the results of the previous section still apply, with the difference that the spectrum of K becomes (usually) the whole positive axis, and it is no longer countable. For translation-invariant kernels, that is, positive definite functions of the form K(x − y), the following decomposition holds:

$$K(x - y) = \int_{R^d} ds\,\tilde{K}(s)\, e^{is\cdot x} e^{-is\cdot y}. \qquad (A.9)$$
Equation A.9 is the analog of equation A.2 over an infinite domain, and one can go from the case of bounded Ω to the case of Ω = R^d by the following substitutions:

$$n \Rightarrow s, \qquad \lambda_n \Rightarrow \tilde{K}(s), \qquad \phi_n(x) \Rightarrow e^{is\cdot x}, \qquad \sum_{n=1}^{\infty} \Rightarrow \int_{R^d} ds.$$
We conclude, then, that any positive definite function of the form K(x − y) defines an RKHS over R^d by defining a scalar product of the form

$$\langle f, g\rangle_H \equiv \int ds\,\frac{\tilde{f}(s)\,\tilde{g}^*(s)}{\tilde{K}(s)}. \qquad (A.10)$$
The reproducing property of K is easily verified,

$$\langle f(x), K(x - y)\rangle = \int ds\,\frac{\tilde{f}(s)\tilde{K}(s)\, e^{-iy\cdot s}}{\tilde{K}(s)} = f(y),$$

and the RKHS becomes simply the subspace of L2(R^d) of the functions such that

$$\|f\|_H^2 = \int ds\,\frac{|\tilde{f}(s)|^2}{\tilde{K}(s)} < +\infty. \qquad (A.11)$$
Functionals of the form in equation A.11 are known to be smoothness functionals. In fact, the rate of decrease to zero of the Fourier transform of the kernel will control the smoothness properties of the functions in the RKHS. Consider, for example, in one dimension, the kernel K(x) = e^{−|x|}, whose Fourier transform is $\tilde{K}(s) = (1 + s^2)^{-1}$. The RKHS associated with this kernel contains functions such that

$$\|f\|_H^2 = \int ds\,\frac{|\tilde{f}(s)|^2}{(1 + s^2)^{-1}} = \|f\|_{L_2}^2 + \|f'\|_{L_2}^2 < \infty.$$
This is the well-known Sobolev space W₂¹, where we denote by W₂^m the set of functions whose derivatives up to order m are in L2 (Yosida, 1974). Notice that the norm induced by the scalar product (see equation A.10) is the smoothness functional considered by Girosi et al. (1995) in their approach to regularization theory for function approximation. This is not surprising, since RKHS have been known to play a central role in spline theory (Wahba, 1990). Notice also that in spline theory, one actually deals with semi-RKHS, in which the norm ‖·‖_H has been replaced with a seminorm. Semi-RKHS share most of the properties of RKHS, but their theory becomes a little more complicated because of the null space of the seminorm, which has to be taken into account. Details about semi-RKHS can be found in Wahba (1990).

Appendix B: Derivation of the SVM Algorithm
B.1 Generalities on Regularization Theory. Let us look more closely at the solution of the variational problem in equation 2.2:

$$\min_{f\in H} H[f] = C\sum_{i=1}^{l} V(y_i - f(x_i)) + \frac{1}{2}\Phi[f].$$
We assume that H is an RKHS with kernel K and that the smoothness functional Φ[f] is Φ[f] = ‖f‖²_H. This is equivalent to assuming that the functions in H have a unique expansion of the form

$$f(x) = \sum_{n=1}^{\infty} c_n\phi_n(x)$$

and that their norm is

$$\|f\|_H^2 = \sum_{n=1}^{\infty}\frac{c_n^2}{\lambda_n}.$$
In this derivation we do not have the coefficient b that appears in equation 2.1, since we argued before that if one of the φi is constant, which is usually the case, this term is not necessary. We can think of the functional H[f] as a function of the coefficients cn. In order to minimize H[f], we take its derivative with respect to cn and set it equal to zero, obtaining the following:

$$-C\sum_{i=1}^{l} V'(y_i - f(x_i))\,\phi_n(x_i) + \frac{c_n}{\lambda_n} = 0. \qquad (B.1)$$
Let us now define the following set of unknowns:

$$a_i \equiv C\,V'(y_i - f(x_i)).$$

Using equation B.1, we can express the coefficients cn as a function of the ai:

$$c_n = \lambda_n\sum_{i=1}^{l} a_i\phi_n(x_i).$$
The solution of the variational problem therefore has the form

$$f(x) = \sum_{n=1}^{\infty} c_n\phi_n(x) = \sum_{n=1}^{\infty}\sum_{i=1}^{l} a_i\lambda_n\phi_n(x_i)\phi_n(x) = \sum_{i=1}^{l} a_i K(x; x_i), \qquad (B.2)$$
where we have used the expansion in equation A.2. This shows that, independent of the form of V, the solution of the regularization functional (see equation 2.2) is always a linear superposition of kernel functions, one for each data point. The cost function V affects the computation of the coefficients ai. In fact, plugging equation B.2 back into the definition of the ai, we obtain the following set of equations for the coefficients ai:

$$a_i = C\,V'\!\left(y_i - \sum_{j=1}^{l} K_{ij}a_j\right), \qquad i = 1, \ldots, l,$$

where we have defined Kij = K(xi; xj). In the case in which V(x) = x², we obtain the standard regularization theory solution (see Girosi et al., 1995, for an alternative derivation):

$$(K + \gamma I)\,a = y,$$

where we have defined γ ≡ 1/C.
B.2 The SVM Algorithm in the Regularization Theory Framework. Following Vapnik (1995) we now consider the case of the ²-insensitive cost function V(x) = |x|² . In this case, the approach sketched above is problematic because V is not differentiable at x = ² (although it still makes sense everywhere else). In order to make our notation consistent with Vapnik’s, we have to modify slightly the model proposed in the previous section. Vapnik explicitly takes into account an offset in the model, so that equation 2.1 is replaced by f (x) =
∞ X
cn φn (x) + b.
(B.3)
n=1
The smoothness functional remains unchanged (so that the smoothness does not depend on b): 8[ f ] =
∞ 2 X cn . λ n=1 n
1474
Federico Girosi
Also, we scale the functional in equation 2.2 of a factor the following variational problem: min H[ f ] = C f ∈H
l X i=1
1 2λ
≡ C, obtaining
1 |yi − f (xi )|² + 8[ f ]. 2
Since it is difficult to deal with the function V(x) = |x|² , the problem above is replaced by the following equivalent3 problem, in which an additional set of variables is introduced: min H[ f ] = C f ∈H
l X 1 (ξi + ξi∗ ) + 8[ f ], 2 i=1
(B.4)
subject to f (xi ) − yi yi − f (xi ) ξi ξi∗
≤ ≤ ≥ ≥
² + ξi ² + ξi∗ 0 0
i = 1, . . . , l i = 1, . . . , l i = 1, . . . , l i = 1, . . . , l.
(B.5)
The equivalence of the variational problem is established by noticing that in the problem above, a (linear) penalty is paid only when the absolute value of the interpolation error exceeds ², which correspond to Vapnik’s ²-insensitive cost function. Notice that when one of the two top constraints is satisfied with some nonzero ξi (or ξi∗ ), the other is automatically satisfied with a zero value for ξi∗ (or ξi ). In order to solve the constrained minimization problem above, we use the technique of Lagrange multipliers. The Lagrangian corresponding to the problem above is:
L( f, ξ , ξ ∗ ; α, α∗ , r, r∗ ) = C +
l X
1 (ξi + ξi∗ ) + 8[ f ] 2 i=1
l X
αi∗ (yi − f (xi ) − ² − ξi∗ )
i=1
+
l X
αi ( f (xi ) − yi − ² − ξi )
i=1
−
l X (ri ξi + r∗i ξi∗ ),
(B.6)
i=1
where α, α∗ , r, r∗ are positive Lagrange multipliers. The solution of the constrained variational problem above is now obtained by minimizing the 3 By equivalent, we mean that the function that minimizes the two functionals is the same.
Sparse Approximation and Support Vector Machines
1475
Lagrangian (in equation B.6) with respect to f (that is, with respect to the cn and to b), ξ , and ξ ∗ and maximizing (in the positive quadrant) with respect to α, α∗ , r, r∗ . Since the minimization step is now unconstrained, we set to zero the derivatives with respect to cn , b, ξ , and ξ ∗ , obtaining: l X ∂L = 0 ⇒ cn = λn (αi∗ − αi )φn (xi ) ∂cn i=1 l X ∂L =0⇒ (αi∗ − αi ) = 0 ∂b i=1
∂L = 0 ⇒ rn = C − αn ∂ξn ∂L = 0 ⇒ r∗n = C − αn∗ . ∂ξn∗ Substituting the expression for the coefficients cn in the model (see equation B.3), we then conclude that the solution of the problem (see equation B.4) is a function of the form f (x) =
l X
(αi∗ − αi )K(x; xi ) + b.
(B.7)
i=1
Substituting equation B.7 in the Lagrangian, we obtain an expression that should now be maximized (in the positive quadrant) with respect to α, α∗ , r, r∗ , with the additional constraints listed above. Noticing that the relationship between rn (r∗n ) and αn (αn∗ ) implies that α ≤ C and α∗ ≤ C, and minimizing −L rather than maximizing L, we now obtain the following QP problem: Problem B1. min L(α, α∗ ) = ²
α,α∗
l l X X (αi∗ + αi ) − yi (αi∗ − αi ) i=1
i=1
l 1X (α ∗ − αi )(αj∗ − αj )K(xi ; xj ), + 2 i,j=1 i
subject to the constraints 0 ≤ α∗ , α ≤ C l X (αi∗ − αi ) = 0. i=1
1476
Federico Girosi
This is the QP problem that has to be solved in order to compute the SVM solution. It is useful to write and discuss the Kuhn-Tucker conditions: αi ( f (xi ) − yi − ² − ξi ) = 0 i = 1, . . . , l αi∗ (yi − f (xi ) − ² − ξi∗ ) = 0 i = 1, . . . , l (C − αi )ξi = 0 i = 1, . . . , l (C − αi∗ )ξi∗ = 0 i = 1, . . . , l. The input data points xi for which αi or αi∗ are different from zero are called support vectors. A few observations are in order: • The Lagrange multipliers αi and αi∗ cannot be simultaneously different from zero, so the constraint αi αi∗ = 0 holds. • The support vectors are those data points xi at which the interpolation error is either greater than or equal to ². Points at which the interpolation error is smaller than ² are never support vectors and do not enter in the determination of the solution. Once they have been found, they could be removed from the data set, and if the SVM were run again on the new data set, the same solution would be found. • Any of the support vectors for which 0 < αi < C (and therefore ξi = 0) can be used to compute the parameter b. In fact, in this case, it follows from the Kuhn-Tucker conditions that f (xi ) =
l X
αj K(xi ; xj ) + b = yi + ²
j=1
(a similar argument holds for the αi∗ ). • If ² = 0, then all the points become support vectors. • Because of the constraint αi αi∗ = 0, defining a = α∗ − α and using equation 4.9 the QP problem can be written as follows: Problem B2. 1 min E∗ [a] = ²kakL1 − a · y + a · Ka a 2
Sparse Approximation and Support Vector Machines
1477
subject to the constraints −C ≤ ai ≤ C a·1=0 It is important to notice that if one the basis functions φi is constant, then the parameter b in equation B.3 could be omitted. The RKHS described in appendix A all have this property. Appendix C
Noisy Case: An Equivalence?
It is natural to ask whether the result of this article can be extended to the case of noisy data. I will sketch here an argument to show that there remains a relationship between SVM and sparse approximation when data are noisy, although the relationship is much less clear. In the presence of additive noise, we have f (xi ) = yi + δi , where yi are the measured value of f , and δi are random variables with unknown probability distribution. Substituting yi with yi +δi in equation 4.8, disregarding the constant term in k f k2H , and defining E∗ [a] = −
l X
ai yi +
i=1
l l X 1X ai aj K(xi ; xj ) + ² |ai |, 2 i,j=1 i=1
we conclude that we need to minimize the following QP problem: Problem C1. £ ¤ min E∗ [a] − a · δ a
subject to the constraint: a·1
=0
where the vector δ is unknown. In order to understand how to deal with the fact that we do not know δ , let us consider a different QP problem: Problem C2. min E∗ [a] a
1478
Federico Girosi
subject to the constraints: a·1 a a
=0 ≥η ≤ η∗
where the box parameters η and η ∗ are unknown. We solve problem C2 using the Lagrange multipliers technique for the inequality constraints, obtaining the following dual version of problem C2: Problem C3. £ ¤ max∗ min E∗ [a] − a · (β − β ∗ ) + β · η − β ∗ · η ∗
β ,β
a
subject to the constraint: a·1 β, β∗
=0 ≥0
where β and β ∗ are vectors of Lagrange multipliers. Notice now that the choice of the box parameters η and η ∗ uniquely determines β and β ∗ , and that setting δ = β − β ∗ , problems C1 and C2 are identical for what concerns the a vector: in both cases, one needs to solve a QP problem in which a linear term contains unknown coefficients. Therefore, solving problem C1 with unknown δ seems to be formally equivalent to solving problem C3 with unknown box parameters. This suggests the following argument: (1) solving problem C1 with unknown δ is formally equivalent to solving problem C3 with unknown box parameters; (2) in the absence of any information on the noise, and therefore on the box parameters, we could set the box parameters to η ∗ = −η = C1 for some unknown C; (3) for η ∗ = −η = C1 problem C3 becomes the usual QP problem of SVM (problem B.2); (4) therefore, in the total absence of information on the noise, problem C1 leads to the same QP problem of SVM, making the equivalence between sparse approximation and SVM complete. However, this argument is not very rigorous, because it does not make clear how the assumptions on η and η ∗ are reflected on the noise vector δ . Nevertheless, the formal similarity of problems C1 and C3 seems to point in the right direction, and an analysis of the relationship of η , η ∗ , and δ could lead to useful insights on the assumptions made on the noise in the SVM technique.
Sparse Approximation and Support Vector Machines
1479
Acknowledgments I thank T. Poggio and A. Verri for their useful comments and B. Olshausen for long discussions on sparse approximation.
References Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Amer. Math. Soc., 686, 337–404. Bertero, M. (1986). Regularization methods for linear inverse problems. In C. G. Talenti (Ed.), Inverse problems. Berlin: Springer-Verlag. Boser, B. E., Guyon, I. M., & Vapnik, V. N. (July, 1992). A training algorithm for optimal margin classifier. In Proc. 5th ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA. Breiman, L. (1993). Better subset selection using the non-negative garotte (Tech. Rep.). Berkeley: Department of Statistics, University of California, Berkeley. Chen, S. (1995). Basis pursuit. Unpublished doctoral dissertation, Stanford University. Chen, S., Donoho, D., & Saunders, M. (May, 1995). Atomic decomposition by basis pursuit (Tech. Rep. 479). Stanford, CA: Department of Statistics, Stanford University. Cochran, J. A. (1972). The analysis of linear integral equations. New York: McGrawHill. Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy-based algorithms for best-basis selection. IEEE Transactions on Information Theory, 38, 713–718. Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 1–25. Courant, R., & Hilbert, D. (1962). Methods of mathematical physics (Vol. 2). London: Interscience. Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM. Girosi, F. (1997). An equivalence between sparse approximation and support vector machines (A.I. Memo No. 1606). Cambridge, MA: MIT Artificial Intelligence Laboratory. Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219–269. Girosi, F., Poggio, T., & Caprile, B. (1991). Extensions of a theory of networks for approximation and learning: Outliers and negative examples. In R. Lippmann, J. Moody, & D. Touretzky (Eds.), Advances in neural information processings systems 3. San Mateo, CA: Morgan Kaufmann. Harpur, G. F., & Prager, R. W. (1996). Development of low entropy coding in a recurrent network. Network, 7, 277–284. Hochstadt, H. (1973). Integral equations. New York: Wiley. Huber, P. J. (1981). Robust statistics. New York: Wiley. Mallat, S., & Zhang, Z. (1993). Matching pursuit in a time-frequency dictionary. IEEE Transactions on Signal Processing, 41, 3397–3415.
1480
Federico Girosi
Micchelli, C. A. (1986). Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2, 11– 22. Moody, J., & Darken, C. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2), 281–294. Morozov, V. A. (1984). Methods for solving incorrectly posed problems. Berlin: Springer-Verlag. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Powell, M. J. D. (1992). The theory of radial basis functions approximation in 1990. In W. A. Light (Ed.), Advances in numerical analysis, Volume 2: Wavelets, subdivision algorithms and radial basis functions (pp. 105–210). New York: Oxford University Press. Schumaker, L. L. (1981). Spline functions: Basic theory. New York: Wiley. Smola, A., & Scholkopf, ¨ B. (1998). From regularization operators to support vector kernels. In Advances in neural information processings systems 10. Cambridge, MA: MIT Press. Stewart, J. (1976). Positive definite functions and generalizations, an historical survey. Rocky Mountain J. Math., 6, 409–434. Tibshirani, R. (1994, June). Regression selection and shrinkage via the lasso (Tech. Rep.). Toronto: Department of Statistics, University of Toronto. Available from: ftp://utstat.toronto.edu/pub/tibs/lasso.ps. Tikhonov, A. N., & Arsenin, Y. Y. (1977). Solutions of ill-posed problems. Washington, DC: W. H. Winston. Vapnik, V. N. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag. Vapnik, V. (1995). The nature of statistical learning theory. New York: SpringerVerlag. Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information Processings Systems 9 (pp. 281–287). Cambridge, MA: MIT Press. Wahba, G. (1975). Smoothing noisy data by spline functions. Numer. Math, 24, 383–393. Wahba, G. (1990). Splines models for observational data. Philadelphia: SIAM. Yosida, K. (1974). Functional analysis. Berlin: Springer-Verlag. Received September 11, 1997; accepted December 16, 1997.
LETTER
Communicated by Andrew Back
Extended Kalman Filter–Based Pruning Method for Recurrent Neural Networks John Sum High Performance Computing Laboratory, Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Lai-wan Chan Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Chi-sing Leung Department of Computer Science, University of Wollongong, Wollongong, 2522 NSW, Australia
Gilbert H. Young High Performance Computing Laboratory, Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, N.T., Hong Kong
Pruning is one of the effective techniques for improving the generalization error of neural networks. Existing pruning techniques are derived mainly from the viewpoint of energy minimization, which is commonly used in gradient-based learning methods. In recurrent networks, extended Kalman filter (EKF)–based training has been shown to be superior to gradient-based learning methods in terms of speed. This article explains a pruning procedure for recurrent neural networks using EKF training. The sensitivity of a posterior probability is used as a measure of the importance of a weight instead of error sensitivity since posterior probability density is readily obtained from this training method. The pruning procedure is tested using three problems: (1) the prediction of a simple linear time series, (2) the identification of a nonlinear system, and (3) the prediction of an exchange-rate time series. Simulation results demonstrate that the proposed pruning method is able to reduce the number of parameters and improve the generalization ability of a recurrent network. 1 Introduction Good generalization ability and fast training speed are two basic criteria used to evaluate the performance of learning methods in neural networks. Pruning is one of the methods often used to improve the generalization error. c 1998 Massachusetts Institute of Technology Neural Computation 10, 1481–1505 (1998) °
1482
John Sum et al.
It directly reduces the number of effective parameters in a neural network and, hence, the generalization error. Various methods have previously been proposed for pruning (Reed, 1993). The idea is to remove weights that will not result in much increase in the training error. We measure this ability by the error sensitivity—the degradation of the network performance with respect to the removal of the weight. Usually the estimation of the sensitivity (that is, the importance of the weight) is based on the evaluation of the second derivative of the training error. Let E(θˆ ) be the training error, where θˆ is the parametric vector containing all the weight values; the error sensitivity of the ith parameter, denoted by S(θˆi ), is approximated as follows: S(θˆi ) = E([θˆ1 , . . . , θˆi−1 , 0, θˆi+1 , . . . , θˆnθ ]T ) − E(θˆ ) Ã ! ˆ 1 2 ∂ 2 E(θ) ˆ , ≈ θi 2 ∂θ 2 ii
where (A)ii is the ith diagonal element of the matrix A and θˆ = [θˆ1 , . . . , θˆnθ ]T . nθ is the total number of weights in the network. The importance of the weight is ranked in accordance with the magnitude of S(θˆi ). The one at the bottom of the ranking list is of least importance and is removed first. This leads to the methods of optimal brain damage (OBD) (LeCun et al. 1990), which have been applied to the pruning of feedforward neural network (FNN) and the algorithm devised by With Pedersen and Hansen (1995) for the pruning of a recurrent neural network. A better estimation of the weight values can be accomplished by retraining the pruned network. Because retraining comes with an additional cost, Hassibi and Stork (1993) extended the idea of OBD to a method that can estimate these weight values, as if the network has been pruned and retrained, right at the time when the network is pruned. A shortcoming of using this pruning methods is that the term S(i) is derived based on the assumption that the training has reached its local minimum. This is undesirable since local minima occur when the networks are overtrained in most cases. Tresp, Neuneir, and Zimmerman (1996) proposed early brain damage pruning to rectify this. However, this method uses the Hessian matrix to approximate the energy surface, and the approximation becomes less accurate if the stopping point is far away from the local minimum. Thus, pruning still takes place when the network is close to its local minimum. Further, in terms of the training time, repeating the training and retraining processes is quite time-consuming, especially when the training method converges slowly. The actual time for obtaining a good network structure may thus be quite long. To overcome these problems, we can train a neural network using an effective training method such as recursive least square (RLS) (Singhal & Wu, 1989) or extended Kalman filter (EKF) (Anderson and Moore, 1979). Since the principle behind these algorithms is different, we have to formu-
Extended Kalman Filter–Based Pruning Method
1483
late a saliency measure based on the results obtained from such a training method. For feedforward networks, we have shown (Leung, Wong, Sum, & Chan, 1996) that the error sensitivity terms can be approximated closely by using the weight vectors and the covariance matrix, which are readily obtained from RLS. For recurrent networks, similar problems also appear in pruning. Traditionally, real-time recurrent learning (Williams & Zipser, 1989) or backpropagation through time method (Rumelhart, Hinton, & Williams, 1986) is employed in training the recurrent networks. Williams (1992) has formulated the RNN training as a nonlinear filtering problem and applied the EKF technique to determine the weight values. Simulation results demonstrated that the speed-up can increase by 10 times. Because this training is effective, it would be valuable to see if we can make use of the information obtained after training, the weight values and its error covariance matrix, to prune the network. Since the objective of EKF is maximizing the a posterior probability but not minimizing the training error, the error sensitivity S(i) is not appropriate for pruning the network. Therefore, instead of considering the error sensitivity, we suggest considering the sensitivity of the a posterior probability as a measure of the importance of the weight and devising a pruning procedure for recurrent neural network. We present a pruning procedure based on this idea. In the next section, a background to the EKF will be presented. The EKF-based training method for recurrent network is elucidated in section 3. Section 4 describes the pruning scheme based on probability sensitivity. Three simulated examples are given in section 5 for illustrating the effectiveness of this pruning scheme. In section 6, we discuss the necessity of the ranking heuristic and give two examples to show its usefulness. Finally, we present the conclusion in section 7. 2 Background on Extended Kalman Filter Without loss of generality, we consider a nonlinear signal model as follows: x(t + 1) = f (x(t)) + v(t),
(2.1)
y(t) = h(x(t)) + w(t),
(2.2)
where f (.) and h(.) are nonlinear function of x(t). The noise processes {v(t)} and {w(t)} are mutually independent zero mean gaussian noise, and E{v(t)v(t)T } = Q(t), E{w(t)w(t)T } = R(t). ˆ x(0) is a gaussian random variable with mean x(0) and variance P0 . Let Yt be the observations {y(0), y(1), . . . , y(t)}; the estimation of the state x(t) based on the observations Yt or Yt−1 can be accomplished by using the ˆ | t) and x(t ˆ | t − 1) be the expectation of x(t) extended Kalman filter. Let x(t given Yt and Yt−1 , respectively; the estimation of the x(t) can be achieved
1484
John Sum et al.
by plugging in the following Kalman filter equations: ˆ | t) = x(t ˆ | t − 1) + L(t)[y(t) − h(x(t ˆ | t − 1))] x(t
(2.3)
ˆ + 1 | t) = f (x(t ˆ | t)) x(t
(2.4) −1
T
L(t) = P(t | t − 1)H(t)[H (t)P(t | t − 1)H(t) + R(t)] T
P(t | t) = P(t | t − 1) − L(t)H (t)P(t | t − 1) P(t + 1 | t) = F(t)P(t | t)FT (t) + Q(t),
(2.5) (2.6) (2.7)
where ˆ | t)) ∂ f (x(t ∂x ˆ | t)) ∂h(x(t . H(t) = ∂x F(t) =
(2.8) (2.9)
ˆ | −1) and P(0 | −1) are defined as P0 and x0 if we The initial conditions x(0 know that the initial value x(0) is a gaussian random variable with mean x0 and variance P0 . In general, we may not have the information. In such a case, we can simply set x0 as a small, random number and P0 to be a diagonal matrix with large diagonal values. Theoretically, if f (x) and h(x) are linear functions of x, say, f (x) = Ax and h(x) = Cx, it is known that the probability density function for x(t), ˆ | t) and denoted by P (x(t) | Yt ), is gaussian distribution with mean x(t variance P(t | t). Similarly, P (x(t) | Yt−1 ) is gaussian distribution with mean ˆ | t−1) and variance P(t | t−1). Hence, x(t ˆ | t) is the maximum a posterior x(t ˆ | t) (MAP) estimation of x(t). In case f (x) or h(x) is a nonlinear function, x(t and P(t | t) will be only approximation of the mean and variance of the a posterior probability density function P (x(t) | Yt ). 3 EKF based training In this section, we review how the training of a recurrent neural network can be accomplished as a nonlinear state estimation problem. (For the formulation of EKF in RNN training, also refer to Williams, 1992.)1 1 At least two more approaches apart from the Williams approach using EKF to train recurrent neural network have been proposed. Puskorius and Feldkamp (1994) and Suykens, De Moor, and Vandewalle (1995) considered that the learning is acting on the weight vector alone. In each time step, the derivative of each output unit with respect to the weight vector is calculated recursively through a sensitivity network. The weight vector is then updated via the recursive least-square method. Wan and Nelson (1996) extended the idea of dual linear Kalman filter to train a recurrent-type neural network. The idea is to apply two Kalman filter equations to update the weight vector and hidden unit activity simultaneously in each time step. A main theme of this article is to elucidate how pruning can be used in conjunction with the Williams approach.
Extended Kalman Filter–Based Pruning Method
1485
To train a recurrent neural network, we assume that the training data set is generated by a stochastic signal model as follows (Williams, 1992): x(t + 1) = g(Ax(t) + Bu(t) + D) + v(t),
(3.1)
y(t + 1) = Cx(t + 1) + w(t),
(3.2)
where v(t) and w(t) are zero-mean gaussian noise. If the parameters (A, B, C, D) are known, we can use the EKF to predict the y(t + 1) (see the appendix). If the parameters are not known, we need to estimate them. In EKF (Williams, 1992), training a recurrent network is treated as a nonlinear estimation problem, where the parameters {A, B, C, D} and x(t) are the unknown states being estimated. Hence, the state equations are: x(t + 1) = g(A(t)x(t) + B(t)u(t) + D(t)) + v(t),
(3.3)
θ(t + 1) = θ(t) + e(t),
(3.4)
y(t) = C(t)x(t) + w(t).
(3.5)
Let θ be the collection of the state {A, B, C, D}. Putting x(t) and θ (t) as a single state vector, the state equations become: ·
x(t + 1) θ(t + 1)
·
¸
v(t) e(t)
= g1 (x(t), u(t), θ (t)) +
¸ (3.6)
y(t) = f1 (x(t), θ (t)) + w(t),
(3.7)
where · g1 (x(t), u(t), θ(t)) =
g(A(t)x(t) + B(t)u(t) + D(t)) θ (t)
¸
f1 (x(t), θ(t)) = C(t)x(t).
(3.8) (3.9)
Comparing equations 3.6 and 3.7 with equations 2.1 and 2.2, we can see that the simultaneous estimation of x(t) and parametric vector θ (t) can be as obtained recursively with the following recursive equations: ˆ − 1 | t − 1), u(t), θˆ (t − 1)) x(t | t − 1) = g(x(t T
P(t | t − 1) = F(t − 1)P(t − 1 | t − 1)F (t − 1) + Q(t − 1) ·
ˆ | t) x(t ˆ θ(t)
¸
(3.11)
·
=
¸ ˆ | t − 1) x(t ˆ − 1) θ(t · ¸¶ µ ˆ | t − 1) x(t + L(t) y∗ (t) − HT (t) θˆ (t − 1)
(3.10)
P(t | t) = P(t | t − 1) − L(t)HT (t)P(t | t − 1),
(3.12) (3.13)
1486
John Sum et al.
where · ˆ | t), u(t + 1), θˆ (t)) ∂ g(x(t F(t + 1) = x 0nθ ×n
¸ ˆ | t), u(t + 1), θˆ (t)) ∂θ g(x(t , Inθ ×nθ (3.14)
HT (t) = [∂xT y(t) ∂θT y(t)]
(3.15)
L(t) = P(t | t − 1)H(t)[HT (t)P(t | t − 1)H(t) + R(t)]−1 .
(3.16)
The initial P−1 (0 | 0) is set to be zero matrix, and θˆ (0) is a small random vector. Given the data set {u(t), y∗ (t)}N t=1 and iterating equations 3.10 through 3.16 N times, the parametric vector θˆ (N) will then be assigned as the network parameters. Since the actual values of Q(t) and R(t) are not known in advance, they can be estimated recursively, as in Iiguni, Sakai, and Tokumaru (1992): R(t) = (1 − αR )R(t − 1) + αR (y∗ (t) − y(t | t − 1))2
(3.17)
Q(t) = (1 − αQ )Q(t − 1) + αQ L(t)L(t)T (y∗ (t) − y(t | t − 1))2 ,
(3.18)
where αR and αQ are two small positive values. 4 Pruning Scheme The aim of the EKF is to construct an approximated posterior probability function for the hidden state (xT (N), θ T (N))T given the training set D = {u(t), y∗ (t)}N t=1 . The posterior probability function is approximated by a gaussian density function with · ¸ ˆ x(N | N) Mean = ˆ | N) ; variance = P(N | N), θ(N which are obtained recursively via equations 3.10 through 3.18 until t = N. Equivalently, this posterior probability can be expressed as follows:
P (φ(N) | D) = Pˆ (φ(N) | D) ½ 1 T −1 ˆ P (N | N) × exp − (φ(N) − φ(N)) 2 ¾ ˆ × (φ(N) − φ(N)) , ˆ = (xT (N | N) θ T (N | N))T . where φ(N) = (xT (N) θ T (N))T and φ(N) £ ¤−1/2 ˆ Pˆ (φ(N) | D) = (2π)nθ +n det(P(N | N))
(4.1)
Extended Kalman Filter–Based Pruning Method
1487
To remove excessive weights from the network, we start by considering the information given by this posterior probability, which is readily obtained from the EKF training. In accordance with equation 4.1, P(N | N) provides the cue about the shape of the probability function describing φ(N). And assuming the probability distribution is gaussian and decomposing P(N | N) into a four-block matrix: ·
Pxx (N | N) PTxθ (N | N)
Pxθ (N | N) Pθ θ (N | N)
¸ ,
(4.2)
where Pxx (N | N) ∈ Rn×n , Pxθ (N | N) ∈ Rn×nθ , and Pθ θ (N | N) ∈ Rnθ ×nθ , the posterior probability function for θ (N) can be approximated by integrating equation 4.1 with respect to x(N): Z
P (θ(N) | D) =
P (x(N), θ (N) | D)dx(N) ½ 1 = Pˆ θ |D exp − (θ(N) − θˆ (N))T P−1 θ θ (N | N) 2 ¾ × (θ (N) − θˆ (N)) ,
(4.3)
where £ ¤−1/2 Pˆ θ |D = (2π)nθ det(Pθ θ (N | N)) . Let θˆ k (N) be the parametric vector with all elements equal to θˆ (N | N) except that the kth element is set to be zero, that is, θˆ k (N) = [θˆ1 (N | N), . . . , θˆk−1 (N | N) 0 θˆk+1 (N | N), . . . , θˆnθ (N | N)]T . Thus, the posterior probability of θˆ k (N) given D can readily be written as follows: ½ ´ ¾ 1 ³ P (θˆ k (N) | D) = Pˆ θ |D exp − θˆk2 P−1 (N | N) , (4.4) θθ kk 2 Pθ θ (N | where (P−1 θ θ (N | N))kk is the kth diagonal element of the ³ inverse of ´
N). Obviously, the smaller the value of the factor θk2 P−1 θ θ (N | N)
P (θ k (N)
kk
, the
| D). Figure 1 interprets this idea higher the posterior probability graphically. Suppose that the posterior probability is a gaussian distribution with ˆ where θˆ = (θ1 , θ2 ) and variance P. The maximum a posterior probamean θ, bility is P (θˆ | D). Suppose θ2 is set to zero; the posterior probability reduces to around 0.75P (θˆ | D). If θ1 is set to zero, the posterior probability will
1488
John Sum et al.
Figure 1: Probabilistic pruning. We use the EKF-based training method to obtain an approximated posterior probability distribution for the network parametric ˆ where θˆ = (θ1 , θ2 ), and vector θ . It is a gaussian distribution with mean θ, variance P. The maximum a posterior probability is P(θˆ | D). Suppose θ2 is set to be zero; the posterior probability reduces to around 0.75P(θˆ | D). If θ1 is set to be zero, the posterior probability will reduce to smaller than 0.7P(θˆ | D). This suggests that θ2 should be ranked higher and should be eliminated before θ1 , since the posterior probability is not that sensitive to the change of θ2 compared with θ1 .
reduce to a value smaller than 0.7P (θˆ | D). This suggests that θ2 should be eliminated, because the posterior probability is not that sensitive to the change of θ2 compared with θ1 . Hence, the pruning procedure can be summarized as follows: 1. Initialization ˆ a. Set θ(0) as a small, random vector. b. Set P(0 | 0) to be a diagonal matrix with a very large value. c. Set x(0 | −1) to be a small, random vector. 2. Training a. Use the recursive equations to obtain θˆ (N). b. Check the before-pruning one-step prediction error, denoted by Ebp , based on the method described in the appendix.
Extended Kalman Filter–Based Pruning Method
1489
3. Pruning a. Decompose the matrix P(N | N) into a block matrix to get Pθ θ . 2 −1 b. Evaluate P−1 θ θ , and hence θk (Pθ θ (N | N))kk , for all k from 1 to nθ .
c. Rearrange the ³ saliency ´index {πk } according to the ascending . order of θk2 P−1 θ θ (N | N) kk
d. Set k = 1. e. While (k ≤ nθ ), i. Set θπk to zero and check the validation error using the method described in the appendix. ii. If the error is greater than a predefined threshold, set θπk back to its original value. iii. k = k + 1. The validation error is used for checking whether the weight should be removed. The pruning does not stop until all the weights have been checked. In contrast to the conventional pruning procedures such as OBD or optimal brain surgeon (OBS), the pruning procedure stops once the validation error is larger than a threshold value. In some problems, if the available data set is small, we can simply treat the training data set as the validation set in step 3e. Remark. There are other proposals that share ideas similar to ours. Finnoff, Hergert, and Zimmermann (1993) used the ratio of the weight magnitude over its fluctuation as a measure of the importance of a weight. The fluctuation is approximated by the standard deviation of the change of the weight value during training. Cottrell et al. (1995) independently suggested an idea where the fluctuation is approximated by θk2 (6 −1 )kk , where 6 is the Hessian matrix describing the shape of the error surface and simultaneously the shape of probability distribution around the estimated weight vector. Larsen (1996) generalized these ideas, together with OBD and OBS, under a statistical pruning framework. All of these methods and ours share one similar point: if the mean value of a weight is small but its value’s fluctuation is large, this weight should not be an important one. It should be removed with higher priority. There are three major differences between our approach and theirs: (1) The problem we are dealing with is recurrent neural network pruning, while the other work contributes to the pruning of feedforward neural network. (2) The motivation of the probabilistic pruning idea is due to using the EKF approach to train the neural network. (3) Practically, we suggest no retraining and reestimation of the network weights. Besides, we have no problem
1490
John Sum et al.
Table 1: Comparison of Pruning Methods.
Methods
Measure of Weight Importance
OBD OBS Finnoff et al. (1993) Cottrell et al. (1995) Larsen (1996) Ours
Error sensitivity Error sensitivity Sensitivity in probability sense Sensitivity in probability sense Sensitivity in probability sense Sensitivity in probability sense
Retrain √ × √ √ × ×
Reestimate
Model
× √
FNN FNN FNN FNN FNN RNN
× × √ ×
Note: Because Larsen (1996) has unified the idea of statistical pruning to both OBD and OBS, we include them in this table for comparison.
in determining the number of weights to be removed. We summarize these similarities and differences in Table 1. 5 Simulation Results In this section, we demonstrate the efficacy of the proposed pruning scheme through three examples: a simple time-series prediction problem, a nonlinear single-input, single-output system identification problem (adapted from Narendra & Parthasarathy, 1992), and a real-life example—the prediction of the exchange rate. 5.1 Simple Linear Time Series. The linear time series we used is defined as follows: y(t) = 0.6 sin(t/30) + 0.2 sin(t/5) + 0.01w(t), where w(t) is a zero mean unit variance gaussian noise. Eight hundred data are generated—600 for training and 200 for testing. The recurrent network is constituted by one input unit, six hidden units, and one output unit. Thus, the total number of weights is 54. The weight values are initialized randomly around zeros with small variance. The same experiment is repeated for 50 trials. Let errbp and errap be the testing error before and after pruning, respectively. The threshold is set to be the training error. Figures 2 and 3 show the statistics of the ratio errap /errbp and the number of weights removed for these 50 trials. It is observed that the ratios of errap and errbp are smaller than 1 in 47 out of 50 cases, and in most of the cases, the number of weights being removed is equal to or more than 20, which is about 0.37 of the total number of weights. This indicates that the above algorithm is able to improve the generalization of a recurrent neural network in simple linear time series.
Extended Kalman Filter–Based Pruning Method
1491
20 18 16
Frequency
14 12 10 8 6 4 2 0 0.4
0.5
0.6 0.7 0.8 0.9 Error_after_prune/Error_before_prune
1
1.1
Figure 2: The histogram of the ratio errap /errbp . The vertical axis corresponds to the frequency, and the horizontal axis corresponds to the ratio. In almost all cases, the ratios are smaller than one. This means that pruning can improve generalization ability.
5.2 Nonlinear System Identification. We follow an example mentioned in Narendra and Parthasarathy (1992). The nonlinear system is described by the second-order nonlinear difference equations as follows: x˜ 1 (k) + 2x˜ 2 (k) + u(k) 1 + x˜ 22 (k) x˜ 1 (k)x˜ 2 (k) + u(k) x˜ 2 (k + 1) = 1 + x˜ 22 (k) yp (k) = x˜ 1 (k) + x˜ 2 (k). x˜ 1 (k + 1) =
(5.1) (5.2) (5.3)
The recurrent network is composed of 1 output unit (y(k)), 10 hidden units (x(k)), and 1 input unit (u(k)). During training, an independently and identically distributed (i.i.d.) random signal with a uniform distribution over [−1, 1] was used as the input to the nonlinear plant. One thousand input-output pairs are generated: the first 800 for training and the last 200 for validation. In this example, the input data are random signal and the system is complex, so 800 pairs are not sufficient for network training. Therefore, we have to retrain the network using the same training set several times. We
1492
John Sum et al. 12
10
Frequency
8
6
4
2
0 0
5
10
15 20 25 30 35 Number of weight being pruned
40
45
50
Figure 3: Histogram of the number of weights pruned away. The horizontal axis corresponds to the number of weights pruned away, while the vertical axis corresponds to the frequency. In most of the trials, the number of weights pruned away is more than 30, about three-fifths of the total number of weight connection.
suggest the following alternative steps for network training: 1. Pass through the training sequence {u(k), y∗ (k)}N k=1 and use the Kalman filter formulation to learn the network parameters. 2. Fix the network parameters and pass the validation data set. 3. Evaluate the validation error by comparing the network output with the actual plant output. 4. If the validation error does not converge, go back to step 1. Otherwise stop. Once the training process is finished, we follow the steps stated in section 4 to prune the network. The training parameters are depicted in Table 2. Figure 4 shows the results after the network training has been finished. Figure 4a shows the output of the network and the output of the plant. It is observed that the recurrent network can closely predict the plant output. To demonstrate that generalization can be improved if some weights in the neural network have been pruned, we feed in another set of input signals—a sine signal, in this example—to test the pruned recurrent neural
Extended Kalman Filter–Based Pruning Method
1493
Table 2: Values of Parameters Used for Nonlinear System Identification. Training Number of hidden units Initial hidden activity Initial weight αR αQ P(0) Q(0) R(0) Testing αR αQ Px (0) Qx (0) Rx (0)
10 0.01 × randno 0.0004 × randno 0.005 0.005 30I140×140 I140×140 1 0.005 0.005 30I10×10 I10×10 1
Table 3: Pruning Results for the Nonlinear System Identification (average of 50 trials). δ
Validation Error
Testing Error
Number of Weights Removed
Initial 0.1 0.2 0.3 0.4 0.5
0.3325 0.3581 0.3986 0.4295 0.4520 0.4977
0.6976 0.6068 0.5962 0.6058 0.6892 0.6747
— 15 15 21 22 25
network and the nonlinear plant. Figure 4b shows the case when the input is fed with this sine signal: ¶ µ 2πk . u(k) = sin 25 It shows that the trained network is able to model the nonlinear system. Again, we let Ebp be the validation error before pruning. We define the threshold as (1 + δ)Ebp . Five values of δ (0.1, 0.2, 0.3, 0.4, 0.5) are examined. The results are depicted in Table 3. Figure 5 shows the output of the network and the plant for the cases when δ is equal to 1/10. By setting δ = 0.1, the number of weights removed increases to 15. Although the validation error with δ = 0.1 is larger than the error before pruning and the testing error is smaller than the error before pruning, we
1494
John Sum et al.
(a) Training and validation
(b) Testing
6
6
4
4
2 2 0 0 −2 −2 −4
−4
−6
−6 700
750
800
850
900
950
1000
−8 0
10
20
30
40
50
60
70
80
90
100
Figure 4: The result after the network training. The solid lines correspond to the output of the actual plant, and the dot-dash lines correspond to the output of the recurrent network. (a) Part of the training data and validation data when the input is random signal. (b) The testing data when the input is sine signal. The validation set consists of those data from 801 to 1000 in (a).
(a)
(b)
6
6
4
4
2 2 0 0 −2 −2 −4
−4
−6 0
−6
20
40
60
80
100
120
140
160
180
200
−8 0
10
20
30
40
50
60
70
80
90
100
Figure 5: Output of the recurrent network after pruning with δ setting to be 1/10. The solid lines correspond to the actual plant output, while the dot-dash lines correspond to the network output. (a) Part of the training data and validation data when the input is random signal. (b) The testing data when the input is sine signal.
can observe from Figure 5 that the network can still closely predict the output of the true system. This example shows that the proposed pruning scheme can improve the
Extended Kalman Filter–Based Pruning Method
1495
Table 4: Values of Parameters Used for Predicting the Exchange Rate. Training Number of hidden units Initial hidden activity Initial weight αR αQ P(0) Q(0) R(0) Testing αR αQ Px (0) Qx (0) Rx (0)
10 0.001 × randno 0.0001 × randno 0.005 0.005 30I140×140 0.01I140×140 1 0.005 0.005 I10×10 Qe after training R after training
generalization ability of a recurrent network. It also demonstrates that setting the threshold to a value larger than the validation error before pruning may lead to an even better neural network model with fewer weight connections but better generalization. 5.3 Exchange Rate Prediction. We apply the proposed method to predict the daily U.S. dollar/deutsche mark exchange rate z(t). The range of the data are selected from November 29, 1991 to November 3, 1994—altogether 764 working days; 564 data are used as training data and the last 200 as testing data. The recurrent neural network model is constituted by 1 input unit, 10 hidden units, and 1 output unit. The input is fed with small, random noise during training, and the output of the network is to predict the {ln z(t)} sequence. Altogether there are 130 connection weights. The training method is the same as the one we used in the previous example, except the initial condition of the matrix Q. The data are passed to the network 10 times. After each pass, the testing error is evaluated based on the method discussed in the appendix. The training stops once the testing error converges. The training parameters are depicted in Table 4. Similar to that of the previous example, after training, we prune the network and check the resultant testing error. Seven values of δ are examined: 0 0.1, 0.2, 0.3, 0.4, 0.5, 0.6. Table 5 and Figure 6 show the testing error and the number of weights being removed after pruning. They show that as δ increased, more and more weights were pruned away. The testing error dropped from 0.00080 to a minimum of 0.00032 when δ = 0.2. This gives the best network architecture (with 49 connection weights) for this problem. As δ increased further, the testing error increased again. When δ = 0.5, the testing error rose back to
1496
John Sum et al.
Table 5: Pruning Result for the Exchange Rate Prediction. δ
Testing Error
Number of Weights Removed
Before pruning 0.0 0.1 0.2 0.3 0.4 0.5 0.6
0.00080 0.00037 0.00038 0.00032 0.00048 0.00054 0.00081 0.00113
— 48 79 81 84 85 88 93
(a) RMS testing error vs. δ
(b) Number of weights removed vs. δ
−4
12
x 10
95
11
90
10
85 80
9
75
8 70
7 65
6 60
5 55
4 3 0
50
0.1
0.2
0.3
0.4
0.5
0.6
45 0
0.1
0.2
0.3
0.4
0.5
Figure 6: The mean square testing error and the number of weights removed against the value of δ in the prediction of exchange rate.
the level before pruning took place. At this moment, only 42 weights were left, compared to the original one with 130 weights. 6 The Saliency Ranking We have shown that our pruning method can improve the generalization ability. Next, we closely examine the ranking generated from our saliency term. 6.1 Nonlinear System Identification. We repeat the experiment using the data from example 2 (see section 5.2). Everything remains the same except the pruning procedure—step 3(c) in section 4. Instead of using the
0.6
Extended Kalman Filter–Based Pruning Method
(a) Average no. of weights pruned
1497
(b) Avg. Testing error
28 Heuristic 26
Heuristic
Random
Random 0.8
24 0.7
22
20
0.6
18 0.5
16 0.4
14
12 0
0.1
0.2
0.3
0.4
0.3 0
0.5
0.1
0.2
0.3
0.4
0.5
Figure 7: Comparison between the cases when the heuristic is imposed (solid lines) and not imposed (dot-dash lines). The results for the heuristic case are from section 5.
Table 6: Pruning Results for the Nonlinear System Identification when Ranking Heuristic Is Not Imposed. δ
Validation Error
Testing Error
Number of Weights Removed
Initial 0.1 0.2 0.3 0.4 0.5
0.3325 0.3281 0.3976 0.4280 0.4613 0.5011
0.6976 0.6703 0.7041 0.7470 0.8430 0.9014
— 14 16 19 23 26
ranking obtained from our saliency term, we replace it by a random list {π1 , π2 , . . . , πnθ }. We set the threshold to be (1 + δ)Ebp , where δ is 0.1, 0.2, 0.3, 0.4, and 0.5, respectively. The experiment is repeated for 50 trials. The average testing errors and the average number of weights removed are shown in Figure 7. The dot-dash lines correspond to random ranking pruning, while the solid lines correspond to the saliency ranking pruning. The numerical data for random ranking are shown in Table 6. From these figures, we can observe that both methods removed a similar number of weights. Saliency ranking pruning, however, generated a smaller testing error compared with the random ranking. Besides, by inspecting
1498
John Sum et al. δ = 0.2
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0
20
40
60
80
100
120
Figure 8: Normalized frequency curve for random ranking case. The parameter δ is set to be 0.2, and the ranking of the weight importance is initialized randomly.
the normalized frequency curves2 that show the normalized frequency of a weight being removed, we find that the heuristic rank list is a good cue for weight pruning. Figure 8 shows the normalized frequency curve for the case when δ = 0.2 and the weight ranking is initialized randomly. Because random ranking indicates no information on which weight is not important, we have to search the whole list to make sure that the pruning is finished. It is rather time-consuming and computationally costly. If we check carefully on the set of weights being removed and plot the normalized frequency curves against the saliency ranking list (see Figure 9), it is found that the weights being pruned away based on random ranking are indeed located in the first half of the heuristic ranking list. This reveals that no matter which weights are being pruned away by using random ranking or saliency ranking, the actual set of weights being removed is located in 2 As we have repeated the experiment for 50 trials, the normalized frequency for the kth-weight is defined as total number of times it is being pruned/50.
Extended Kalman Filter–Based Pruning Method
1499
(a) δ = 0.1
(b) δ = 0.2 Rand
Rand
Heuristic
Heuristic
1
1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0 0
20
40
60
80
100
120
0 0
20
40
(c) δ = 0.3
60
80
Rand
Rand
Heuristic
Heuristic 1
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
20
40
60
80
120
(d) δ = 0.4
1
0 0
100
100
120
0 0
20
40
60
80
100
120
Figure 9: Comparison between saliency rank pruning and random rank pruning. The dot-dash lines are the frequency curves for the heuristic case; the solid lines are the frequency curves for the random ranking case. The horizontal axis corresponds to the index of weight (according to the saliency ranking).
the beginning portion of the saliency ranking list and their chance of being removed decreases as their locations are farther from the beginning. That means that based on the heuristic ranking list, searching for the whole list is not necessary.3
3
It does not mean that the algorithm could be terminated once the first weight is found that cannot be pruned. Observed from Figure 9, it appears that the weights pruned using the saliency measure are always those whose saliency falls below a certain threshold. We have found experimentally that if δ is set to be 0.05 or smaller, the second weight on the list will not be pruned away, but the third one will. Therefore, if the algorithm terminates once the first weight is found that cannot be pruned, we can prune away only one weight.
1500
John Sum et al.
Table 7: Pruning Result for a Recurrent Neural Network Being Trained to Identify a Simple Linear State-Space Model when Ranking Heuristic Is Imposed. δ
Validation Error
Testing Error
Number of Weights Removed
Initial 0 0.05 0.10 0.15 0.20
0.0769 0.0426 0.0458 0.0412 0.0512 0.0410
0.0318 0.0774 0.0656 0.0734 0.0509 0.0991
— 6 11 13 17 21
6.2 Linear System Identification. The model discussed in this subsection is a simple linear state-space model: x1 (t + 1) = 0.7x1 (t) + 0.08x2 (t) + u(t)
(6.1)
x2 (t + 1) = x1 (t)
(6.2)
y(t) = 0.22x1 (t).
(6.3)
A recurrent neural network consisting of one input unit, five hidden units, and one output unit is being trained. During training, 500 input-output data pairs are generated. The inputs are randomly (uniformly) drawn from [−1, 1]. After training, the network is pruned and tested by a new input signal (the testing set): ¶ µ 2π k . u(k) = sin 25 The experiment is repeated for 20 trials, and the results obtained by using ranking are shown in Table 7. When ranking heuristic is imposed, the testing error increases progressively as δ increases. On the other hand, when ranking heuristic is not imposed, the testing error suddenly increases when δ is increased from 0 to 0.05 (see Table 8). Similarly, this indicates that the ranking heuristic has encoded the importance of the weight with respect to the testing data. 7 Conclusion We have presented a pruning procedure for the recurrent neural network. The essence of this pruning method is the utilization of the result obtained from EKF training: the parametric vector θˆ (N) and the covariance matrix P(N|N). Instead of considering the error sensitivity as a measure of the importance of the weight, we take the a posterior probability sensitivity. In accordance with the theory of optimal filtering, θˆ (N) and P(N|N) can be
Extended Kalman Filter–Based Pruning Method
1501
Table 8: Pruning Result for a Recurrent Neural Network Being Trained to Identify a Simple Linear State-Space Model and Ranking Heuristic Is Not Imposed. δ
Validation Error
Testing Error
Number of Weights Removed
Initial 0 0.05 0.10
0.0769 0.0593 0.0427 0.0439
0.0318 0.0603 0.1580 0.1400
– 5 9 16
treated as an approximation of the mean and covariance of this a posterior probability. The sensitivity can thus be calculated. Applying this pruning method together with the recurrent neural network to three problems such as the prediction of a linear time series, the modeling of a nonlinear system, and the prediction of the exchange rate, we observed that the proposed procedure can reduce the number of weights of a recurrent network and improve the generalization ability of those networks. We have also demonstrated that the heuristic ranking list is a good cue for the removal of weight. Comparing the results with random ranking listing on two problems, a nonlinear system identification and a linear system identification, we observed that the ranking list generated by the factor θk2 (P−1 θ θ (N | N))kk not only tells which weight is more important but also helps to improve the generalization ability. There are many ways of applying the RLS method or EKF in feedforward neural network training (Kollias & Anastassiou, 1989; Singhal & Wu, 1989; Chen, Cowan, Billings, & Grant, 1990; Puskorius & Feldkamp, 1991; Ruck, Rogers, Kabriskey, Maybeck, Oxley, 1992; Shah, Palmeieri, & Datum, 1992). The pruning scheme we have proposed can be readily applied in conjunction with these methods. Besides, because the EKF is an adaptive method for state estimation and parameter identification, we suspect that this pruning scheme can readily be extended as an adaptive pruning method, and, hence, the search for a better model can be feasible for time-varying systems. Recently, a considerable amount of effort has addressed the generalization and the pruning of a feedforward neural network (Reed, 1993; Sjoberg ¨ & Ljung, 1995; Weigend, Rumelhart, & Huberman, 1991). The generalization and the pruning problem of a recurrent neural network have rarely been discussed. With Pedersen and Hansen (1995) derived a recursive algorithm for evaluating the second-order derivative of the error function so that pruning based on error sensitivity can be possible. Wu and Moody (1996) derived a smoothing regularizer for recurrent neural networks so as to improve the RNN generalization ability. The results presented in this article may shed some light on the development of a more effective pruning method.
1502
John Sum et al.
Appendix: Estimation of the Testing Error To do a one-step prediction based on the trained recurrent network, we need to apply the EKF equation again. Once the training has been finished and the network parameter θ is fixed, we have the only hidden variable x(t) as an unknown. To predict the output y(t) based on the past information y(t − 1), . . . , y(1) and u(t), u(t − 1), . . . , u(1), we first estimate the value of state x(t) and then use this value to predict the output y(t). The estimation of state x(t) can readily be achieved by using Kalman filter equations. Recall that the model of a recurrent network can be defined as a statespace model: x(t + 1) = g(Ax(t) + Bu(t) + D) + v(t),
(A.1)
y(t + 1) = Cx(t + 1) + w(t).
(A.2)
Here, we assume that both the hidden state and output are contaminated by zero-mean gaussian noise. Let Px , Lx be the estimated covariance matrix and the Kalman gain, respectively, for the estimate x. Moreover, we let Qx and Rx be the system noise covariance and the measurement noise covariance. The prediction y(t | t − 1) will then be obtained using the following equations: x(t | t − 1) = g(Ax(t − 1 | t − 1) + Bu(t) + D) Px (t | t − 1) = Fx (t − 1)Px (t − 1 | t −
1)FTx (t
(A.3)
− 1) + Qx (t − 1)
(A.4)
y(t | t − 1) = Cx(t | t − 1)
(A.5) ∗
x(t | t) = x(t | t − 1) + Lx (t)(y (t) − y(t | t − 1))
(A.6)
Px (t | t) = Px (t | t − 1) − Lx (t)CPx (t | t − 1),
(A.7)
where Fx (t + 1) = ∂x g(x(t | t), u∗ (t + 1), θ (t)),
(A.8)
Lx (t) = Px (t | t − 1)C[CPx (t | t − 1)CT + Rx (t − 1)]−1 .
(A.9)
Here we encounter the same problem as in the case of training. We do not have any information about the covariance matrix Rx and Qx , so we need to apply the same technique to estimate the values for both of them: Rx (t) = (1 − αR )Rx (t − 1) + αR (y∗ (t) − y(t | t − 1))2 T
∗
(A.10) 2
Qx (t) = (1 − αQ )Qx (t − 1) + αQ Lx (t)Lx (t) (y (t) − y(t | t − 1)) .(A.11) Equations A.3 to A.11 can be viewed as follows. Suppose x(t − t | t − 1) is the near-optimal estimation of x(t − 1) based on the information given up to
Extended Kalman Filter–Based Pruning Method
1503
the (t − 1)st time step. In accordance with these values, we can estimate x(t) from equation A.3 and predict the output y(t | t − 1) as well. Once we have the true value y∗ (t), the one-step prediction error can be obtained. Based on this prediction error, we can estimate the value of x(t) using equation A.6 and update the system noise covariance and measurement noise covariance using equations A.10 and A.11. Let the number of testing data be Nt ; the one-step prediction error can be defined as Nt 1 X (y∗ (i) − y(t | t − 1))2 . Nt i=1
However, because we know that the value x(t) estimated during the transient is usually not near optimal, the prediction inevitably would be large. It is better to define the prediction error based on the output value when the estimation of x(t) has reached the near-optimal point. Of course, we do not known when this will happen, so we simply assumed that the estimation of x(t) will reach the near-optimal point when t > 0.1Nt . Thus, the prediction error is defined as follows: Etest =
Nt X 1 (y∗ (i) − y(t | t − 1))2 , Nt − τ i=τ +1
(A.12)
where τ = 0.1Nt . Acknowledgments We thank the anonymous referees for their valuable comments and Peter Tam for his comment on the use of extended Kalman filter on state estimation. References Anderson, B. D. O., & Moore, J. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice Hall. Chen, S., Cowan, S., Billings, S. A., & Grant, P. M. (1990). Parallel recursive prediction error algorithm for training layered neural networks. International Journal of Control, 51(6), 1215–1228. Cottrell, M., et al. (1995). Neural modeling for time series: A statistical stepwise method for weight elimination. IEEE Transactions on Neural Networks, 6(6), 1355–1362. Finnoff, W., Hergert, F., & Zimmermann, H. G. (1993). Improving model selection by nonconvergent methods. Neural Networks, 6, 771–783. Hassibi, B., & Stork, D. G. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in neural information processing systems (pp. 164–171). San Mateo, CA: Morgan Kaufmann.
1504
John Sum et al.
Iiguni, Y., Sakai, H., & Tokumaru, H. (1992). A real-time learning algorithm for a multilayered neural network based on the extended Kalman filter. IEEE Transactions on Signal Processing, 40(4), 959–966. Kollias, S., & Anastassiou, D. (1989). An adaptive least squares algorithm for the efficient training of artificial neural networks. IEEE Transactions on Circuits and Systems, 36(8), 1092–1101. Larsen, J. (1996). Design of neural network filters. Unpublished doctoral dissertation, Technical University of Denmark. LeCun, Y., Denker, J., & Solla, S. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 396–404). San Mateo, CA: Morgan Kaufmann. Leung, C. S., Wong, K. W., Sum, J., & Chan, L. W. (1996). On-line training and pruning for RLS algorithms. Electronics Letters, 7, 2152–2153. Narendra, K. S., & Parthasarathy, K. (1992). Neural networks and dynamical systems. International Journal of Approximate Reasoning, 6, 109–131. Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. Proceedings of IJCNN’91, 1, 771–111. Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297. Reed, R. (1993), Pruning algorithms—A survey. IEEE Transactions on Neural Networks, 4(5), 740–747. Ruck, D. W., Rogers, S. K., Kabrisky, M., Maybeck, P. S., & Oxley, M. E. (1992). Comparative analysis of backpropagation and the extended Kalman filter for training multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6), 686–691. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Parallel distributed processing: Exploration in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press. Shah, S., Palmeieri, F., & Datum, M. (1992). Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5, 779–787. Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann. Sjoberg, ¨ J., & Ljung, L. (1995). Overtraining, regularization and searching for a minimum, with application to neural networks. Int. J. Control, 62, 1391–1407. Sum, J., Leung, C. S., & Chan, L. W. (1996). Extended Kalman filter in recurrent neural network training and pruning (Tech. Rep. CS-TR-96-05). Hong Kong: Department of Computer Science and Engineering, Chinese University of Hong Kong. Suykens, J., De Moor, B., & and Vandewalle, J. (1995). Nonlinear system identification using neural state space models, applicable to robust control design. International Journal of Control, 62(1), 129–152. Tresp, V., Neuneier, R., & Zimmermann, H. (1996). Early brain damage. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press. Wan, E. A., & Nelson, A. T. (1996). Dual Kalman filtering methods for nonlinear
Extended Kalman Filter–Based Pruning Method
1505
prediction, smoothing, and estimation. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press. Weigend, A., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination applied to currency exchange rate prediction. In International Joint Conference on Neural Networks (pp. 2374–2379). Singapore. Williams, R. J. (1992), Training recurrent networks using the extended Kalman filter. Proceedings of the IJCNN’92 Baltimore, 4, 241–246. Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280. With Pedersen, M., & Hansen, L. K. (1995). Recurrent networks: Second order properties and pruning. In G. Tesauro et al. (Eds.), Advances in neural information processing systems 7 (pp. 673–680). Cambridge, MA: MIT Press. Wu, L., & Moody, J. (1996). A smoothing regularizer for feedforward and recurrent neural networks. Neural Computation, 8, 461–489. Received June 19, 1996; accepted October 7, 1997.
LETTER
Communicated by Misha Tsodyks
Transform-Invariant Recognition by Association in a Recurrent Network N´estor Parga Edmund Rolls Oxford University, Department of Experimental Psychology, Oxford OX1 3UD, England
Objects can be reconized independently of the view they present, of their position on the retina, or their scale. It has been suggested that one basic mechanism that makes this possible is a memory effect, or a trace, that allows associations to be made between consecutive views of one object. In this work, we explore the possibility that this memory trace is provided by the sustained activity of neurons in layers of the visual pathway produced by an extensive recurrent connectivity. We describe a model that contains this high recurrent connectivity and synaptic efficacies built with contributions from associations between pairs of views that is simple enough to be treated analytically. The main result is that there is a change of behavior as the strength of the association between views of the same object, relative to the association within each view of an object, increases. When its value is small, sustained activity in the network is produced by the views themselves. As it increases above a threshold value, the network always reaches a particular state (which represents the object) independent of the particular view that was seen as a stimulus. In this regime, the network can still store an extensive number of objects, each defined by a finite (although it can be large) number of views. 1 Introduction Single neurons with responses that are relatively invariant with respect to, for example, the position, size, and even view of an object or face are present in the primate temporal visual cortical areas (see, e.g., Gross, Desimone, Albright, & Schwartz, 1985; Tanaka, Saito, Fukada, & Moriya, 1990; Rolls, 1984, 1992, 1994, 1995, 1996b; Rolls, Booth, & Treves, 1996). How could such invariant representations be formed? One suggestion is that there is a short-term memory trace built into the learning rule implemented in the visual system, which enables, for example, successive views of the same object to be associated together (Foldiak, 1991; Rolls, 1992, 1994, 1995, 1996b). Because the statistics with which objects are normally viewed in the visual world result in different aspects (e.g., views) of the same object being seen close together in time, such a learning rule might enable different views of c 1998 Massachusetts Institute of Technology Neural Computation 10, 1507–1525 (1998) °
1508
N´estor Parga and Edmund Rolls
objects to be associated together. In contrast, views of different objects only rarely, on average, occur close in time. Thus, a neuron could learn by a simple form of Hebbian associativity coupled with a short-term memory trace to respond to any views of an object but to no views of other objects. Foldiak (1991) showed that translation-invariance learning over a one-dimensional input array was possible for a simple winner-take-all network with a decaying trace of previous neuronal activity and an associative Hebb rule. Rolls (1992, 1994, 1995) and (Rolls & Treves, 1997) suggest that invariant representations could be formed in the visual system for two-dimensional images using a multistage architecture, with convergence onto a neuron at any one layer from a small region of the preceding layer, a short-term trace of preceding neuronal activity, soft competition between the neurons to produce distributed representations, and a Hebb-like learning rule. In the theory as suggested, the trace was implemented by a short-term trace in the postsynaptic neuron, and this enabled neurons to learn which inputs from the preceding stage tended to occur close together in time. Possible neurophysiological mechanisms suggested for the trace included the continuing firing of single neurons in the visual system, which often lasts for 300 ms following a 20 ms presentation of a stimulus (Rolls & Tovee, 1994; Rolls, Tovee, Purcell, Stewart, & Azzopardi, 1994), which could be implemented by recurrent collateral connections between pyramidal cells in the same layer or in adjacent cortical stages in the cortico-cortical hierarchy of stages in the visual system (see Rolls, 1992, 1994, 1995); the rather slow unbinding of glutamate from the NMDA receptors after they have been activated (this may be seen after even 100 ms); and slow changes intracellularly induced after the Ca2+ entry, which is one step in the induction of long-term potentiation. Wallis, Rolls, & Foldiak (1993) and Wallis and Rolls (1997) produced a simulation, VisNet, of this theory of the formation of invariant representations in the visual system proposed by Rolls (1992, 1994, 1995, 1996a, 1996b). The simulation showed that translation, size, and view invariance could be learned by such a network. In developing these ideas further, we turn to an approach that allows an analytic formalism to be brought to bear on the issue of the storage capacity of a recurrent network, which performs, for example, view-invariant recognition of objects by associating together different views of the same object that tend to occur close together in time. The architecture with which the invariance is computed is a little different from that already described. In the model of Rolls (1992, 1994, 1996b; Wallis & Rolls, 1997), the postsynaptic memory trace enabled different afferents from the preceding stage to modify their synapses onto the same postsynaptic neuron (see Figure 1). In that model there were no recurrent connections between the neurons, although such connections were one way in which it was postulated the memory trace might be implemented, by simply keeping the representation of one view or aspect active until the next view appeared. Then an association would occur
Transform-Invariant Recognition by Association in a Recurrent Network 1509
Input from preceding stage
Trace learning rule implemented on the inputs to the network
Figure 1: A trace learning rule is implemented in the feedforward inputs to a nonrecurrent network.
between representations that were active close together in time (within, e.g., 100–300 ms). In our model, there is a set of inputs with fixed synaptic weights to a network. The network itself is a recurrent network. For the purposes of this article, we are concerned primarily with how the recurrent network would operate once the synaptic matrix has been formed, not with how the synaptic matrix is formed. In section 4, we will describe a set of neuronal operations that could lead to our synaptic matrix. Let us say for the moment that in the case of a recurrent network, we expect that the trace rule is implemented on the recurrent collaterals (see Figure 2). In the context of recurrent networks, we can consider two main approaches. First, a very simple approach to store in a synaptic weight matrix the s views of an object. This consists of equally associating all the views to each other, including the association of each view with itself (that is, the diagonal terms of the association matrix). Choosing in Figure 3 an example such that objects are defined in terms of five different views, this might produce (if each view produced firing of one neuron at a rate of 1) a block of 5×5 pairs of views, contributing to the synaptic efficacies each with value 1. Object 2 might produce another block of synapses of value 1 further along the diagonal and symmetric about it. Each object or memory could then be thought of as a single attractor with a distributed representation involving five elements (each element representing a different view). Then the capacity of the system in terms of the number Po of objects that can be stored is the number of separate attractors that can be stored in the network. For random fully distributed patterns, this is as shown numerically by Hopfield (1982), Po = 0.14 C,
(1.1)
where there are C inputs per neuron (and N = C neurons if the network is fully connected). The synaptic matrix envisaged here does not consist of
1510
N´estor Parga and Edmund Rolls
Fixed input from the preceding stage
Trace or pairwise learning rule implemented on the recurrent collateral inputs to the autoassociation network. Figure 2: A trace or pairwise associative learning rule is implemented in the recurrent collateral synapses of an autoassociative memory.
random fully distributed binary elements; instead we will assume that it has a sparseness w = Ns n, where s is the number of views stored for each object, from any of which the whole representation of the object must be reconized. (n is the average number of active neurons in the element, or portion of the pattern, that describes one view.) In this case, one can show (as in Gardner, 1988; Tsodyks & Feigel’man, 1988) that the number of objects, Po =
kC , w ln(1/w)
(1.2)
where C is the number of synapses on each neuron devoted to the recurrent collaterals from other neurons in the network, and k is a factor of order one that depends weakly on the detailed structure of the rate distribution, the connectivity pattern, and so forth. A problem with this proposal is that as the number of views per object increases to a large number (e.g., > 20), the network will fail to retrieve correctly the internal representation of the object starting from any one view (which is only a fraction 1/s of the length of the stored pattern that represents an object). This is because if the cue given by one of the views becomes too small, the internal representation of the stimulus will fall outside the attraction basin of the attractor associated to the object. The second approach, which we discuss in detail here, is to consider the operation of the network when the associations between pairs of views can be described by a matrix that has the general form shown in Figure 4. Such
Transform-Invariant Recognition by Association in a Recurrent Network 1511
O1
O2
ξ1 ξ2 ξ3 ξ4 ξ5
O1
ξ1 ξ2 ξ3 ξ4 ξ5
O2
ξ1 ξ2 ξ3 ξ4 ξ5
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
ξ1 ξ2 ξ3 ξ4 ξ5
1 1 1 1 1 1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
Figure 3: A schematic illustration of the first type of associations contributing to the synaptic matrix considered (see the text).
an association matrix might be produced by different views of an object appearing after a given view with equal probability, and synaptic modification occurring of the view with itself (giving rise to the diagonal term), and of any one view with that which immediately follows it.1 The same matrix might be produced not only by pairwise association of successive views because the association rule allows for associations over a short time scale of, say, 100 to 200 ms, but might also be produced if the synaptic trace had an exponentially decaying form over several hundred milliseconds, allowing associations with decaying strength between views separated by one or more intervening views. The existence of a regime, for values of the coupling parameter between pairs of views in a finite interval, such that the presentation of any of the views of one object leads to the same attractor regardless of the particular view chosen as a cue, is the main issue that we deal with here. A related problem we also deal with is the issue of the capacity of this type of synaptic matrix: How many objects can be stored and retrieved correctly in a view-invariant way? As we will show in the model presented here, their number grows linearly with the number of neurons.
1 Strictly speaking the matrices in Figures 3 and 4 do not refer to the same thing. In the first case, it denotes the matrix of synaptic efficacies; in the second, it is the association between two patterns (the matrix X in equation 2.1).
1512
N´estor Parga and Edmund Rolls
O1
O2
ξ1 ξ2 ξ3 ξ4 ξ5
O1
ξ1 ξ2 ξ3 ξ4 ξ5
O2
ξ1 ξ2 ξ3 ξ4 ξ5
1 b b b b
b 1 b b b
b b 1 b b
b b b 1 b
ξ1 ξ2 ξ3 ξ4 ξ5
b b b b 1 1 b b b b
b 1 b b b
b b 1 b b
b b b 1 b
b b b b 1
Figure 4: A schematic illustration of the second and main type of synaptic matrix considered (see the text).
Some of the groundwork for this approach was laid by the work of Amit and collaborators (Griniasty, Tsodyks, & Amit, 1993). A variant of the second approach is to consider that the remaining entries in the matrix shown in Figure 4 all have a small value. This would be produced by the fact that sometimes a view of one object would be followed by a view of a different object, when, for example, a large saccade was made, with no explicit resetting of the trace. On average, any one object would follow another rarely, and so the case is considered when all the remaining associations between pairs of views have a low value. 2 The Model We will consider a simple model to see if it can store a large number of objects in a view-invariant way. In order to have a problem solvable by standard statistical physics techniques (Amit, 1988), we will make several simplifying hypothesis. The N neurons in the model are binary objects. Denoting the state of neuron i by Si , this variable can take only the values +1 and −1. The views of the objects will be chosen as P special states of the network. Different views, even if they belong to the same object, will be taken as uncorrelated random variables. The interpretation of these states is that they are the internal representations in some internal layer of stimuli that might differ in the way an object is presented: different illumination or color,
Transform-Invariant Recognition by Association in a Recurrent Network 1513
different view or scale. In this internal layer, all these possibilities appear as uncorrelated states of the network, and the only relation between these states when they represent the same object is that they will be associated with each other. We will refer to all of these internal states as views, regardless of the real difference in the stimuli they represent. We will also assume that the network is fully connected, and we will take a coding rate (or sparseness) w = 0.5. By convention we will label the views, placing first all those associated with, say, object 1, then all the views defining object 2, and so on. We will denote the internal representation of the νth view by ξ ν , and the state of neuron j in this pattern by ξjν (j = 1, . . . , N) (see Figure 4). For simplicity, we will assume that all the objects are defined by the same number s of views; in this way, the patterns labeled by ν = 1, . . . , s refer to views of the first object, and so forth. The variables ξjν will be selected to be +1 or −1, with probability w = 0.5. All pairs of views of the same object are coupled to each other. We will denote the coupling between the patterns ξ µ and ξ ν as Xµν . Since we are not considering the effect of an association between views of different objects, this matrix is made of blocks of size s × s—one block for each of the Po objects. These considerations lead us to propose the following form of the synaptic matrix, Jij =
s Po 1 X µ ξi Xµν ξjν , 2N µ=1
(2.1)
which, more explicitly, contains a contribution from association between views of the form s Po 1 X µ bµν ξi ξjν , N µ6=ν
(2.2)
where bµν (the off-diagonal elements of Xµν ) is the strength of the association between views ξ µ and ξ ν . The value of the coupling of one view with itself will be taken equal to one. In the case where all pairs of different views of a given object are equally correlated with strength bµν = b for all µ 6= ν, while views of different objects are not correlated, the matrix X takes the block form shown in Figure 4 for s = 5. This is the case we will deal with, and the block matrix of size s × s will be denoted by O(s) . It is natural to think that b is smaller than the strength of the association of one view with itself, that is, b < 1.2 For an arbitrary value of the sparseness the synaptic matrix 2 The choice of equal strength for all the pairs simplifies the numerical analysis; the equations in this section can be easily extended to the general case.
1514
N´estor Parga and Edmund Rolls
given in equation 2.1 is not a good choice. One should use, for example, the one studied by Tsodyks and Feigel’man (1988). An analysis of the capacity properties for other values of w requires finding the explicit solution of the model for the new synaptic matrix. Our purpose is to show that this network has an object phase—that is, a phase where each object is represented by a different attractor of the recurrent network. If all views of the same object are contained in its associated attraction basin, then view-invariant recognition of the object will be achieved when either one of the views or a pattern sufficiently close to one of them is presented as a stimulus. If each object were defined by a single view, this model reduces to the Hopfield model, which can store a number of patterns linear in N. More precisely, in this saturation regime, the number of stored objects grows as αN, where α has to be less than αc = 0.14. In this case, the view is very close to the attractor itself. If this model is modified to contain a finite number of more complex objects (those having several views) in the background of those simple one-view objects, it is reasonable to expect that the complex objects will still define stable states if the coupling between the views is strong enough. The case where all the Po ' O(N) objects are defined by a finite number of views requires a more careful analysis. This can be done with tools from statistical physics that were used some time ago to solve the standard Hopfield model (Amit, Gutfreund, & Sompolinsky 1985; Amit 1988). Before presenting the result of this calculation, let us see how to characterize the behavior of the network for given values of the parameters α and b. The useful quantities for this purpose are the overlaps of the state of the network {Si } (i = 1, . . . , N) with the views of one of the objects (µ = 1, . . . , s): mµ =
N 1 X µ Si ξi . N i=1
(2.3)
If the network is presented with a stimulus close to one of the views, say ξ µ0 , then all these overlaps initially will be very small, except mµ0 , which will be close to 1.0. Over time, these overlaps will change until a fixed point is reached. If the final state is similar to the initial one, then we will say that the network is in the view phase. But it could also happen that the stable state has overlaps with more than one view. When it has similar overlaps with all the patterns representing the s views of the object seen initially, we will say that this point is in the view-invariant object phase, or simply in the object phase. Because of the symmetry between the views, no matter which view is taken as the initial condition of the network, it will always reach the same attractor. Another possibility is that the system will get to a state where all the overlaps are null. Then there is neither view nor object retrieval, and the system is beyond its capacity. The region where this happens is called, by analogy with a similar situation in statistical physics models, a spin-
Transform-Invariant Recognition by Association in a Recurrent Network 1515
glass phase. For the simple model we are proposing here, there are no more complex solutions because of the symmetry of the synaptic matrix. Let us continue by defining the free energy of a system with Hamiltonian, H=
N X
Jij Si Sj .
(2.4)
i6= j
The free energy is then computed as: f =−
1 1 lim hhlog TrS exp[−βH]iiξ , β N→∞ N
(2.5)
where the angular brackets denote the average over the variables ξ and the symbol TrS refers to a sum over all possible states of the network. An extra parameter, β, has been introduced that acts as the inverse of a temperature. At the end, however, we will consider only the zero temperature limit. This means that the spike emission dynamics is given simply by N X Jij Sj (t) . Si (t + 1) = sign
(2.6)
j6=i
In the thermodynamic description given by the free energy f , the state of the system is described in terms of macroscopic quantities. This means that instead of the overlaps defined in equation 2.3, where the microstate {Si }i=1,...,N is used, one should use their thermal averages. These are computed by averaging over all the microstates weighed with the Boltzmann distribution that appears in equation 2.5. From now on, we will use mµ to denote these overlaps. More explicitly, denoting the thermal average with single angular brackets, we have mµ =
N 1 X µ hSi iξi . N i=1
(2.7)
To check the existence of a transition from the view to the object phase, one has to look for states such that their overlap with all the views of a given object are, in principle, mµ ∼ O(1), √ while the overlaps with the views of other objects are zero (actually O(1/ N)). In terms of these macroscopic quantities, the view phase is characterized by mµ = 0 for µ 6= µ0 . On the other hand, in the object phase, these s overlaps are nonzero and equal. It turns out that these order parameters are not enough to describe the behavior √ of the network properly. This is because there are O(N) overlaps O(1/ N) with the views of the other objects. This effect is taken into account
1516
N´estor Parga and Edmund Rolls
by an order parameter, r, which gives the mean square overlap of all the views of the other objects: r=
1 X hhmµ (X2 )µ ν mν iiξ . α µ ν>s
(2.8)
Finally one also has to consider the possibility that the mµ ’s are zero, but the order parameter defined as * + N 1 X 2 hSi i (2.9) q= N i=1 ξ
is not zero. Another relevant point is that the mean-field solution of this model is exact. This is because the network is fully connected. Then the order parameters satisfy a set of mean-field, self-consistent equations. In order to find them, one first has to compute and express the free energy in terms of the order parameters and then extremize it with respect to them. The evaluation of the free energy of this kind of system is now standard (Mezard, Parisi, & Virasoro, 1987). The free energy and the mean-field equations have been obtained for a general coupling matrix X (although with a different motivation) by Cugliandolo and Tsodyks (1994). (See appendix I of their work for details of the calculation.) Here we describe just the main steps of the algebra and present the results the free energy and for the meanfield equations of our problem (that is, for the case where X decomposes in Po s × s blocks as in Figure 4). Very briefly, the computation proceeds as follows: One first uses the representation of the logarithm zn − 1 n→0 n
log z = lim
in equation 2.5. Then each of the n factors in f is identified with a “replica” of the network. Notice that all of them share the same set of internal representations of the views. At this point the quenched average over these variables can be done easily. Now the free energy can be written in terms of integrals over the order parameters q, r, and mµ (only the O(1) overlaps, that is, µ ≤ s). The final step is to make the ansatz that the problem is symmetric under permutations of the n replicas. Then the remaining integrals can be readily solved in the large N limit (keeping α fixed) by the saddle point method. After some lengthy but straightforward algebra, one obtains that the free energy at zero “temperature” (i.e., for β −1 = 0) is given by: à ¶ÀÀ ¿¿ µ s X σ 1 (s) α+ mγ Oγ λ mλ + α r c + J(q) − 2 σ erf √ f = 2 2αr ξ γ ,λ=1
Transform-Invariant Recognition by Association in a Recurrent Network 1517
r −
αr 2π
¿¿
µ
σ2 exp − 2αr
¶ÀÀ ! .
ξ
(2.10)
Here Oγ(s)λ is the matrix element (γ , λ) of O(s) ; erf (x) denotes the error function, and c is (1 − q)β in the large β limit. The angular brackets indicate an average over the views. The function J(q) is given by · ¸ 1−b 1−b+sb α + (s − 1) , (2.11) J(q) = − s 1 − c (1 − b + s b) 1 − c (1 − b) and σ is defined as s X
σ =
γλ
mγ Oγ(s)λ ξ λ .
(2.12)
For given values of the load parameter α and the strength of the association between a pair of views of the same object b, the values of the overlaps of the stationary state of the network with the representation of the s views of the object are given by the solution of the following equations: ** mν = r c=
(s) λ !++ γ λ mγ Oγ λ ξ
à Ps ξν erf
2 παr
**
√ 2αr
(2.13)
ξ
ÃP (s) λ !2 ++ s γ λ mγ Oγ λ ξ exp − √ 2αr
(2.14)
ξ
· ¸2 · ¸2 1−b+sb 1−b (s − 1) 1 + . r= s 1 − c (1 − b + s b) s 1 − c (1 − b)
(2.15)
These equations describe the retrieval properties of one object (let’s say object A0 ) immersed in a background of another Po = α0 N objects (here α0 = α/s is the load parameter relevant for objects); the effect of these other objects is contained in the parameter r (this is computed trivially from the last equation once c is known), which gives the mean square overlap of all the views of the other objects. The network might also have more complicated states, say, a state given by a mixture of two objects. These states are not described by the equations above; however, they are not relevant if the cue is sufficiently close to one of the views of object A0 . 3 Results The simplest way to look for solutions of these equations is to use an iterative procedure; for instance, for the s overlaps mµ , we have (the time t denotes
1518
N´estor Parga and Edmund Rolls
Table 1: Capacity (Critical Value of the Load Parameter α = α0 s) of the Object Phase for Several Values of the Number of Views s. s 3 5 7 9 11
the iteration number): ** mµ (t + 1) =
ξµ erf
α0 s 0.087 0.081 0.077 0.076 0.073
(s) λ !++ γ λ mγ (t)Oγ λ ξ
à Ps
√ 2αr
.
(3.1)
ξ
As an initial condition, the overlaps are taken mµ = 0, except one of them, for example, mµ0 , which is given a value close to one. The iterations should proceed until a fixed point of this dynamics is reached. However, we have noticed that this naive method does not work, it either fails to reach a fixed point or if it does find one, one cannot be sure of its stability. For this reason, we have preferred to work directly with the equations that result from performing a steepest descent of the free energy. The method is briefly described in the appendix. The solution of equations 2.13 through 2.15 shows that the object phase is indeed present in a region of the parameter space defined by α and b. As expected, it appears when the coupling between views b is strong enough to destabilize the view phase and α is small enough to prevent the noise coming from the storage of an O(N) number of objects from spoiling the retrieval of one of them. For a fixed value of b, one can compute the value of α where the transition from the object to the spin-glass phase takes place. The result is shown in Table 1 for b = 0.80 for several values of s, up to 11 views per object. Note that the capacity of the network decreases slowly as s increases. The complete phase diagram of the model is presented in Figure 5 for a fixed number of views (s = 5). The view phase where the views themselves define attractors appears at the left lower corner of the figure; these states disappear at large α when the network makes the usual transition to the spin-glass phase and at large b where the system reaches the object phase. The capacity in terms of the number of objects that can be stored in the network is given by the value of the almost horizontal line in Figure 5 divided by s. It is interesting to note that the object phase exists even for b smaller than the values shown in this figure. It is present even for b = 0. The standard Hopfield model at low values of 1/β already has a set of stable symmetric phases where the state of the system has a uniform overlap with an odd number of patterns. In that case, however, they are not relevant because
Transform-Invariant Recognition by Association in a Recurrent Network 1519 0.14 0.12
alpha
0.1
III
0.08 0.06 0.04 0.02
I
II
0 0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
b
Figure 5: The phase diagram of the model for s = 5. Region I is the standard view phase, where the views themselves are retrieved correctly. In this region, objects are not stored in a view-invariant fashion. Region II is the phase where objects are stored as attractors; any stimulus close to some of the views will elicit the same response. Finally in region III, the spin-glass phase, no storage of information is possible.
one is interested in the retrieval of the patterns (which we here call the views), and the symmetric states do not show up when the stimulus is close to one of them. As b is increased, it will reach a value where one pattern cannot support an attractor by itself and then the symmetric (object) states will emerge, giving rise to the (object) phase we were looking for. One consequence of this scenario is that the model can support only objects defined by an odd number of views; however, one expects this to be the result of the artificial aspects of the model. In fact, as is discussed in Amit et al. (1985), the properties of these solutions depend on the distribution of the patterns ξjν . In a more realistic model, with the state of neurons defined in terms of their firing rates, we expect this restriction to disappear. 4 Discussion In this work, we have shown that invariant object recognition is feasible in attractor neural networks. The system is able to store and retrieve in a view-invariant way an extensive number of objects, each defined by a finite set of views. What is implied by “extensive” is that the number of objects is proportional to the size of the network. The crucial factor that defines
1520
N´estor Parga and Edmund Rolls
this size is the number of connections per neuron. In the case of the fully connected networks considered in this article, the size is thus proportional to the number of neurons. To be more specific, the number of objects that can be stored is 0.081 N/5, when there are 5 views of each object. The number of objects is 0.073 N/11, when there are 11 views of each object. This is an interesting result in network terms, in that s views, each represented by an independent random set of active neurons, can, in the network described, be present on the same “object” attraction basis. It is also an interesting result in neurophysiological terms; the number of objects that can be represented in this network scales linearly with the number of recurrent connections per neuron. Although the explicit numerical calculation was done for a rather small number of views per object (up to 11), the basic result that the network can support this kind of phase is expected to hold for any number of them (the only requirement being that it does not increase with the number of neurons). This is, of course, enough. Once an object is defined by a set of views, when the network is presented with a somewhat different stimulus, or an interpolation of some of the views, or a noisy version of one of them, it will still be in the attraction basin of the object attractor. Some of the assumptions taken to simplify the problem are not relevant in what regards the existence of this regime. This is the case of the assumption that all pairs are coupled to each other with the same strength. Choosing them with a certain distribution will only change the details of where the object phase appears. Here we have been mainly concerned with the retrieval properties of the network. Because of that, we started our analysis from a synaptic matrix where the associations between the views had already been learned. One can consider, however, how this matrix is built through a learning procedure that takes place as one object is seen from different perspectives or under different conditions. Learning could proceed as follows: since a fixed view is normally seen for longer times than transitions to another view, it is natural to assume that associations of one view with itself will be built up first, while those between different views appear later. This means that the association between views will happen in the presence of attractors for each view. This is possible because at this stage of learning, the strength of the association between pairs is still small, and, as we have shown, the network will sit in a view phase where the attractors are defined by the views themselves. Under these conditions, when the stimulus corresponds to, say, view v1 , the network will reach its attractor and will maintain sustained activity for about 300 ms (Rolls & Tovee, 1994). But within this time scale, the object can already present a different view v2 as the input stimulus. For a short period of time, the neuronal firing may be in a state that reflects both v1 and v2 . During this time, the appropriate synaptic modification could take place. After these two views are seen consecutively several times, the contribution made by this particular process to the synaptic efficacies will be higher than
Transform-Invariant Recognition by Association in a Recurrent Network 1521
the threshold required to make a transition to the object phase, and views will start to be recognized as a single object. For a different learning protocol, this mechanism has been studied in some detail by Amit and Brunel (1995) and Brunel (1996). The model emphazises the role of the attractors to implement the association between views, but other neurophysiological mechanisms could also participate to reinforce the association. The unbinding of glutamate by the NMDA receptors lasting for about 100 ms could produce this effect. Also the trace rule used in Foldiak (1991) and Wallis and Rolls (1997) is just one possible way to describe any of these memory effects mathematically. A study of invariant object representations using recurrent networks has also been done by O’Reilly and Johnson (1994). Although this is interesting work, the analysis is mainly numerical and the capacity properties of the model are not fully understood. Invariant representation of faces in the context of attractor neural networks has been discussed by Bartlett and Sejnowski (1996) in terms of a model where different views of faces are presented in a fixed sequence (Griniasty et al. 1993). This is not, however, the general situation; normally any pair of views can be seen consecutively, and they will become associated. The most general version of the model presented in this work contemplates this possibility.3 The synaptic matrix in equation 2.1 refers to this situation, although later we found it more convenient to study the simpler case, where all the views are coupled to each other with the same strength. We wish to note the different nature of the invariant object recognition problem studied here and the paired-associate learning task studied in Miyashita and Chang (1988), Miyashita (1988), and Sakai and Miyashita (1991). In the invariant object recognition case, no particular learning protocol is required to produce an activity of the inferotemporal cells responsible for invariant object recognition maintained for 300 ms. The learning can occur rapidly, and the learning takes place between stimuli (e.g., different views) that occur with no intervening delay. In the paired-associate task, the monkeys must learn to associate together two stimuli that are separated in time (by a number of seconds), and this type of learning can take weeks. During the delay period, the sustained activity is rather low in the experiments, and thus the representation of the first stimulus that remains is weak and can be associated with the second stimulus only poorly. Formally, however, the learning mechanism could be treated in the same way as we have used here for invariant object recognition. The experimental difference is that in the paired-associate task used by Miyashita and Chang (1988), it is the weak memory of the first stimulus that is associated with the second
3 After this work was submitted, we found that in a more recent publication, Bartlett and Sejnowski (1997) relaxed the condition on the presentation of the views in a fixed sequence. In this respect, their recurrent synaptic matrix is closer to ours.
1522
N´estor Parga and Edmund Rolls
stimulus. In contrast, in the invariance learning, it would be the firing activity being produced by the first stimulus (not the weak memory of the first stimulus) that can be associated together. The mechanisms described here would apply most naturally when a small number of representations need to be associated together to represent an object. One example is associating together what is noted when an object is seen from different perspectives. Another example is scale, with respect to which neurons early in the visual system tolerate scale changes of approximately 1.5 octaves, so that the whole scale range could be covered by associating together a limited number of such representations (see Rolls, 1994, 1996b). The mechanism would not be so suitable when a large number of different instances would need to be associated together to form an invariant representation of objects, as might be needed for translation invariance. For the latter, we propose a solution in a multilayer network, with a local solution being implemented at each stage (Rolls, 1994, 1995; Wallis & Rolls, 1996). We have envisaged the local solution for translation invariance at each stage as being performed by a trace rule implemented between the inputs to a stage and the postsynaptic neurons in a stage. However, both types of mechanism, implemented in the feedforward connections or in the recurrent collateral connections, could contribute (separately or together) to achieving invariant representations. Part of the interest of the approach described in this article is that it allows analytic investigation, and this is what we have introduced here. Appendix Here we explain how equations 2.13 through 2.15 were solved. A straightforward way to perform a steepest descent on the free energy given in equation 2.10 is to update the order parameters according to the variation of the free energy with respect to mµ , r, and c. The complication of this approach is that it yields P + 2 coupled equations, even when we know that r and c are trivially related at the fixed point, as equation 2.15 shows. One can make convergence faster by imposing this relation as a constraint on the dynamics by means of a Lagrange multiplier. The algebra is rather simple, and one finally finds the following equations for the updating of mµ , r, and c: "
¶À # σ δmν = −η(1 − a) mν − ξ erf √ 2αr ξ " ¶À # µ ¿ σ + a M − Yξ erf √ 2αr ξ " r ¿ µ ¶À # σ2 2 α c− exp − +ηλ δr = −η 2 παr 2αr ξ ¿
ν
µ
(A.1) (A.2)
Transform-Invariant Recognition by Association in a Recurrent Network 1523
δc = −η
δK(c) α . [r − K(c)] − η λ 2 δc
(A.3)
In these equations we have defined: M=
s X
mλ
(A.4)
ξλ
(A.5)
λ=1
Yξ =
s X λ=1
K(c) =
· ¸2 · ¸2 1−a+sa 1−a (s − 1) 1 + , s 1 − c(1 − a + s a) s 1 − c(1 − a)
(A.6)
and λ is chosen in such a way that if r = K(c) holds at a given time (in particular at t = 0) then it also holds at the next time step: r + δr = K(c + δc).
(A.7)
Although it is correct to solve equations A.1 through A.3 to see if the invariant-recognition phase is reached upon presentation of one view, the connection between these equations and the dynamics defined by equation 2.6 is subtle. Equation 2.6 refers to the evolution of a microscopic state. It decreases systematically the value of the Hamiltonian, equation 2.4. On the other hand, equations A.1 through A.3 decrease the free energy. Strictly speaking, the fixed point is not the same. This is because once a number O(N) of views have been stored in the synaptic matrix, there will be many metastable states with a large (i.e., close to one) overlap with a given stored view. Equations A.1 through A.3 reach a fixed point that describes the average properties of these states highly correlated to the stored views. Equation 2.6 reaches one of these states. If the initial condition is modified (but still starting from a state close to one of the views), the final state will probably be a different one, although it will still have a large overlap with the stored pattern. A study of the number of metastable states in an associative memory network has been done by Gardner (1986). Acknowledgments This research was supported by the Medical Research Council, PG8513790, by an E. U. Human Capital and Mobility grant CHRX-CT92-0063, and by a Spanish grant PB96-47. References Amit, D. (1988). Modeling brain function. Cambridge: Cambridge University Press.
1524
N´estor Parga and Edmund Rolls
Amit, D., & Brunel, N. (1995). Learning internal representations in an attractor neural network with analogue neurons. NETWORK, 6, 359–388. Amit, D., Gutfreund, H., & Sompolinsky, H. (1985). Spin-glass models of neural networks. Phys. Rev., A 32, 1007–1018. Bartlett, M. S., & Sejnowski, T. J. (1996). Learning viewpoint invariant representations of faces in an attractor network. Communication presented at the 18th Cognitive Science Meeting, San Diego, CA. Bartlett, M. S., & Sejnowski, T. J. (1997). Viewpoint invariant face recognition using independent component analysis and attractor networks. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9, Cambridge, MA: MIT Press. Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation, 8, 1677. Cugliandolo, L. F., & Tsodyks, M. V. (1994). Capacity of networks with corrrelated attractors. Journal of Phsyics A, 27, 741–755. Foldiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 193–199. Gardner, E. (1986). Structure of metastable states in the Hopfield model. Journal Physics, A 19, L1047–L1052. Gardner, E. (1988). The space of interactions in neural network models. Journal Physics, A 21, 257–270. Griniasty, M., Tsodyks, M. V., & Amit, D. J. (1993). Conversion of temporal correlations between stimuli to spatial correlations between attractors. Neural Computation, 5, 1. Gross, C. G., Desimone, R., Albright, T. D., & Schwartz, E. L. (1985). Inferior temporal cortex and pattern recognition. Exp. Brain Res. Suppl., 11, 179–201. Hopfield, J. J. (182). Neural networks and physical systems with emergent collective computational abilities. Natl. Acad. Sci. USA, 79, 2554–2558. Mezard, M., Parisi, G., & Virasoro, M. A. (1987). Spin glass theory and beyond. Singapore: World Scientific. Miyashita, Y. (1988). Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820. Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70. O’Reilly, R. C., & Johnson, M. H. (1994). Object recognition and sensitive periods: A computational analysis of visual imprinting. Neural Computation, 6, 357– 389. Rolls, E. T. (1984). Neurons in the cortex of the temporal lobe and in the amygdata of the monkey with responses selective for faces. Human Neurobiology, 3, 209– 222. Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Phil. Trans. Roy. Soc., 335, 11–21. Rolls, E. T. (1994). Brain mechanisms for invariant visual recognition and learning. Behavioural Processes, 33, 1134–138. Rolls, E. T. (1995). Learning mechanisms in the temporal lobe visual cortex. Behavioural Brain Res., 66, 177–185.
Transform-Invariant Recognition by Association in a Recurrent Network 1525 Rolls, E. T. (1996a). Roles of long term potentiation and long term depression in neural network operations in the brain. In M. S. Fazeli & G. L. Collingridge (Eds.), Cortical plasticity: LTP and LTD. Oxford: Bios. Rolls, E. T. (1996b). A neurophysiological and computational approach to the functions of the temporal lobe cortical visual areas in invariant object recognition. In L. Harris & M. Jenkin (Eds)., Computational biological mechanisms of visual coding. Cambridge: Cambridge University Press. Rolls, E. T., Booth, M. C. A., & Treves, A. (1996). View-invariant representations of objects in the inferior temporal visual cortex. Society for Neuroscience Abstracts, 22. Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex, and the neurophysiology of visual backward masking. Proc. Roy. Soc., B 257, 9–15. Rolls, E. T., Tovee, M. J., Purcell, D. G., Stewart, A. L., & Azzopardi, P. (1994). The responses of neurons in the temporal cortex of primates, and face identification and detection. Exp. Brain Res., 101, 474–484. Rolls, E. T., & Treves, A. (1997). Neuronal networks and brain funciton. Oxford: Oxford University Press. Sakai, K., & Miyashita, Y. (1991). Neural organisation for the long-term memory of paired associates. Nature, 354, 152. Tanaka, K., Saito, C., Fukada, Y., & Moriya, M. (1990). Integration of form, texture, and color information in the inferotemporal cortex of the Macaque. In E. Iwai & M. Mishkin (Eds.), Vision, memory and the temporal lobe (pp. 101–109). New York: Elsevier. Tsodyks, M. V., & Feigel’man, M. V. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Lett., 6, 101–105. Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194. Wallis, G., Rolls, E. T., & Foldiak, P. (1993). Learning invariant responses to the natural transformations of objects. In International Joint Conference on Neural Networks (Vol. 2, pp. 1087–1090). Received February 4, 1997; accepted February 12, 1998.
LETTER
Communicated by DeLiang Wang
Retrieval Dynamics in Oscillator Neural Networks Toshio Aoyagi Katsunori Kitano Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto 606–8501, Japan
We present an analytical approach that allows us to treat the long-time behavior of the recalling process in an oscillator neural network. It is well known that in coupled oscillatory neuronal systems, under suitable conditions, the original dynamics can be reduced to a simpler phase dynamics. In this description, the phases of the oscillators can be regarded as the timings of the neuronal spikes. To attempt an analytical treatment of the recalling dynamics of such a system, we study a simplified model in which we discretize time and assume a synchronous updating rule. The theoretical results show that the retrieval dynamics is described by recursion equations for some macroscopic parameters, such as an overlap with the retrieval pattern. We then treat the noise components in the local field, which arise from the learning of the unretrieved patterns, as gaussian variables. However, we take account of the temporal correlation between these noise components at different times. In particular, we find that this correlation is essential for correctly predicting the behavior of the retrieval process in the case of autoassociative memory. From the derived equations, the maximal storage capacity and the basin of attraction are calculated and graphically displayed. We also consider the more general case that the network retrieves an ordered sequence of phase patterns. In both cases, the basin of attraction remains sufficiently wide to recall the memorized pattern from a noisy one, even near saturation. The validity of these theoretical results is supported by numerical simulations. We believe that this model serves as a convenient starting point for the theoretical study of retrieval dynamics in general oscillatory systems. 1 Introduction Collective oscillatory behavior has been observed in many biological neuronal systems—the primary visual cortex, olfactory bulb, hippocampus, and so on. In particular, recent experimental observations suggest that the temporal coherence of neuronal oscillatory activity may contribute to information processing in real biological systems (Gray, Konig, Engel, & Singer, 1989). To grasp the essential dynamics of the systems, numerous theoretical models have been proposed and analyzed by many authors (von der Malsburg & Schneider, 1986; Sompolinsky, Golomb, & Kleinfeld, 1991; Schillen & c 1998 Massachusetts Institute of Technology Neural Computation 10, 1527–1546 (1998) °
1528
Toshio Aoyagi and Katsunori Kitano
Konig, ¨ 1994). Among the models used, the oscillator neural network is particularly attractive. It provides a suitable framework that is simple enough to be mathematically tractable. This simplicity enables us to obtain interesting analytic results. Moreover, we expect that such results shed light on the role of the oscillatory behavior in real neuronal systems. Although there are many interesting and more realistic models to be explored, here we focus on the oscillator neural network model. Many theoretical works concerning oscillator neural networks have been reported by a number of authors. In the context of these works, let us start with a brief review of the theoretical results with regard to autoassociative memory. The potentiality of oscillator networks in associative memory has been examined with the use of the phase-locking mechanism (Abbot, 1990; Wang, Buhmann, & von der Malsburg, 1990) and fixed-point behavior (Fukai & Shiino, 1995). A neural network of coupled phase oscillators with nontrivial frequency distribution has been analyzed in the low loading limit (Arenas & Vincente, 1994) and in the case of finite storage capacity (Park & Choi, 1995). This work has shown that under suitable conditions, the phase locking can serve as a mechanism for associative memory, even if the natural frequencies of the oscillators are distributed. In this model, however, the learning rule is slightly different from that in the model we consider. There are several theoretical works directly related to our model. In an equivalent model associated with spin glasses, Cook (1989) showed that the storage capacity is αc = 0.038, provided that the Hebbian rule with random phase patterns is used. An oscillator network with sparse connectivity has been analyzed by Noest (1988). Recently we reported that in some sense, the oscillator network is more robust against damage of synaptic connections than is the Hopfield model (Aoyagi & Kitano, 1997). However, there remain many unsolved problems with regard to oscillator neural networks. For example, there is no dynamical theory with an extensive number of stored patterns in fully connected oscillator networks. On the other hand, in the case of traditional neural networks, since the equilibrium properties of the associative model came to be understood (Hopfield, 1982; Amit, Gutfreund, & Sompolinsky, 1988), the dynamical theory to treat the recalling process has progressed gradually. For synchronous updating, a rigorous treatment of the recurrent feedback effect has been examined by Gardner, Derrida, and Mottishaw (1987). As a rougher but practical approximation, Amari and Maginu (1988) have proposed a statistical neurodynamical theory for the long-term behavior of the recalling process. This theory has been recently refined and extended by Okada (1995). For asynchronous updating, Coolen and Sherrington (1993) have proposed a dynamical theory to describe the fully connected Hopfield model near saturation. The aim of this article is to study the retrieval dynamics of oscillator neural networks. For this purpose, we attempt to apply the dynamical theory to oscillator models. We employ the same approach as that in Okada’s theory
Retrieval Dynamics in Oscillator Neural Networks
1529
for the analysis of the recalling process. We consider two types of associations: the autoassociation in which the network retrieves a pattern with a static phase relationship and the temporal association in which the network retrieves a sequence of some phase patterns. In both cases, we show that the retrieval dynamics are approximately governed by recursion relations in terms of macroscopic parameters. By using a suitably simplified model of the oscillatory neuronal systems, we find it is possible to develop an analytical theory of the recalling dynamics. From this point of view, we introduce and describe our model in the next section. 2 A Model of Oscillator Neural Networks Let us first consider the situation in which N periodic firing neuronal systems are coupled to each other. In general, such a system can be described by evolution equations involving a set of state variables—for instance, a membrane potential and several ionic leak currents. Under suitable conditions, such a coupled system can be reduced to a system of simple coupled oscillators. Therefore, the state of the ith system can be characterized by a single variable, φi . This quantity, φi , is usually referred to as the phase, which represents the timing of the neuronal spikes. The reduced phase equations take the general form (Kuramoto, 1984) N X dφi = ωi + 0ij (φj − φi ), dt j=1
(2.1)
where ωi is the frequency of the ith neuron and 0ij represents the effect of the interaction between the ith and jth neurons. System 2.1 is invariant under uniform phase translation, φi → φi + φ0 , where φ0 is an arbitrary real constant. Also, 0ij (φ) is a 2π -periodic function of φ. To be specific, we assume that all frequencies are equal to Ä and that 0ij (φ) is approximated by the lowest mode of the Fourier components. Eliminating Ä by applying the transformation φi → φi + Ät, the model equations (see equation 2.1) become N X dφi = Jij sin(φj − φi + βij ), dt j=1
(2.2)
where Jij and βij are parameters representing the effect of the interaction. Using the complex representation Wi = exp(iφi ), we obtain the alternative form (Noest, 1988) 1 dWi = (hi − h˜i Wi2 ), dt 2
hi =
N X j=1
Cij Wj ,
(2.3)
1530
Toshio Aoyagi and Katsunori Kitano
where Cij = Jij exp(iβij ) and h˜i denotes the complex conjugate of hi . We now consider the case that the system is in a stable stationary state. Putting dWi /dt = 0, it is found that such states satisfy the conditions Wi = hi /|hi |. In this model, the outputs of all neurons are continuously and simultaneously changing according to equations 2.3. On the other hand, the various dynamical theories in the case of traditional neural networks have been greatly advanced over the past decade. However, when we apply such theories to oscillator neural networks and attempt an analytical treatment of the dynamics, we encounter difficulties owing to the updating rule. Therefore, to make the dynamics (see equation 2.3) more mathematically tractable, we discretize time and assume the synchronous updating rule. Furthermore, considering the fact that all neurons relax toward the state in which the relation Wi = hi /|hi | is satisfied, it is natural to adopt the following dynamics:
Wi (t + 1) =
hi (t) , |hi (t)|
hi (t) =
N X
Cij Wj (t).
(2.4)
j=1
Here, the quantity X(t) represents the value of the X at time t. Throughout this article, we treat the above model analytically. Since these dynamics are not equivalent to those of equation 2.2, the behavior they describe may in general be different. In fact, for certain realizations of equation 2.2, the structure of phase space suffers a qualitative change under the simplification to equation 2.4. Fortunately, however, we have found that the solutions of equation 2.4 corresponding to such cases do not appear in the simulation of this equation under the conditions we are considering. This point will be brought up again in the last section. Also, if we use asynchronous updating, the equilibrium states of equation 2.4 are equivalent to those of the original phase model. This can be easily shown by considering the noiseless limit in a statistical mechanics treatment. In this sense, equation 2.4 can be thought of as a synchronous updated version of the oscillator neural network with discrete time. In traditional neural networks, the analytical approach for synchronous updating has contributed to the understanding of retrieval dynamics in the more general cases. Therefore, we believe that this model serves as a convenient starting point for the theoretical study of retrieval dynamics in general oscillatory systems. In fact, we will find that the basins of attraction in the two models are qualitatively very similar. Let us denote a set of patterns to be memorized as complex variables µ µ ξi = exp(iθi ) (µ = 1, . . . , P), where P is the total number of patterns. For µ simplicity, we assume that the parameters θi are chosen at random from a uniform distribution between 0 and 2π . As usual, the load parameter α is defined by α = P/N. The overlap Mtµ between the state of the system and
the pattern µ at time t is given by

\[
M_\mu(t) = m_\mu(t)\,e^{i\varphi_\mu(t)}
         = \frac{1}{N}\sum_{j=1}^{N} \tilde{\xi}_j^{\mu}\,W_j(t). \tag{2.5}
\]
In practice, the correlation of the system with the µth pattern at time t is measured by the amplitude component, m_µ(t) = |M_µ(t)|. To realize the function of an autoassociative (or content-addressable) memory, we must define the synaptic efficacies appropriately. As a natural choice, we assume that the synaptic efficacies are given by the generalized Hebbian rule:

\[
C_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \xi_i^{\mu}\,\tilde{\xi}_j^{\mu}. \tag{2.6}
\]
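To make the model concrete, the retrieval dynamics of equations 2.4 through 2.6 can be simulated in a few lines. The following Python sketch is ours, not the authors' code; the system size, load, and jitter level are illustrative choices:

```python
# Minimal sketch of the discrete-time oscillator network of equations
# 2.4-2.6: store P random phase patterns, start from a jittered version of
# pattern 1, and iterate the synchronous update while tracking the overlap.
import numpy as np

rng = np.random.default_rng(0)
N, P = 1000, 30                                   # load alpha = P/N = 0.03
xi = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(P, N)))   # stored patterns

# Generalized Hebbian couplings, C_ij = (1/N) sum_mu xi_i^mu conj(xi_j^mu)
C = xi.T @ xi.conj() / N

# Initial state: pattern 1 with jittered spike timings (a noisy overlap)
W = xi[0] * np.exp(1j * rng.normal(0.0, 1.2, size=N))

for t in range(15):
    h = C @ W                                     # local fields
    W = h / np.abs(h)                             # synchronous update, eq. 2.4
    m = np.abs(xi[0].conj() @ W) / N              # overlap amplitude, eq. 2.5
    print(t + 1, round(float(m), 3))
```

With these values the overlap typically grows toward m ≈ 1 within a few updates, which is the phase-correction process illustrated in Figure 1.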
Note that rotational symmetry is not broken by this rule. Because of this symmetry, all patterns generated by the uniform phase translation ξ_i^µ exp(iφ_0) represent the same pattern as ξ_i^µ. This stems from the fact that information is encoded not by absolute time but by the relative timing of spikes. Using this learning rule, it is expected that an initial noisy phase pattern can be corrected dynamically, as illustrated in Figure 1. Finally, under this learning rule, the dynamics of equation 2.2 can be written in the gradient form

\[
\frac{d\phi_i}{dt} = -\frac{\partial E}{\partial \phi_i},
\qquad
E = -\frac{1}{2}\sum_{j=1}^{N}\sum_{k=1}^{N} J_{jk}\,\cos(\phi_j - \phi_k + \beta_{jk}). \tag{2.7}
\]
This can be easily proved from the fact that the matrix of synaptic efficacies C_ij is self-adjoint, that is, J_ij = J_ji and β_ij = −β_ji. Consequently, the system converges to an equilibrium state in which dφ_i/dt = 0, and thus M_µ(t) is constant. For dynamics of the general form (see equation 2.1), Hoppensteadt and Izhikevich (1997) have proved that if Γ_ij(φ_j − φ_i) = −Γ_ji(φ_i − φ_j) and the ω_i are identical, the system (see equation 2.1) converges to a limit cycle attractor. The above result for the system (see equation 2.2) can be regarded as a special case of their proof.

3 Theoretical Results and Simulations

We now consider the situation in which the network is trying to retrieve the pattern ξ_i^1. We assume that M_1(t) is of order unity, while each of the M_µ(t) (µ = 2, . . . , P) is of order 1/√N. Hereafter, we omit the subscript 1 and write M(t) (= m(t)e^{iϕ(t)}) for M_1(t). The local fields h_i(t) can be written as

\[
h_i(t) = M(t)\,\xi_i^{1} + z_i(t)
       = m(t)\,e^{i(\varphi(t)+\theta_i^{1})} + z_i(t), \tag{3.1}
\]
Figure 1: Retrieval process of a phase pattern in the oscillator neural network. In an initial noisy pattern, the relative timing of the spikes is disturbed from the memorized positions (dashed lines). These relative phases are corrected dynamically in the final state.
where z_i(t) is defined by

\[
z_i(t) = \frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
\xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,W_j(t). \tag{3.2}
\]
The first term acts to retrieve the pattern, while the second term can be thought of as the noise arising from the other learned patterns. The essential point in this method is the treatment of the second term as a complex gaussian noise characterized by

\[
\langle z_i(t) \rangle = 0, \qquad \langle |z_i(t)|^2 \rangle = 2\sigma(t)^2. \tag{3.3}
\]
In other words, using the notation z_i(t) = x_i(t) + iy_i(t), we assume that ⟨x_i(t)²⟩ = ⟨y_i(t)²⟩ = σ(t)² and ⟨x_i(t)y_i(t)⟩ = 0. We first consider the overlap at the next time step. From the definitions in equations 2.5 and 2.4, we obtain

\[
M_1(t+1) = \frac{1}{N}\sum_{j=1}^{N} \tilde{\xi}_j^{1}\,
\frac{m(t)\,e^{i(\varphi(t)+\theta_j^{1})} + z_j(t)}
     {\bigl|m(t)\,e^{i(\varphi(t)+\theta_j^{1})} + z_j(t)\bigr|}. \tag{3.4}
\]
Since the system is invariant under uniform phase transformation, there is no explicit force corresponding to changing ϕ(t). Moreover, for the original dynamics (equation 2.2), the system finally converges to an equilibrium
state in which ϕ(t) is constant. Therefore, it is expected that ϕ(t) is almost constant, so let us set ϕ(t) = ϕ_0. In fact, we have confirmed this assumption numerically. However, if the system fails to recall the pattern, the assumption does not hold. Using this assumption, we find that the overlap at time t + 1 is given by

\[
m(t+1) = \frac{1}{N}\sum_{j=1}^{N}
\frac{m(t) + z_j(t)\,e^{-i(\varphi_0+\theta_j^{1})}}
     {\bigl|m(t) + z_j(t)\,e^{-i(\varphi_0+\theta_j^{1})}\bigr|}
= \left\langle\!\!\left\langle \frac{m(t)+z(t)}{|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)}, \tag{3.5}
\]
where ⟨⟨· · ·⟩⟩_{z(t)} represents an average over the complex gaussian z(t) with mean 0 and variance 2σ(t)². Here we have used the self-averaging property and the fact that the statistics of the noise z_j(t)e^{−i(ϕ_0+θ_j^1)} are the same as those of z_j(t), due to rotational symmetry. Next we proceed to calculate the variance σ(t + 1). In the local field, we can explicitly separate out the noise term with µ = ν as

\[
h_i(t) = m(t)\,e^{i(\varphi_0+\theta_i^{1})}
+ \frac{1}{N}\sum_{\mu \neq 1,\nu}^{P}\sum_{j=1}^{N}
\xi_i^{\mu}\,\tilde{\xi}_j^{\mu}\,W_j(t)
+ \frac{1}{N}\sum_{j=1}^{N} \xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,W_j(t). \tag{3.6}
\]
Noting that the last term is small (of order 1/√N), we can expand to first order when W_i(t+1) = h_i(t)/|h_i(t)| is calculated. By definition (see equation 3.2), after some calculation we find that the noise z(t + 1) is given by

\[
\begin{aligned}
z_i(t+1) &= \frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
            \xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j(t)}{|h_j(t)|} \\
&= \frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
   \xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j^{\nu}(t)}{|h_j^{\nu}(t)|}
 + z_i(t)\,\frac{1}{N}\sum_{j=1}^{N}\frac{1}{2|h_j^{\nu}(t)|} \\
&\quad - \frac{1}{N^2}\sum_{\nu \neq 1}^{P}\sum_{j,k=1}^{N}
   \xi_i^{\nu}\,(\tilde{\xi}_j^{\nu})^2\,\xi_k^{\nu}\,\widetilde{W}_k(t)\,
   \frac{h_j^{\nu}(t)}{2\tilde{h}_j^{\nu}(t)\,|h_j^{\nu}(t)|},
\end{aligned} \tag{3.7}
\]
where

\[
h_i^{\nu}(t) = m(t)\,e^{i(\varphi_0+\theta_i^{1})}
+ \frac{1}{N}\sum_{\mu \neq 1,\nu}^{P}\sum_{j=1}^{N}
\xi_i^{\mu}\,\tilde{\xi}_j^{\mu}\,W_j(t). \tag{3.8}
\]
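For completeness, the first-order expansion used to obtain equation 3.7 can be written out; this intermediate step is our reconstruction. Writing h for h_j^ν(t) and ε for the perturbation of order 1/√N,

\[
\frac{h+\varepsilon}{|h+\varepsilon|}
 = (h+\varepsilon)\,\bigl(h\tilde h + h\tilde\varepsilon
   + \varepsilon\tilde h + O(\varepsilon^2)\bigr)^{-1/2}
 \approx \frac{h}{|h|} + \frac{\varepsilon}{|h|}
   - \frac{h\,(h\tilde\varepsilon + \varepsilon\tilde h)}{2|h|^{3}}
 = \frac{h}{|h|} + \frac{\varepsilon}{2|h|}
   - \tilde\varepsilon\,\frac{h}{2\tilde h\,|h|}.
\]

Substituting ε = (1/N) Σ_k ξ_j^ν ξ̃_k^ν W_k(t) and carrying out the sums over j and ν reproduces the three terms of equation 3.7.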
Now we assume that h_i^ν(t) is independent of ξ_i^ν. In this case, the first and second terms are of order √(P/N) = √α, while the last term is of order √(P/N²) = √(α/N). Therefore, in the limit N → ∞, the last term can be ignored. This gives

\[
z_i(t+1) \sim \frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
\xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j^{\nu}(t)}{|h_j^{\nu}(t)|}
+ z_i(t)\,U(t), \tag{3.9}
\]
with

\[
U(t) = \left\langle\!\!\left\langle \frac{1}{2\,|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)}. \tag{3.10}
\]
In the derivation of U(t), we have used the fact that h_j^ν(t) → h_j(t) in the limit N → ∞. We now square equation 3.9 to obtain σ(t+1) by using 2σ(t+1)² = ⟨|z_i(t+1)|²⟩. Since h_i^ν(t) is assumed to be independent of ξ_i^ν, we can easily determine the average of the square of the first term to be α. Similarly, the square of the second term is σ(t)²U(t)². Thus, we obtain

\[
\sigma(t+1)^2 = \frac{\alpha}{2} + \sigma(t)^2 U(t)^2
+ U(t)\,\mathrm{Re}\left\langle\!\!\left\langle
\frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
\xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j^{\nu}(t)}{|h_j^{\nu}(t)|}\,\tilde{z}_i(t)
\right\rangle\!\!\right\rangle. \tag{3.11}
\]
In general, to estimate the last term, we must take into account the fact that the noise z(t) at time t is generally correlated with z(t′) at a preceding time t′. In the nth-order approximation, we estimate the last term in equation 3.11 including the correlation of the noise z(t) up to z(t−n+1). At the same time, we assume that z(t) is independent of z(t−n) and of its values at any earlier time. For later convenience, we define

\[
\begin{aligned}
X(t,t') &= \mathrm{Re}\left\langle\!\!\left\langle
\frac{m(t-1)+z(t-1)}{|m(t-1)+z(t-1)|}\;
\frac{m(t'-1)+\tilde{z}(t'-1)}{|m(t'-1)+z(t'-1)|}
\right\rangle\!\!\right\rangle_{z(t-1),\,z(t'-1)} \\
C(t,t') &= \bigl\langle\!\bigl\langle z(t)\,\tilde{z}(t') \bigr\rangle\!\bigr\rangle_{z(t),\,z(t')}.
\end{aligned} \tag{3.12}
\]

Note that X(t,t′) = m(t)m(t′) if z(t) is independent of z(t′), that is, if C(t,t′) = 0. For n = 1, using

\[
z_i(t+1) = \frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
\xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j(t)}{|h_j(t)|},
\]

the last term in equation 3.11 becomes αU(t)X(t+1, t) = αU(t)m(t)m(t+1). In this case, the noise z(t) at each time is assumed to be an independent
gaussian variable. Finally, we obtain the macrodynamical equations at first order:

\[
\begin{aligned}
m(t+1) &= \left\langle\!\!\left\langle \frac{m(t)+z(t)}{|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)} \\
\sigma(t+1)^2 &= \frac{\alpha}{2} + \sigma(t)^2 U(t)^2 + \alpha\,U(t)\,m(t)\,m(t+1).
\end{aligned} \tag{3.13}
\]

Here, U(t) is determined by m(t) and σ(t) according to equation 3.10. This approximation for the oscillator network is thought to correspond to the Amari-Maginu theory (1988) for traditional neural networks.

For n = 2, we need to take account of the correlation between z_i(t) and z_i(t−1). Using the relation in equation 3.9, we obtain

\[
\sigma(t+1)^2 = \frac{\alpha}{2} + \sigma(t)^2 U(t)^2 + \alpha\,U(t)\,X(t+1,t)
+ \alpha\,U(t)\,U(t-1)\,m(t+1)\,m(t-1). \tag{3.14}
\]
To calculate C(t, t−1), we estimate ⟨⟨z(t)z̃(t−1)⟩⟩ by using equation 3.9 and the approximation C(t, t−2) = 0. Then we obtain

\[
2C(t,t-1) = \alpha\,X(t+1,t) + 2\sigma(t-1)^2\,U(t-1). \tag{3.15}
\]
Consequently, the second-order approximation yields

\[
\begin{aligned}
m(t+1) &= \left\langle\!\!\left\langle \frac{m(t)+z(t)}{|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)} \\
\sigma(t+1)^2 &= \frac{\alpha}{2} + \sigma(t)^2 U(t)^2 + \alpha\,U(t)\,X(t+1,t)
+ \alpha\,U(t)\,U(t-1)\,m(t+1)\,m(t-1) \\
2C(t,t-1) &= \alpha\,X(t+1,t) + 2\sigma(t-1)^2\,U(t-1),
\end{aligned} \tag{3.16}
\]
where X(t+1, t) is calculated from the definition in equation 3.12. Note that X(t+1, t) is determined by four macroscopic parameters: m(t), m(t−1), σ(t), and σ(t−1). In general, we can proceed with similar calculations up to nth order. We find that the dynamical equations at nth order are given by

\[
\begin{aligned}
m(t+1) &= \left\langle\!\!\left\langle \frac{m(t)+z(t)}{|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)} \\
\sigma(t+1)^2 &= \frac{\alpha}{2} + \sigma(t)^2 U(t)^2
+ \alpha \sum_{k=0}^{n-1} X(t+1,\,t-k) \prod_{l=0}^{k} U(t-l),
\end{aligned} \tag{3.17}
\]
with

\[
2C(t,t-\tau) =
\begin{cases}
\alpha X(t,\,t-\tau) + 2U(t-1)\,U(t-\tau-1)\,C(t-1,\,t-\tau-1) & \\
\quad + \alpha \sum\limits_{k=1}^{n-1} X(t-k,\,t-\tau)\prod\limits_{l=1}^{k} U(t-l)
  + \alpha \sum\limits_{k=\tau+1}^{n-1} X(t,\,t-k)\prod\limits_{l=\tau+1}^{k} U(t-l),
& (1 \le \tau \le n-2) \\[4pt]
\alpha X(t,\,t-n+1) + 2U(t-1)\,C(t-1,\,t-n+1), & (\tau = n-1) \\[2pt]
0, & (n \le \tau).
\end{cases} \tag{3.18}
\]

Here, U(t) and X(t, t′) are given by equations 3.10 and 3.12, respectively. In addition, we approximate the initial conditions as σ(0)² = α/2 and X(τ, 0) = m(τ)m(0) (τ = 1, . . . , n), which is valid for τ = 1. In conclusion, equations 3.10, 3.12, 3.17, and 3.18 are our main theoretical results describing the recall process of oscillator neural networks. The corresponding results for traditional neural networks were derived by Okada (1995).

In Figure 2, we compare the time evolution of the overlap predicted by our theory with numerical simulations for several choices of the initial overlap. Figure 2 indicates that the predictions of the higher-order approximations agree better with the numerical results. We generally find two phases of system behavior, depending on the storage level α and the initial overlap m(0). One is a retrieval phase, in which m(t) → m(∞) ∼ 1 for large t. The other is a nonretrieval phase, in which m(t) → 0. When the retrieval phase exists for a given α, the network goes to the state m(∞) ∼ 1, provided that the initial overlap is larger than a certain critical value m_c^α. The width of the basin of attraction for retrieving the pattern can be measured by this critical value.

In Figure 3, we summarize the results as a phase diagram in which the attractor m(∞) and the basins of attraction m_c^α are plotted as functions of α for various orders of approximation. The upper and lower curves correspond to m(∞) and m_c^α, respectively. As seen in this figure, the theoretical predictions at fourth order are in reasonable agreement with the numerical results. By contrast, the lower-order approximations yield larger basins of attraction than those found numerically. In particular, the prediction at the lowest order differs markedly from the numerical results. For example, the critical value of the storage capacity, α_c, obtained by numerical simulations is about 0.04, while that estimated by the first-order approximation is 0.10. This significant deviation is attributed mainly to the temporal correlation of the noise terms defined by equation 3.2. On the other hand, such correlation is relatively weak in the case of the standard model of neural networks, and thus the theoretical prediction at the lowest order is in good agreement with numerical simulations (Amari & Maginu, 1988). To understand the correct behavior of the retrieval dynamics in oscillator neural networks, we cannot ignore the effect of the temporal correlation of the noise terms.
Figure 2: Typical time evolution of the overlap for α = 0.03 and initial overlaps m(0) = 0.05, 0.1, 0.3, 0.5, and 0.7. (a) Numerical simulations with N = 1000. Theoretical curves at (b) first order, (c) second order, and (d) third order.
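The first-order recursion can be iterated numerically with little effort. The Python sketch below is ours, not the authors' code; it evaluates the Gaussian averages of equations 3.10 and 3.13 by Monte Carlo over the complex noise z with ⟨|z|²⟩ = 2σ(t)²:

```python
# First-order macrodynamics (equations 3.10 and 3.13): iterate m(t) and
# sigma(t)^2, estimating the averages over the complex gaussian noise
# z = sigma*(x + i*y), with x, y standard normal, by Monte Carlo.
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 200_000))          # fixed sample for all averages

def averages(m, sigma):
    r = np.hypot(m + sigma * x, sigma * y)    # |m + z|
    return float(np.mean((m + sigma * x) / r)), float(np.mean(0.5 / r))

alpha, m = 0.03, 0.3
sigma2 = alpha / 2.0                          # initial condition sigma(0)^2
for t in range(20):
    m_next, U = averages(m, np.sqrt(sigma2))  # m(t+1) and U(t)
    sigma2 = alpha / 2.0 + sigma2 * U**2 + alpha * U * m * m_next
    m = m_next
    print(t + 1, round(m, 3))
```

The higher-order equations 3.17 and 3.18 can be iterated in the same fashion, at the additional cost of evaluating the correlated averages X(t, t′).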
4 Phase Sequence Generator

To this point we have been concerned with the network's retrieval of a single phase pattern, in which the neural oscillators maintain a static phase relationship. In this section, we consider a more general situation in which the network retrieves an ordered sequence of phase patterns. This implies that the phase relationship between the neurons in the retrieval state generally changes over time. In the case of standard neural networks with synchronous updating, the realization of such temporal association has been attempted (Amari, 1972; Fukushima, 1972). The essential point is the modification of the Hebbian rule so as to couple consecutive patterns of the sequence. Accordingly, we use the following synaptic connection:
\[
C_{ij} = \frac{1}{N}\sum_{\mu=1}^{P} \xi_i^{\mu+1}\,\tilde{\xi}_j^{\mu}. \tag{4.1}
\]
In an analysis similar to that presented in the previous section, we can derive dynamical equations for the retrieval process. Here, we give only a
Figure 3: Basins of attraction. The four curves represent the theoretical results at various orders of approximation. The data points indicate simulation results with N = 1000 over 20 trials.
rough explanation of the derivation. The main difference here is that the last term in equation 3.11 vanishes owing to the synaptic connection (see equation 4.1). In other words, in the limit N → ∞, the temporal correlation of the noise terms does not contribute to the retrieval dynamics, and we need only calculate the lowest-order approximation. A straightforward computation yields the final form,

\[
\begin{aligned}
m(t+1) &= \left\langle\!\!\left\langle \frac{m(t)+z(t)}{|m(t)+z(t)|} \right\rangle\!\!\right\rangle_{z(t)} \\
\sigma(t+1)^2 &= \frac{\alpha}{2} + \sigma(t)^2 U(t)^2,
\end{aligned} \tag{4.2}
\]
where m(t) is the overlap with the pattern that the network should recall at time t, σ(t)² is the variance of the noise z(t), and U(t) is defined by equation 3.10. Applying these recursion relations iteratively for various initial overlaps and storage levels, we obtain the basins of attraction shown in Figure 4. We see from the figure that the ability to recall the pattern is enhanced in comparison with the autoassociative model. For example, the critical storage capacity α_c in the sequence generator is approximately 0.19, which is about five times greater than that in the autoassociative case.
Figure 4: Basins of attraction in the phase sequence generator. The solid line and the diamonds represent the theoretical curve and numerical data, respectively. The theoretical curve (dashed line) and numerical data (circles) for the autoassociative model are shown for reference.
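A direct simulation makes this enhancement easy to observe. The following Python sketch is ours, not the authors' code; for convenience it closes the stored sequence into a cycle (pattern P is coupled back to pattern 1), which is one of the limit-cycle variants mentioned at the end of this section:

```python
# Sequence retrieval with the couplings of equation 4.1: each update should
# advance the network from pattern mu to pattern mu+1 (cyclically here).
import numpy as np

rng = np.random.default_rng(2)
N, P = 1000, 100                                  # alpha = 0.1 < 0.19
xi = np.exp(1j * rng.uniform(0, 2 * np.pi, size=(P, N)))

# C_ij = (1/N) sum_mu xi_i^{mu+1} conj(xi_j^mu), with pattern P mapped to 1
C = np.roll(xi, -1, axis=0).T @ xi.conj() / N

W = xi[0] * np.exp(1j * rng.normal(0.0, 0.8, size=N))   # noisy pattern 1
for t in range(1, 11):
    W = C @ W
    W /= np.abs(W)
    m = np.abs(xi.conj() @ W) / N                 # overlaps with all patterns
    print(t, int(np.argmax(m)), round(float(m.max()), 3))
```

At α = 0.1, above the autoassociative capacity but below 0.19, the printed index should advance by one pattern per update while the overlap stays close to 1.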
We also find that the basin of attraction is sufficiently wide. This enhancement is supported by numerical simulations. On the other hand, in the case of the standard model, this enhancement is relatively small; for example, the storage capacity in the sequence generator is approximately 0.28, as compared to 0.14 in the autoassociative case (Amari, 1988; Domany, Kinzel, & Meir, 1989). Therefore, this large enhancement is one of the characteristics of the oscillator neural network.

Let us consider the reason for this enhancement. In the case of the autoassociative model, the temporal correlation of the noise in the oscillator network is stronger than that in traditional networks, and it seems that this temporal correlation tends to decrease the storage capacity. This correlation, however, does not affect the retrieval dynamics in the case of the sequence generator. Therefore, an increase in the correlation widens the gap between the storage capacity of the autoassociative model and that of the sequence generator. This is the reason for the large enhancement of the storage capacity in storing sequences.

Similar synaptic rules can be applied to more general types of sequential patterns, such as those consisting of several limit cycles, provided that a
pattern in the sequence is determined only by the preceding pattern. If the period of each limit cycle becomes infinite as N → ∞, the result is again equation 4.2, where in this case P represents the total number of patterns associated with all limit cycles.

5 Conclusion and Discussion

A good understanding of the retrieval dynamics in oscillator neural networks was obtained by theoretical analysis, and the validity of this theory was demonstrated by numerical simulations. The theoretical results show that the major characteristics of the oscillator neural network are as follows:

• To understand the correct behavior of the retrieval dynamics in the case of the autoassociative model, it is essential to take into account the fact that the noise arising from the learning of all the unretrieved patterns is temporally correlated. In fact, to achieve reasonable agreement with the numerical simulations, the fourth-order approximation is required. This involves the inclusion of the temporal correlation of the noise up to three previous time steps.

• The critical storage capacity in the case of the sequence generator is about 0.19, five times larger than the 0.038 of the autoassociative model. By contrast, in the traditional neural network, the storage capacity of the sequence generator is only twice as large as that of the autoassociative model. This large enhancement is one of the characteristics of the oscillator neural network.

• It is desirable that, even near saturation, the basin of attraction remains sufficiently wide to allow recall of a memorized pattern from a noisy one. In both cases, the basin of attraction is wide enough to accomplish this.

Judging from the fact that the oscillator network is capable of retrieving more detailed information, we believe on the whole that its ability is comparable to that of standard neural networks. This result encourages theoretical attempts to explore the potential of temporal coding.

Because our analysis has been simplified by discretizing time and assuming synchronous updating, the behavior described may differ from that of the original model with continuous time. In fact, in some cases the dynamics of equation 2.4 with synchronous updating possess period-2 solutions that do not correspond to any solution of the original phase model (see equation 2.2). In this case, an attractor existing in the original model becomes an invariant torus constituted by such periodic solutions in our model (see equation 2.4). For example, this phenomenon is observed in certain situations in which the coupled phase oscillators have an in-phase synchronization attractor. However, under the conditions that the memorized phase patterns are uniformly random and the system size
Figure 5: Comparison of basins of attraction. The circles represent numerical data in the discrete-time model associated with our theoretical results. The squares represent simulations in the original phase model, where the dynamics are governed by a set of ordinary differential equations. For reference, the theoretical curve (the solid line) obtained using the fourth-order approximation is also shown.
is large, we find that such irrelevant solutions are not observed in numerical simulations. Although we believe that the irrelevant solutions are not realized because of their high symmetry, we cannot yet confirm this belief theoretically. This is an important problem to consider.

The question now arises whether the theoretical results given in this article apply in the case of the original model. To clarify this point for the autoassociative model, we compared the basins of attraction of the two models numerically, as shown in Figure 5. Although a slight difference between the models can be found, we can safely say that the basins of attraction are qualitatively very similar. On the other hand, like the standard model, the dynamical process in the case of the sequence generator is sensitive to slight asynchronization. Therefore, a control mechanism for synchronous updating, such as a central pacemaker, is required to realize the recall of sequential patterns in real systems.

We would now like to make some comments. The first concerns the relation of our theory to replica theory. In the case of autoassociative memory, according to the fourth-order approximation, the critical storage capacity is approximately 0.046. Although this estimate is slightly larger than the
value obtained by replica theory, the higher-order approximations are expected to make this difference smaller and, in the limit n → ∞, to make it vanish. In fact, it can be shown that the two theories yield the same result concerning equilibrium states (see the appendix). Also, the storage capacity obtained by simulations is slightly larger than that from replica theory. This difference may be attributed to replica symmetry breaking. We also note that numerical simulations show that the noise distribution does not remain gaussian if the network fails to recall the pattern (Nishimori & Ozeki, 1993). Consequently, our theory cannot be applied to the dynamical process after recall has failed.

Finally, we mention some weak points of our model. In this article, all neurons are assumed to exhibit periodic firing states. However, for any given pattern, generally some neurons will be in a resting state. We would like to develop our theory by including such a resting state (Aoyagi, 1995). In addition, the dynamical evolution in the phase oscillator model is generally slow, and it seems that our model has the same basic feature. To achieve a quick response, we may need to extend our model to describe the excitability characteristics of neural systems (Terman & Wang, 1995; Somers & Kopell, 1995).

Appendix: Relation to Replica Theory

In this appendix, we show, assuming that the system is in the equilibrium state, that the result obtained by the replica method can be derived from the present theory. In replica theory, using the replica-symmetric ansatz, the equilibrium state of the oscillator neural network is characterized by two order parameters, m and r, corresponding to the overlap with the retrieval pattern and the mean-square random overlap with the unretrieved patterns, respectively. According to replica theory (Cook, 1989; Aoyagi & Kitano, 1997), these order parameters in the equilibrium state satisfy the equations

\[
\begin{aligned}
m &= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{x + \sqrt{2}\,m/\sqrt{\alpha r}}
     {\sqrt{\bigl(x + \sqrt{2}\,m/\sqrt{\alpha r}\bigr)^2 + y^2}} \\
r &= \left\{ 1 - \frac{1}{\sqrt{2\alpha r}}\,
G_{\mathrm{rp}}\!\left(\frac{\sqrt{2}\,m}{\sqrt{\alpha r}}\right) \right\}^{-2}, 
\end{aligned} \tag{A.1}
\]

where Dx Dy denotes

\[
\frac{1}{2\pi}\exp\!\left(-\frac{x^2+y^2}{2}\right) dx\,dy
\]
and G_rp(a) is defined by

\[
G_{\mathrm{rp}}(a) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{x^2 + y^2 + ax}{\sqrt{(x+a)^2 + y^2}}. \tag{A.2}
\]
Introducing the variable σ_rp = √(αr/2), equations A.1 and A.2 can be rewritten as

\[
\begin{aligned}
m &= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{\sigma_{\mathrm{rp}}\,x + m}
     {\sqrt{(\sigma_{\mathrm{rp}}\,x + m)^2 + (\sigma_{\mathrm{rp}}\,y)^2}} \\
\sigma_{\mathrm{rp}}^2 &= \frac{\alpha}{2}
\left[ 1 - \frac{1}{2\sigma_{\mathrm{rp}}}\,
G_{\mathrm{rp}}\!\left(\frac{m}{\sigma_{\mathrm{rp}}}\right) \right]^{-2}.
\end{aligned} \tag{A.3}
\]

We easily find that the equation here for m is equivalent to equation 3.5, provided that σ_rp coincides with σ, the parameter characterizing the variance of the complex noise z in equation 3.3. Therefore, we need to prove that the variance of the noise z in the equilibrium state satisfies the condition represented by equation A.3. The equilibrium state implies that z_i(t+1) = z_i(t) = z_i in equation 3.9. Then z_i is given by
\[
z_i = \frac{1}{1-U}\;\frac{1}{N}\sum_{\nu \neq 1}^{P}\sum_{j=1}^{N}
\xi_i^{\nu}\,\tilde{\xi}_j^{\nu}\,\frac{h_j^{\nu}}{|h_j^{\nu}|}. \tag{A.4}
\]
In a derivation similar to that of equation 3.11, we obtain

\[
\sigma^2 = \frac{\alpha}{2\,(1-U)^2}. \tag{A.5}
\]
From the definition of U in equation 3.10, we find

\[
U = \frac{1}{2}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{1}{\sqrt{(\sigma x + m)^2 + (\sigma y)^2}}
= \frac{1}{2\sigma}\int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{1}{\sqrt{(x + m/\sigma)^2 + y^2}}. \tag{A.6}
\]
Therefore, equation A.5 becomes

\[
\sigma^2 = \frac{\alpha}{2}
\left[ 1 - \frac{1}{2\sigma}\,G_{\mathrm{pt}}\!\left(\frac{m}{\sigma}\right) \right]^{-2}, \tag{A.7}
\]

with

\[
G_{\mathrm{pt}}(a) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{1}{\sqrt{(x+a)^2 + y^2}}. \tag{A.8}
\]
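The equilibrium conditions can also be solved numerically. The following Python sketch is ours, not the authors' code; it iterates the stationary overlap equation (equation 3.5 with m(t+1) = m(t)) together with equations A.5 and A.6, again using Monte Carlo estimates of the Gaussian integrals:

```python
# Fixed-point iteration of the equilibrium conditions: m from the stationary
# overlap equation, sigma from sigma^2 = alpha/(2 (1-U)^2) with U of eq. A.6.
import numpy as np

rng = np.random.default_rng(3)
x, y = rng.normal(size=(2, 400_000))

def averages(m, sigma):
    r = np.hypot(m + sigma * x, sigma * y)        # |m + z|
    return float(np.mean((m + sigma * x) / r)), float(np.mean(0.5 / r))

def equilibrium(alpha, iters=300):
    m, sigma = 1.0, np.sqrt(alpha / 2.0)
    for _ in range(iters):
        m, U = averages(m, sigma)
        sigma = np.sqrt(alpha) / (np.sqrt(2.0) * abs(1.0 - U))   # eq. A.5
    return m

for alpha in (0.02, 0.04, 0.06):
    print(alpha, round(equilibrium(alpha), 3))    # m stays ~1 below capacity
```

Scanning α for the point at which the m ≈ 1 solution disappears yields a numerical estimate of the equilibrium storage capacity discussed in section 5.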
Now we can prove that G_pt(a) = G_rp(a), because

\[
\begin{aligned}
G_{\mathrm{rp}}(a) - G_{\mathrm{pt}}(a)
&= \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} Dx\,Dy\,
\frac{x^2 + y^2 + ax - 1}{\sqrt{(x+a)^2 + y^2}} \\
&= \frac{1}{2\pi}\int_{0}^{2\pi} d\varphi \int_{0}^{\infty} dR\,
\bigl(R^2 - aR\cos\varphi - 1\bigr)\,
\exp\!\left(-\frac{R^2 - 2aR\cos\varphi + a^2}{2}\right) \\
&= \frac{1}{2\pi}\int_{0}^{2\pi} d\varphi
\left[ -R\,\exp\!\left(-\frac{R^2 - 2aR\cos\varphi + a^2}{2}\right) \right]_{R=0}^{R=\infty} \\
&= 0,
\end{aligned} \tag{A.9}
\]

where we have substituted x = R cos ϕ − a and y = R sin ϕ.
Hence, for the equilibrium state, the result of the present theory coincides with that of replica theory. This derivation is essentially equivalent to the self-consistent signal-to-noise analysis proposed by Shiino and Fukai (1992). Okuda (1994) first applied this method to the analysis of the equilibrium state in oscillator neural networks. It is worth noting that equation A.7 can be derived directly from equation 3.11 in the limit n → ∞: using the fact that, owing to z_i(t+1) = z_i(t) = z_i, X(t, t′) = 1 in the equilibrium state, it can be shown that equation 3.11 reduces to the same form as equation A.7.

Acknowledgments

We thank H. Yanai and T. Munakata for valuable comments and helpful discussions. We also thank G. Paquette for critically reading the manuscript and the anonymous referees for their helpful comments. This work was supported by the Japanese Grant-in-Aid for Science Research Fund from the Ministry of Education, Science, and Culture.

References

Abbott, L. F. (1990). A network of oscillators. J. Phys. A, 23, 3835–3859.
Amari, S. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. on Computers, C-21, 1197–1206.
Amari, S. (1988). Associative memory and its statistical neurodynamical analysis. In H. Haken (Ed.), Neural and synergetic computers (pp. 85–99). Berlin: Springer-Verlag.
Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63–73.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett., 55, 1530–1533.
Aoyagi, T. (1995). Network of neural oscillators for retrieving phase information. Phys. Rev. Lett., 74, 4075–4078.
Aoyagi, T., & Kitano, K. (1997). Effect of random synaptic dilution in oscillator neural networks. Phys. Rev. E, 55, 7424–7428.
Arenas, A., & Vicente, C. J. P. (1994). Phase locking in a network of neural oscillators. Europhys. Lett., 26, 79–83.
Cook, J. (1989). The mean-field theory of a Q-state neural network model. J. Phys. A, 22, 2057–2067.
Coolen, A. C. C., & Sherrington, D. (1993). Dynamics of fully connected attractor neural networks near saturation. Phys. Rev. Lett., 71, 3886–3889.
Domany, E., Kinzel, W., & Meir, R. (1989). Layered neural networks. J. Phys. A, 22, 2081–2102.
Fukai, T., & Shiino, M. (1995). Memory recall by quasi-fixed-point attractors in oscillator neural networks. Neural Comp., 7, 529–548.
Fukushima, K. (1972). A model of associative memory in the brain. Kybernetik, 12, 58–63.
Gardner, E., Derrida, B., & Mottishaw, P. (1987). Zero temperature parallel dynamics for infinite range spin glasses and neural networks. Journal de Physique, 48, 741–755.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. Berlin: Springer-Verlag.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Nishimori, H., & Ozeki, T. (1993). Retrieval dynamics of associative memory of the Hopfield type. J. Phys. A, 26, 859–871.
Noest, A. J. (1988). Associative memory in sparse phasor neural networks. Europhys. Lett., 6, 469–474.
Okada, M. (1995). A hierarchy of macrodynamical equations for associative memory. Neural Networks, 8, 833–838.
Okuda, K. (1994). Retrieval of phase information in an oscillator network. Unpublished manuscript.
Park, K., & Choi, M. Y. (1995). Synchronization in a network of neuronal oscillators with finite storage capacity. Phys. Rev. E, 52, 2907–2911.
Schillen, T. B., & König, P. (1994). Binding by temporal structure in multiple feature domains of an oscillatory neural network. Biol. Cybern., 70, 397–405.
Shiino, M., & Fukai, T. (1992). Self-consistent signal-to-noise analysis and its application to analogue neural networks with asymmetric connections. J. Phys. A, 25, L375–L381.
Somers, D., & Kopell, N. (1995). Waves and synchrony in networks of oscillators of relaxation and non-relaxation type. Physica D, 89, 169–183.
Sompolinsky, H., Golomb, D., & Kleinfeld, D. (1991). Cooperative dynamics in visual processing. Phys. Rev. A, 43, 6990–7011.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Wang, D. L., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Comp., 2, 95–107.

Received July 8, 1997; accepted December 10, 1997.
LETTER
Communicated by Scott Kirkpatrick
A Fast and Robust Cluster Update Algorithm for Image Segmentation in Spin-Lattice Models Without Annealing—Visual Latencies Revisited

Ralf Opara
Florentin Wörgötter
Department of Neurophysiology, Ruhr-Universität, 44780 Bochum, Germany
Image segmentation in spin-lattice models relies on the fast and reliable assignment of correct labels to those groups of spins that represent the same object. Commonly used local spin-update algorithms are slow because in each iteration only a single spin is flipped and a careful annealing schedule has to be designed in order to avoid local minima and correctly label larger areas. Updating of complete spin clusters is more efficient, but often clusters that should represent different objects will be conjoined. In this study, we propose a cluster update algorithm that, similar to most local update algorithms, calculates an energy function and determines the probability for flipping a whole cluster of spins by the energy gain calculated for a neighborhood of the regarded cluster. The novel algorithm, called energy-based cluster update (ECU algorithm), is compared to its predecessors. A convergence proof is derived, and it is shown that the algorithm outperforms local update algorithms by far in speed and reliability. At the same time it is more robust and noise tolerant than other versions of cluster update algorithms, making annealing completely unnecessary. The reduction in computational effort achieved this way allows us to segment real images in about 1–5 sec on a regular workstation. The ECU algorithm can recover fine details of the images, and it is to a large degree robust with respect to luminance gradients across objects. In a final step, we introduce luminance-dependent visual latencies (Opara & Wörgötter, 1996; Wörgötter, Opara, Funke, & Eysel, 1996) into the spin-lattice model. This step guarantees that only spins representing pixels with similar luminance become activated at the same time. The energy function is then computed only for the interaction of the regarded cluster with the currently active spins. This latency mechanism improves the quality of the image segmentation by another 40%. The results shown are based on the evaluation of gray-level differences. It is important to realize that all algorithmic components can be transferred easily to arbitrary image features, like disparity, texture, and motion.
Neural Computation 10, 1547–1566 (1998) © 1998 Massachusetts Institute of Technology
1 Introduction

Solutions for computer vision problems almost always avoid the use of spiking neural networks because they are computationally expensive. This, on the other hand, makes it usually rather difficult, if not impossible, to implement neuronal algorithms directly. For example, image segmentation by spiking neural networks is commonly achieved by synchronizing the firing patterns of units that represent a common object (von der Malsburg, 1981; von der Malsburg & Schneider, 1986; Gray, König, Engel, & Singer, 1989; Ernst, Pawelzik, & Geisel, 1994; Nischwitz & Glünder, 1995). The same concept (i.e., "synchronization") can be introduced into spin-lattice models by mapping every neuron of the neural net onto one spin element in the lattice (Geman, Geman, Graffigne, & Dong, 1990; Vorbrüggen, 1995; Eckes & Vorbrüggen, 1996; Blatt, Wiseman, & Domany, 1996). Synchronization then means that different spins will get the same orientation. In principle such a mechanism can be used to label the different objects in a visual scene by trying to ensure that spins that belong to the same object have the same orientation (i.e., the same label).

Several spin-lattice algorithms exist that can be used to segment visual scenes by labeling the objects (Geman et al., 1990; Vorbrüggen, 1995; Eckes & Vorbrüggen, 1996; Blatt et al., 1996). These algorithms differ mainly in the way they define the interaction range between spins and how the individual spins are iteratively updated. Local update algorithms (Geman et al., 1990; Vorbrüggen, 1995; Eckes & Vorbrüggen, 1996) modify only one spin per iteration, and the interaction ranges are usually small. With cluster update algorithms (Blatt et al., 1996), on the other hand, larger interaction ranges are introduced, and whole groups of spins—the spin clusters—are updated simultaneously.

Like neural nets, spin-lattice models are also confronted with the problem of how to organize spins (synchronize units) over large areas. Local update algorithms can solve this problem in many cases by propagating a certain modification through the whole lattice step by step, which makes them rather slow. Cluster update algorithms are much faster because a change can affect many spins at the same time. This, however, affects their robustness in a negative way because clusters that should get different labels can easily collapse and form one indistinguishable clump. In this article, we mainly focus on cluster update algorithms and describe a modification of the Potts model (Potts, 1952) that preserves the advantage of the original version—the convergence speed—but is much more robust. We also show that this algorithm can be used to introduce one additional neuronal concept—visual latencies—to improve its performance (Opara & Wörgötter, 1996; Wörgötter et al., 1996).
Figure 1: Schematic diagram of the system. The input image is piped into a luminance-dependent delay line. Bright objects are processed before dark objects. The temporal structure of this data stream determines the interaction between spins within the second part of the system, which consists of a spin lattice.
2 Overview of the Model

The model we propose consists of two parts, shown in Figure 1. The first part of the system contains a luminance-dependent delay line ("visual latencies"; see also Opara & Wörgötter, 1996). Pixels with a high luminance are processed before dark pixels. The temporal structure of this data stream determines the interaction between spins in the second part of the system, which consists of Potts spins (Potts, 1952) arranged on a two-dimensional lattice. Within this spin lattice, the algorithm tends to assign the same label only to those spatially adjacent pixels that reach the spin lattice with high temporal coherence (same latency). Pixels with strong delays between them will get different labels.

2.1 Spin-Lattice Model. First we describe the model without visual latencies. The spin-lattice model we use is similar to the model that Potts proposed in 1952. In the Potts model, each spin can take q different values
(2 ≤ q), which is a generalization of the Ising model (q = 2) (Ising, 1925). In the two-dimensional case, the spins are arranged on a lattice of size N = L_x L_y. We define a label using the symbol w and a label configuration by W = {w_1, . . . , w_N} ∈ Ω, where Ω is the space of all configurations. A subconfiguration, in which we consider fewer than N spins, is denoted by a superscript (e.g., W^c). The global energy function of our specific Potts configuration W ∈ Ω is given by:

\[
E(W) = \sum_{i=1}^{N} K_i + \sum_{\langle i,j \rangle} -J_{i,j}\,\delta_{w_i,w_j}
      = \sum_{i=1}^{N} K_i + \sum_{\langle i,j \rangle} E_{ij}, \tag{2.1}
\]
where w_i, w_j are the labels of spins i and j; J_{i,j} is the interaction strength between spins at locations i and j; δ_{w_i,w_j} is a Kronecker delta, being 1 if w_i = w_j and 0 otherwise; ⟨i, j⟩ is a neighborhood of spin i with ||i, j|| ≤ Δ, where Δ is a constant that needs to be set; and K_i is a global "inhibition." If the term K_i (which will be defined and discussed below) were set to zero, one would obtain the global energy function of the generic Potts model in its usual form. In the homogeneous Potts model (J_{i,j} = const), all spins interact with the same strength; in the inhomogeneous Potts model, the interaction strength changes over space (J_{i,j} ≠ const). To apply the Potts model to an image segmentation task, the similarity structure of the input image has to be represented in the interaction strengths J_{i,j} of the spins (Geman et al., 1990; Vorbrüggen, 1995; Eckes & Vorbrüggen, 1996; Blatt et al., 1996). Spins that represent similar image parts (same object) have to interact strongly, while dissimilar image parts will not interact or will interact with a negative strength.

2.2 The Energy-Based Cluster Update (ECU) Algorithm. Several different algorithms exist in the literature to order the spins in a Potts model according to a predefined goal, like determining phase transitions in ferromagnetic systems or, as in our case, segmenting an image. The local-update "Metropolis" algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953), which rotates single spins per iteration and tries to minimize a global energy function using simulated annealing (Kirkpatrick, Gelatt, & Vecchi, 1983), is well known. Cluster update algorithms (Swendsen & Wang, 1987; Wolff, 1989) treat groups of spins simultaneously and are therefore much faster, but often an undesired fusion of clusters is observed. The approach we propose seeks to combine the advantages of both types of algorithms, local and cluster update. We will demonstrate that our algorithm overcomes the critical slowing down at phase transitions and in the low-temperature phase observed in local update algorithms, so simulated
annealing is not necessary for our approach. Moreover, it does not produce unwanted cluster fusions.

The structure of the ECU algorithm is as follows. We start with a random label configuration, and individual spins are combined into clusters according to equation 2.2. Then we calculate an energy function. Following this, the label configuration is modified such that the probability of a new configuration depends on the energy of this new configuration and on the temperature. From this new label configuration, a new cluster configuration is formed, and the next iteration starts.

First we show how clusters are formed. Similar to the definitions for labels above, we define c as a cluster; C = {c_1, . . . , c_M}, M ≤ N, as a configuration of clusters; and C ∈ Γ, where Γ is the space of all possible cluster configurations. The formation of clusters is the same as in other cluster update algorithms. Clusters must contain only spins that are in the same state (same label). The binding probability P_B that two nearest-neighbor spins are conjoined to form a cluster depends on the temperature T and on the coupling strength J_{i,j} (see below) of the two nearest-neighbor spins and is given by:

\[
P_{B,\langle ij \rangle_c}(i,j)
= 1 - \exp\bigl[-0.5\,J_{i,j}\,\delta_{w_i,w_j}/T\bigr]
= 1 - \exp\bigl[0.5\,E_{ij}/T\bigr]. \tag{2.2}
\]
The factor 0.5 is a normalization constant that is necessary because the clusters are not flipped independently of each other, as they are in the Swendsen and Wang algorithm. We show later that this normalization ensures that in thermodynamic equilibrium, configurations are taken according to the Boltzmann distribution. Furthermore, only a certain subset of spins, denoted as ⟨ij⟩_c, is taken into account when computing P_B—namely, those that are nearest neighbors and have a positive J_{i,j}, because otherwise negative probabilities would result. At high temperatures, the average cluster size is small (J_{i,j}/T → 0 and P_B(i,j) → 0), while at lower temperatures the possible cluster size increases.

In our approach we consider the energy gain of a cluster flip. Therefore, we define the energy of the subconfiguration W_k^c, assuming that all spins in the cluster c would take the label w_k. This is done by considering the interactions of all spins in the cluster with those outside the cluster but within a neighborhood ⟨ij⟩, c_k ≠ c_j:

\[
E(W_k^c) = \sum_{i \in c_k} K_i
+ \sum_{\substack{\langle ij \rangle \\ c_k \neq c_j}} \eta_{ij}\,E_{ij}, \tag{2.3}
\]
where ⟨ij⟩, c_k ≠ c_j, is the noncluster neighborhood of spin i, that is, the set of spins j outside the cluster c_k but within the interaction range Δ of
spin i given by ||i, j|| ≤ Δ. The constant η_ij can be set to 1.0 for all practical purposes. In order to arrive at a correct proof of detailed balance, however, a more complicated definition is required.¹ In classical cluster update algorithms, each cluster is simultaneously and independently assigned a spin label. In the ECU algorithm, we instead update one cluster at a time given the labels of the neighboring clusters. This uses the relative energies of the different label assignments computed by equation 2.3.

K_i acts as a global inhibition of the kind commonly used in other neural networks. Its value is rather noncritical, and it can be set within wide ranges. It ensures that far-away objects will get different labels. We define it as:
N α X δw ,w , N j=1 i j
(2.4)
where α is a control parameter that adjusts the strength of the global inhibition (α ≥ 0). J_{i,j}, which occurs in the definition of E_{ij} (see, e.g., equation 2.1), is the interaction strength of two spins i and j. We define it such that regions with similar gray values get positive weights, whereas dissimilar regions get negative weights (Vorbrüggen, 1995; Eckes & Vorbrüggen, 1996):

\[
J_{i,j} = 1 - \frac{|g_i - g_j|}{\Theta}, \tag{2.5}
\]
where g_i is the gray value of pixel i of the input image and Θ is the average difference of the gray values within all interaction neighborhoods N_i. It thereby represents the intrinsic (short-range) similarity within the whole input image. It is given by:²

\[
\Theta = \frac{1}{N}\,\frac{1}{(2\Delta+1)^2 - 1}
\sum_{i=1}^{N}\sum_{\langle ij \rangle} |g_i - g_j|, \tag{2.6}
\]
where (2Δ+1)² − 1 is the number of neighbors of a spin.

¹ The correct definition of η is given by redefining equation 2.3:

\[
E(W_k^c) = \sum_{i \in c_k} K_i
+ \sum_{\substack{\langle ij \rangle_{c_k} \\ c_k \neq c_j}} 0.5\,E_{ij}
+ \sum_{\substack{\langle ij \rangle \setminus \langle ij \rangle_{c_k} \\ c_k \neq c_j}} E_{ij},
\]
where ⟨ij⟩_{c_k} is the set of spins used to compute P_B and ⟨ij⟩\⟨ij⟩_{c_k} is its complement.

² In the case of Θ = 0, equation 2.5 is ill defined, but then only a single uniform surface exists, and segmentation is not necessary.
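As an illustration, the following small Python sketch (ours, not the authors' code) computes Θ and the couplings for an image; for brevity it restricts the neighborhood to the four nearest neighbors rather than the full (2Δ + 1)² − 1 neighborhood of equation 2.6:

```python
# Couplings from a gray-value image: Theta is the mean absolute gray-value
# difference over the interaction bonds (equation 2.6, here restricted to
# 4-neighbor bonds), and J = 1 - |g_i - g_j| / Theta (equation 2.5).
import numpy as np

def couplings(img):
    g = img.astype(float)
    dx = np.abs(np.diff(g, axis=1))      # horizontal neighbor differences
    dy = np.abs(np.diff(g, axis=0))      # vertical neighbor differences
    theta = np.concatenate([dx.ravel(), dy.ravel()]).mean()
    return 1.0 - dx / theta, 1.0 - dy / theta

rng = np.random.default_rng(4)
img = np.zeros((32, 32)); img[:16] = 120.0; img[16:] = 60.0
img += rng.normal(0.0, 5.0, img.shape)
Jx, Jy = couplings(img)
print(round(float(Jy[15].mean()), 2))    # bonds across the edge: negative J
```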
Similar to a Gibbs sampler, the selection probability P_S of selecting a label subconfiguration W_k^c in which all spins in the cluster c take the label w_k is given by:

\[
P_S(W_k^c) = \frac{\exp\bigl(E(W_k^c)/T\bigr)}
                  {\sum_{j=1}^{q} \exp\bigl(E(W_j^c)/T\bigr)}. \tag{2.7}
\]

The number of possible labels that a cluster can take is given by q. In the following simulations, we use q = 10 unless otherwise noted.
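To make the procedure concrete, the following condensed Python sketch (ours, not the authors' code) implements one ECU iteration on a lattice with 4-neighbor bonds. It uses η_ij = 1, Boltzmann weights exp(−E/T) with the energy of equations 2.1 and 2.3 for the per-cluster Gibbs step, and, for brevity, computes the boundary scores once from the pre-step labels, whereas the text above updates one cluster at a time:

```python
# One energy-based cluster update: (1) freeze equal-label bonds with the
# binding probability of equation 2.2 and merge them into clusters via
# union-find; (2) give every cluster a new label by a Gibbs step whose
# energy is the boundary term of equation 2.3 plus the inhibition K_i.
import numpy as np

rng = np.random.default_rng(5)

def find(p, i):                             # union-find with path halving
    while p[i] != i:
        p[i] = p[p[i]]
        i = p[i]
    return i

def ecu_step(w, Jx, Jy, T=0.01, q=10, alpha=1.0):
    n = w.size
    idx = np.arange(n).reshape(w.shape)
    bonds = [(i, j, J) for A, B, Js in
             ((idx[:, :-1], idx[:, 1:], Jx), (idx[:-1, :], idx[1:, :], Jy))
             for i, j, J in zip(A.ravel(), B.ravel(), Js.ravel())]
    wf = w.ravel().copy()

    # (1) bond freezing: P_B = 1 - exp(-0.5 J / T) for equal labels, J > 0
    p = np.arange(n)
    for i, j, J in bonds:
        if wf[i] == wf[j] and J > 0 and rng.random() < 1 - np.exp(-0.5 * J / T):
            p[find(p, i)] = find(p, j)
    root = np.array([find(p, i) for i in range(n)])

    # (2) boundary scores: coupling mass toward each label, per cluster
    score = {}
    for i, j, J in bonds:
        if root[i] != root[j]:
            score[root[i], wf[j]] = score.get((root[i], wf[j]), 0.0) + J
            score[root[j], wf[i]] = score.get((root[j], wf[i]), 0.0) + J
    counts = np.bincount(wf, minlength=q).astype(float)
    for c in np.unique(root):
        members = np.flatnonzero(root == c)
        counts[wf[members[0]]] -= members.size      # exclude own cluster
        gain = np.array([score.get((c, k), 0.0) for k in range(q)])
        gain -= alpha * members.size * counts / n   # global inhibition K_i
        prob = np.exp((gain - gain.max()) / T)      # exp(-E/T), shifted
        new = int(rng.choice(q, p=prob / prob.sum()))
        wf[members] = new
        counts[new] += members.size
    return wf.reshape(w.shape)

# usage: couple a noisy two-region image (see the earlier sketch) and iterate
img = np.zeros((32, 32)); img[:16] = 120.0; img[16:] = 60.0
img += rng.normal(0.0, 5.0, img.shape)
dx = np.abs(np.diff(img, axis=1)); dy = np.abs(np.diff(img, axis=0))
theta = np.concatenate([dx.ravel(), dy.ravel()]).mean()
w = rng.integers(0, 10, size=img.shape)
for _ in range(40):
    w = ecu_step(w, 1.0 - dx / theta, 1.0 - dy / theta)
```

Iterated in this way, the step reproduces the qualitative behavior described in the following sections; the numbers reported in the text refer to the authors' full-neighborhood implementation.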
3 Detailed Balance

We have two sets of variables: the label configuration W ∈ Ω and (similar to the Swendsen and Wang algorithm) the cluster configuration C ∈ Γ. The complete system assumes configurations in the joint configuration space Γ × Ω. The goal of the ECU algorithm is to label an image according to the energy function on the labels, E(W) (see equation 2.1), which leads to an equilibrium probability distribution³ P(W) = (1/Z) exp(−E(W)/T). Labeling could be done simply by Gibbs sampling, for example, but Gibbs sampling of individual spins can be very slow. To speed up sampling, we define an energy function over additional variables, the clusters c, such that the equilibrium distribution P(W, C) = (1/Z) exp(−E(W, C)/T) still has the same marginal distribution, Σ_C P(W, C) = P(W), as defined above. Then we define a Markov process over this joint system consisting of two steps: (1) sampling of clusters given spins, P(W, C → W, C′), and (2) sampling of spins given clusters, P(W, C → W′, C). The claim to prove consists of two aspects: if detailed balance holds, applying these two steps in succession should (1) result in the desired equilibrium distribution P(W, C), which has the desired marginal distribution over spins P(W), and (2) this needs to be the Boltzmann distribution (Swendsen & Wang, 1987; Binder & Heermann, 1988; Neal, 1993).

Under detailed balance, the probability P(W) of a configuration W, multiplied by the probability P(W → W′) of the transition into the configuration W′, is identical to the probability of the reverse process. Hence:

\[
P(W)\,P(W \to W') = P(W')\,P(W' \to W), \tag{3.1}
\]
where P(W) is the probability of a configuration W and P(W → W′) is the probability of the transition between W and W′. The forming of clusters and the assigning of labels are independent events. Therefore, we get the transition probabilities by summing over all
³ Z is the partition function.
possible cluster configurations that allow a transition from spin configuration W to W′ and vice versa:

\[
P(W \to W') = \sum_{C} P(W \to C)\,P(C \to W') \tag{3.2}
\]

\[
P(W' \to W) = \sum_{C} P(W' \to C)\,P(C \to W). \tag{3.3}
\]

Dividing both equations, we have:

\[
\frac{P(W \to W')}{P(W' \to W)}
= \frac{\sum_{C} P(W \to C)\,P(C \to W')}
       {\sum_{C} P(W' \to C)\,P(C \to W)}
= \exp\!\left(\frac{E(W) - E(W')}{T}\right). \tag{3.4}
\]
The right-most equality in equation 3.4 is the condition of detailed balance, which we have to show. This is equivalent to showing that the following equation holds:

\[
\frac{P(W \to W')}{P(W' \to W)}
= \frac{P(W \to C)\,P(C \to W')}{P(W' \to C)\,P(C \to W)}
= \exp\!\left(\frac{E(W) - E(W')}{T}\right). \tag{3.5}
\]

The equivalence of equations 3.4 and 3.5 is spelled out in equations 3.6 and 3.7:

\[
P(W \to C)\,P(C \to W')
= \exp\!\left(\frac{E(W) - E(W')}{T}\right)
\times P(W' \to C)\,P(C \to W) \tag{3.6}
\]
\[
\sum_{C} \bigl[P(W \to C)\,P(C \to W')\bigr]
= \sum_{C} \left[\exp\!\left(\frac{E(W) - E(W')}{T}\right) P(W' \to C)\,P(C \to W)\right]
= \exp\!\left(\frac{E(W) - E(W')}{T}\right) \sum_{C} \bigl[P(W' \to C)\,P(C \to W)\bigr]. \tag{3.7}
\]
Thus, equation 3.5 needs to be proved. In the following equations, only the contributions that differ between the configurations W and W′ are calculated, because all others cancel in equation 3.5:

\[
P(W \to C) \sim \prod_{\substack{\langle ij \rangle_c \\ w_i = w_j,\; c_i \neq c_j}}
\exp\bigl(0.5\,E_{ij}/T\bigr), \tag{3.8}
\]
where ⟨ij⟩_c are the neighboring spins that are taken into account in forming a cluster (see the remarks concerning equation 2.2). The probability of moving from a cluster configuration C to a label configuration W′ is given by:

\[
P(C \to W') \sim
\prod_{\substack{\langle ij \rangle_c \\ c_i \neq c_j,\; w'_i = w'_j}}
\exp\bigl[-0.5\,E_{ij}/T\bigr]
\prod_{\substack{\langle ij \rangle \setminus \langle ij \rangle_c \\ c_i \neq c_j,\; w'_i = w'_j}}
\exp\bigl[-E_{ij}/T\bigr], \tag{3.9}
\]
where ⟨ij⟩\⟨ij⟩_c are the neighboring spins that are not taken into account in forming a cluster. Now equation 3.5 can be evaluated by constructing the remaining terms in analogy to equations 3.8 and 3.9:

\[
\frac{P(W \to C)\,P(C \to W')}{P(W' \to C)\,P(C \to W)}
= \frac{\prod_{\langle ij \rangle,\; c_i \neq c_j,\; w'_i = w'_j} \exp(-E_{ij}/T)}
       {\prod_{\langle ij \rangle,\; c_i \neq c_j,\; w_i = w_j} \exp(-E_{ij}/T)}
= \exp\!\left(\frac{E(W) - E(W')}{T}\right), \tag{3.10}
\]
which proves the claim.

4 Characterization of the ECU Algorithm

In order to quantify the ECU algorithm, we use the magnetization m and the magnetic susceptibility χ of the system (see Chen, Ferrenberg, & Landau, 1992):

\[
m(W) = \frac{q\,N_{\max}(W) - N}{(q-1)\,N},
\qquad
\chi = \frac{N}{T}\bigl(\langle m^2 \rangle - \langle m \rangle^2\bigr), \tag{4.1}
\]
where N_max is the number of spins whose label w occurs most frequently. The magnetic susceptibility χ can be used to localize the different phases of the system, where large fluctuations of the magnetization occur. Other interesting quantities are the spin-spin correlation ⟨δ_{w_i,w_j}⟩, which indicates how many adjacent spins are in the same state, and the number of clusters ⟨Q⟩.

To show the basic properties of the ECU algorithm, a very simple image was presented to the system, consisting of two rectangles with different average gray values and different standard deviations of the noise (see Figure 2A). In the following we demonstrate the behavior of the system for changing temperatures. (This is done merely to characterize the system, because annealing is not necessary when segmenting an image with the ECU algorithm.) Figure 2B shows the averaged magnetization at different temperatures.
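In code, these order parameters take only a few lines (our sketch, not the authors' code):

```python
# Order parameters of equation 4.1: the magnetization of a single label
# configuration and the susceptibility from the fluctuations of m.
import numpy as np

def magnetization(w, q):
    n_max = np.bincount(w.ravel(), minlength=q).max()
    return (q * n_max - w.size) / ((q - 1) * w.size)

def susceptibility(m_samples, N, T):
    m = np.asarray(m_samples, dtype=float)
    return N / T * (np.mean(m**2) - np.mean(m)**2)

rng = np.random.default_rng(6)
print(magnetization(rng.integers(0, 10, (32, 32)), q=10))    # ~0, disordered
print(magnetization(np.zeros((32, 32), dtype=int), q=10))    # 1.0, ordered
```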
Figure 2: The behavior of the system at different temperatures. As system parameters, we use N = 32 × 32, Δ = 1, α = 0, and q = 10. In this simulation, we use cyclic boundary conditions. (A) The stimulus consists of an image containing two rectangles of different average gray values (g_bottom = 60, g_top = 120) and different standard deviations (σ_bottom = 0, σ_top = 25) of the noise. In this particular simulation, Θ (see equation 2.6) was not computed but set to 25. As a function of the temperature we plot (B) the average magnetization and examples of some segmentation results, (C) the average number of clusters, (D) the magnetic susceptibility, and (E) the average spin-spin correlation.
At high temperatures (T → ∞), all spins are totally disordered (lim_{T→∞} ⟨m⟩ = 0). As the system reaches the temperature T_c1, the spins representing the lower rectangle show a phase transition, indicated by an abruptly increased average magnetization, by large fluctuations of the magnetization (χ; see Figure 2D), and by an abrupt decrease in the number of clusters (⟨Q⟩; see Figure 2C). As the temperature is lowered, the magnetization and the spin-spin correlation slowly increase until a second phase transition occurs at T_c2. At this temperature, the spins representing the upper rectangle become ordered, again indicated by abrupt changes of the variables describing the system. In the low-temperature phase (T ≤ 0.4), the spins representing the upper and the lower rectangle have different labels.

5 Comparing Different Algorithms

A comparison of the three update algorithms—local (Gibbs sampler), cluster (Blatt et al., 1996), and the ECU algorithm—is shown in Figure 3. The input for all simulations consists of two rectangles and a thin line surrounding and separating them (see Figure 3A). In all simulations, the parameters are the same, with the exception of the temperature, which is varied.
A
initial configuration
input
local update algorithm
1557
annealing schedule
T0 = 0.6 Tk+1 = 0.99992 Tk
C k = 20000
15000
5010
4970
4880
1
annealing schedule
T0 = 0.6 Tk+1 = 0.8 Tk
D 100
60
40
20
10
1
cluster update algorithm configuration after 500 iterations at different temperatures
E T=0.2
T=0.25
T=0.3
T=0.32
T=0.35
T=0.45
ECU-algorithm configuration after 40 iterations at different temperatures
F T=0.0
T=0.03
T=0.1
T=0.2
T=0.25
T=0.4
Figure 3: Comparison of different update rules. As parameters, we use N = 128 × 128, α = 0, and q = 10. (A) The input image for all simulations consists of two rectangles and a background. Most pixels of the background are covered by the two rectangles, and only thin lines of the background are visible. (B) The initial random configuration. (C, D) Configurations of a local update algorithm (Gibbs sampler) at different iterations (Δ = 5). (E) Configurations of the cluster update algorithm (Blatt et al., 1996) (Δ = 1). (F) Configurations of the ECU algorithm (Δ = 1).
The simulations start with a random configuration, shown in Figure 3B. Figures 3C and 3D show the spin-lattice configurations of a local update algorithm at different iterations. The initial temperature T_0 was set to 0.6. At every iteration, the temperature is cooled according to T_{k+1} = 0.99992 T_k (see Figure 3C). At iteration k ≈ 5000, the rectangles start to move to an ordered configuration, and 50 iterations later both rectangles are ordered. At iteration k ≈ 14,000 the background starts to organize. Although we used a very slow annealing schedule, the background is divided into several segments (k = 20,000).
If one uses faster annealing schedules, which is desirable when confronted with any close-to-real-time problem (e.g., robot vision), the rectangles are also divided into several segments (see Figure 3D).

Cluster update algorithms label the input image correctly only within a small range of temperatures; the problem is to find this range. Figure 3E shows the configuration of the spin lattice after 500 iterations of the cluster update algorithm for several runs at different temperatures. For every temperature, the system starts with the random configuration shown in Figure 3B. At high temperatures (e.g., T ≈ 0.45), the spin lattice is disordered. At temperatures between T = 0.32 and T = 0.35, the input image is more or less correctly labeled by the spin configuration. At T = 0.32, parts of the background and the lower rectangle are bound together. If the temperature is decreased to T = 0.25, all three objects are bound together.

Figure 3F shows the spin-lattice configurations of the ECU algorithm after 40 iterations. The same parameters and the same connection strengths between spins are used as in the previous simulations (Figures 3C–E). If the temperature T is lower than 0.25, the input image is always segmented correctly after 40 iterations. As the temperature increases, the system is increasingly disordered. However, if the temperature were set to zero, randomly occurring incorrect label associations (due to the initialization process) would be frozen, and an incorrect segmentation might occur (see Figure 3F, left). Thus, in general, zero temperature is to be avoided.

6 Introducing Visual Latencies

So far we have described the basic properties of the spin model and the chosen dynamics of the ECU algorithm. Now we introduce visual latencies, which naturally occur in the visual system of the higher vertebrates. The first part of the system (see Figure 1) contains a luminance-dependent delay line. Pixels with a high luminance pass the delay line faster than low-luminance pixels. The time to pass the delay line is defined as

\[
t_{\mathrm{lat}}(i) = (g_{\max} - g_i)\,f_{\mathrm{lat}}, \tag{6.1}
\]
where f_lat is a factor that determines the maximal latency, g_max is the maximal gray value, and g_i is the gray value of pixel i. The latency difference of two pixels i and j is therefore Δt_lat = t_lat(i) − t_lat(j) = (g_j − g_i) f_lat. Due to the temporal structure, which now defines the similarities of the input image, the interaction strength of the spins (see equation 2.5) is redefined in the following way:

\[
J_{i,j} =
\begin{cases}
1 - \dfrac{|t_{\mathrm{lat}}(i) - t_{\mathrm{lat}}(j)|}{f_{\mathrm{lat}}\,\Theta},
& \text{if } t > t_{\mathrm{lat}}(i) \wedge t > t_{\mathrm{lat}}(j) \\[4pt]
0, & \text{else.}
\end{cases} \tag{6.2}
\]
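The latency gating can be sketched as follows (our code, not the authors'):

```python
# Latency recoding of equations 6.1 and 6.2: a coupling becomes nonzero
# only once the global time t exceeds the latencies of both pixels; since
# |t_lat(i) - t_lat(j)| = f_lat * |g_i - g_j|, the active value equals the
# gray-value coupling of equation 2.5.
def t_lat(g, g_max=255.0, f_lat=0.04):
    return (g_max - g) * f_lat               # bright pixels arrive first

def J_lat(gi, gj, t, theta, f_lat=0.04):
    ti, tj = t_lat(gi, f_lat=f_lat), t_lat(gj, f_lat=f_lat)
    if t > ti and t > tj:
        return 1.0 - abs(ti - tj) / (f_lat * theta)
    return 0.0

print(J_lat(200, 190, t=5, theta=30))        # bright pair: active, positive
print(J_lat(200, 60, t=5, theta=30))         # mixed pair: still gated off
print(J_lat(200, 60, t=9, theta=30))         # later: active but repulsive
```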
Figure 4: The stimulus consists of an N = 128 × 128 image (upper left), containing a paraglider, a shaded sky, and some hills. The other panels show the label distribution of the system at different iteration steps. The labels are coded as gray values. Spins with the same label (same gray value) belong to one object; different gray levels indicate the assignment to different objects. As parameters we used T = 0.008, f_lat = 0.04, Δ = 4, α = 1, and q = 10.
Note that equations 2.5 and 6.2 are very similar; the gray-value differences in equation 2.5 are merely recoded as latency differences in equation 6.2. The latency t_lat(i) defines the time when the spin i representing pixel i is activated for the first time. This activation has the consequence that spin i can interact with other spins that are already activated. Therefore, the temporal structure of the input data stream also influences the temporal development of ordered structures in the spin lattice.

7 Segmentation of Real Images

Figure 4 shows snapshots of the label distributions of the model when a stimulus (upper left) is given to the system. In general, a few parameters need to be set in order to make the algorithm work; the following can be used as an adjustment guideline. The temperature is rather uncritical; it should be low but must not be equal to zero (use, e.g., T = 0.01). Furthermore, set f_lat ∈ [0.00, 0.08]; if the image contains very little noise, one should let f_lat → 0. The same holds for Δ: set it to 2.0 for noise-free images and to about 5.0 if noise exists. Setting α to 1.0 will almost always work. The number of labels q is determined by the number of objects that are to be expected in the visual scene; most often q ≈ 10 will suffice.

The simulation starts with a random label configuration, and each label is represented by nearly the same number of spins. During the first iterations, only the brightest objects (short latency) are processed (the paraglider and the lower part of the sky). According to the competition included in the dynamics of the system, the spins of the paraglider will receive the same
label, while the spins representing the sky will get a different label. Due to the latency, processing of the parts with a lower luminance does not start before iteration 9. At iteration 13, the whole image is segmented into four different areas (apart from a small strip with a different label to the right of the paraglider). The efficiency of the ECU algorithm can be judged from iterations 12 and 13, where a surface of nearly 40 × 128 pixels (the hills) is flipped during a single iteration. In comparison, local update algorithms need numerous updates and a careful annealing schedule to achieve this. Figure 5 shows the configuration of the spin lattice at different iterations for two more examples of real images. The input consists of a moose on a meadow (see Figure 5A) and a canvas painting of a woman lying on a sofa (see Figure 5B). Again the brightest areas are processed first: the sky and the woman. The darkest objects, the moose and the sofa, are processed later, at iterations 32 (Figure 5A) and 34 (Figure 5B), respectively. In Figure 5A, five objects are detected: the sky, the meadow, two different hills, and the moose. In Figure 5B, six objects are detected: the wall, the carpet, the woman, and three segments representing the sofa. The sofa is divided into three segments because its parts are disconnected over distances much larger than the interaction range of the spins.

8 Performance Quantification of the ECU Algorithm

The performance of the model is determined in a series of simulations in which the basic parameters of the system (temperature, latency, and the extent of the interaction neighborhood) are varied. The stimulus given to the system consists of two rectangles with average gray values of 100 and 120. The task for the system was to segment input images that contain different amounts of noise, like those shown at the top of Figure 6A. An image is segmented correctly if the spins representing rectangle 1 are all in the same state and the spins representing rectangle 2 are in a different state. The performance is therefore measured by a quantity denoted as the relative amount of misclassified pixels (RAMP). In Figure 6A, RAMP is measured as a function of the latency factor f_lat (see equation 6.1) and noise. The latency factor was varied from f_lat = 0 iterations/Δg up to f_lat = 0.4 iterations/Δg. If the gray-value difference Δg of two pixels is, for example, 20, then the latency difference between these pixels is varied between 0 and 0.4 × 20 = 8 iterations. The signal-to-noise ratio (snr) was varied between ∞ and 1. The temperature was set to 0.05. For each data point, 100 simulations were used; afterward, the data points were smoothed using a weighted binomial averaging procedure. At high snr, the image was always segmented correctly by the ECU algorithm, which is not necessarily the case for local update algorithms at such low temperatures. If √snr reaches 2.5, a few pixels are misclassified,
[Figure 5 panels: (A) input image and label distributions at iterations k = 2, 3, 9, 15, 28, 32, 36, and 40; (B) input image and label distributions at iterations k = 1, 11, 13, 22, 34, 40, 42, and 45.]
Figure 5: (A) The stimulus consists of an N = 128 × 128 image, containing a moose. The panels show the label distribution of the system at different iteration steps. As parameters, we used T = 0.008, f_lat = 0.2, Δ = 5, α = 1, and q = 10. (B) Same as (A), but the input image consists of a painting of a woman and a sofa. Parameters are the same as in (A).
and the RAMP is greater than zero. As the noise increases, the number of misclassified pixels also increases, due to the fact that the similarities between the pixels are no longer calculated correctly. The latency introduced in our system has a positive effect on the segmentation quality, improving it by about 50%. Figure 6B shows the influence of the latency differences on the convergence speed of the system. For this test, a rather simple image is again used (see the inset), containing four square objects. The squares have different gray values, which lead to different total latencies t_lat. The brightest object is always processed at iteration zero.
[Figure 6 panels: (A) RAMP plotted against f_lat (0 to 0.4) and √snr (∞ down to 1), with the input and segmentation result shown on top; (B) t_comp, the number of iterations needed to segment a square, plotted against the unit latency L (0 to 14 iterations), with the stimulus shown as an inset.]
Figure 6: The input image (N = 32 × 32 pixels) consists of two rectangles with average gray values of 100 and 120 and different noise (√snr from ∞ down to 1). (A) RAMP as a function of latency and noise. We set T = 0.05, Δ = 5, α = 0.5, and q = 10. (B) Average number of iterations needed to segment a certain square as a function of the unit latency L. The input image (N = 32 × 32 pixels) consists of four squares with different gray levels (inset). We set Δ = 3, snr = ∞, α = 1, and q = 4.
The second object is processed at iteration t_lat = 1 × L, the third object at iteration t_lat = 2 × L, and so on. The unit latency L is varied between 0 and 14 iterations (abscissa). With a unit latency of L = 0, all objects are processed simultaneously. In 200 simulations using different initial random configurations, the average computational time (t_comp, in number of iterations) is determined to
segment a certain square, and this time is plotted against the unit latency L. We define t_comp = t_seg − t_lat, where t_seg is the actual iteration time reached when a given square is completely segmented and t_lat is the total latency for this square. We use this particular measure because at iterations t < t_lat, nearly no computer time is allocated to the processing of that particular square. In Figure 6B one can see that at L = 0, the average number of iterations needed to segment a square is nearly the same for all four objects (t_comp ≈ 5.7). With increasing latency, the number of iterations necessary for the segmentation decreases for all squares until a plateau is reached (t_comp ≈ 3.1). The number of iterations until a square is segmented is thus reduced by nearly 50% compared to L = 0.
9 Discussion

9.1 Comparing the Algorithms. In this study we compared different versions of update algorithms for image segmentation in spin-lattice models and tried to find a solution to several of the most common problems associated with them. Figure 3 is exemplary for these problems, which unfortunately are so generic that it is much easier to maladjust the update algorithm than to find an acceptable parameter set. In particular, it is almost impossible to find such a set for nasty scenes like the one in Figure 3, which looks so simple. Even with 20,000 iterations in the annealing schedule, the local update algorithm still fails to produce the correct result, because the extent of the local similarities at the border of the image is very small. A much slower annealing would ultimately achieve a correct segmentation, but 20,000 iterations already took more than 10 hours on our workstation. Faster annealing leads to failures even in the large rectangular areas of the image, resulting in a patchwork structure. Commonly used versions of cluster update algorithms (Blatt et al., 1996) could in principle lead to a fast and correct labeling of this image, but the range of temperatures for which this is achieved is in many cases rather small. Figure 3 demonstrates that it is smaller than 0.03 (the interval from 0.32 to 0.35) for this particular picture, and the results after 500 iterations are unsatisfactory. In fact, after several hours of trying to produce better results with this algorithm by adjusting the temperature in ever finer steps, we gave up. The novel ECU algorithm that we designed, on the other hand, produced exact results within only 40 iterations at low temperatures. Thus, our algorithm makes annealing unnecessary (like all cluster update algorithms). In addition, it is robust with respect to the chosen temperature, which usually allows setting the temperature within a wide range. Real images were segmented with the same speed and accuracy. Only for the paraglider image did we spend some time trying to find the optimal parameter set, which finally reduced the number of iterations to 13. For the moose and the painting of the lying woman, we set the
parameters in a single shot according to our previous experience. Thus, in these cases, we needed a few more iterations to reach the final result; nevertheless, segmentation took less than 5 seconds on our SUN SPARC 20, without the program code having been particularly optimized for speed. In addition, our algorithm is able to recover even fine details (like the antlers of the moose), and structures that fade into the background (like its legs) are labeled correctly. Furthermore, the image of the paraglider shows that even rather strong luminance gradients (the sky) are tolerated and correctly treated. The painting demonstrates that complex bent shapes, like the body of the woman with her arms in different positions, are also correctly recognized as long as they are connected. In general, we found that segmentation failed to match our own expectations only in those cases where contextual knowledge is necessary to bind objects correctly.
9.2 Visual Latencies Revisited. Intriguingly, the initial starting point for this study was the realization that our own older studies (Opara & Wörgötter, 1996; Wörgötter et al., 1996) needed to be pursued in a different algorithmic context in order to better advertise the idea of using visual latencies in image segmentation. Latencies are observed in every sensory system of the brain. In particular, it has been reported that different luminance levels induce different propagation delays, such that the neuronal activity arrives in the visual cortex with a different latency (Levick, 1973; Bolz, Rosner, & Wässle, 1982; Sestokas, Lehmkuhle, & Kratz, 1987; Gawne, Kjaer, & Richmond, 1996). In our older studies we introduced this concept into a spiking neural network, and we showed that latencies strongly improve object segmentation in many cases. Top-layer (cortical) neurons of our network representing a bright object are active earlier than those representing a dark object. Consequently, neuronal assemblies that reflect these different objects can synchronize one after the other without mutual disturbance. These studies thus suggested that sensory latencies could play a role in information processing in the brain, as have experimental studies in cat and monkey (Gawne et al., 1996; Wörgötter et al., 1996). From the viewpoint of computer vision, however, the latency mechanism in the spiking neural network had only a conceptual character, because the total CPU time needed to analyze even simple scenes was very large, resembling that of local update algorithms. The question therefore arose of how to implement a visual latency mechanism while avoiding this problem. This finally led us to the ECU algorithm, which by itself outperforms several other spin-lattice segmentation algorithms. In addition, we observed that its performance can be enhanced further, by about 50%, using visual latencies, and the effect is more pronounced for good signal-to-noise ratios.
10 Conclusions

The problems inherent in low-level image segmentation are so complex that the existence of a single optimal algorithmic solution is rather unlikely. What is currently feasible, however, is to push forward the performance limits of individual algorithmic classes. The comparison of the different spin-lattice segmentation algorithms and the introduction of the ECU algorithm in this study were meant as a contribution along these lines. The second goal of this and our older work (Opara & Wörgötter, 1996) was to pursue a neuronal algorithmic concept (latencies) over different implementation stages in order to arrive at a solution that can be used under the close-to-real-time requirements of computer vision problems.

Acknowledgments

We acknowledge the support of the Deutsche Forschungsgemeinschaft (grant WO388 4-2, 5-2, 6-1).

References

Binder, K., & Heermann, D. W. (1988). Monte Carlo simulation in statistical physics. Berlin: Springer-Verlag.
Blatt, M., Wiseman, S., & Domany, E. (1996). Superparamagnetic clustering of data. Phys. Rev. Lett., 76, 3251–3254.
Bolz, J., Rosner, G., & Wässle, H. (1982). Response latency of brisk-sustained (X) and brisk-transient (Y) cells in the cat retina. J. Physiol., 328, 171–190.
Chen, S., Ferrenberg, A. M., & Landau, D. P. (1992). Randomness-induced second-order transitions in the two-dimensional eight-state Potts model: A Monte Carlo study. Phys. Rev. Lett., 69, 1213–1215.
Eckes, C., & Vorbrüggen, J. C. (1996). Combining data-driven and model-based cues for segmentation of video sequences. Paper presented at the WCNN World Congress on Neural Networks, San Diego.
Ernst, U., Pawelzik, K., & Geisel, T. (1994). Multiple phase clustering of globally pulse coupled neurons with delay. In M. Marinaro & P. G. Morasso (Eds.), ICANN '94: Proceedings of the International Conference on Artificial Neural Networks (Vol. 1, pp. 1063–1066). London: Springer-Verlag.
Gawne, T. J., Kjaer, T. W., & Richmond, B. J. (1996). Latency: Another potential code for feature binding in striate cortex. J. Neurophysiol., 76(2), 1356–1360.
Geman, D., Geman, S., Graffigne, C., & Dong, P. (1990). Boundary detection by constrained optimization. IEEE Trans. Pattern Analysis Machine Intelligence, 12(7), 609–628.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Ising, E. (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Physik, 31, 253–258.
Kirkpatrick, S., Gelatt, C. D., Jr., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220, 671–680.
Levick, W. R. (1973). Variation in the response latency of cat retinal ganglion cells. Vision Res., 13, 837–853.
Li, S. Z. (1995). Markov random field modeling in computer vision. Berlin: Springer-Verlag.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys., 21, 1087–1091.
Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Toronto: Department of Computer Science, University of Toronto.
Nischwitz, A., & Glünder, H. (1995). Local lateral inhibition: A key to spike synchronization? Biol. Cybern., 73, 389–400.
Opara, R., & Wörgötter, F. (1996). Using visual latencies to improve image segmentation. Neural Computation, 8, 1493–1520.
Potts, R. B. (1952). Some generalized order-disorder transformations. Proc. Cambridge Philos. Soc., 48, 106–109.
Sestokas, A. K., Lehmkuhle, S., & Kratz, K. E. (1987). Visual latency of ganglion X- and Y-cells: A comparison with geniculate X- and Y-cells. Vision Res., 27, 1399–1408.
Swendsen, R. H., & Wang, J.-S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett., 58, 86–88.
von der Malsburg, C. (1981). The correlation theory of brain function (Int. Rep. 81-2). Göttingen: Department of Neurobiology, Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Vorbrüggen, J. C. (1995). Zwei Modelle zur datengetriebenen Segmentierung visueller Daten. Frankfurt am Main: Verlag Harri Deutsch, Thun.
Wolff, U. (1989). Collective Monte Carlo updating for spin systems. Phys. Rev. Lett., 62, 361–364.
Wörgötter, F., Opara, R., Funke, K., & Eysel, U. (1996). Utilizing latency for object recognition in real and artificial neural networks. NeuroReport, 7, 741–744.

Received April 2, 1997; accepted December 9, 1997.
LETTER
Communicated by Alexandre Pouget
Probability Density Methods for Smooth Function Approximation and Learning in Populations of Tuned Spiking Neurons
Terence David Sanger
Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, Cambridge, MA 02139, and Boston Children's Hospital, Department of Neurology, Boston, MA, U.S.A.
This article proposes a new method for interpreting computations performed by populations of spiking neurons. Neural firing is modeled as a rate-modulated random process for which the behavior of a neuron in response to external input can be completely described by its tuning function. I show that under certain conditions, cells with any desired tuning functions can be approximated using only spike coincidence detectors and linear operations on the spike output of existing cells. I show examples of adaptive algorithms based on only spike data that cause the underlying cell-tuning curves to converge according to standard supervised and unsupervised learning algorithms. Unsupervised learning based on principal components analysis leads to independent cell spike trains. These results suggest a duality relationship between the random discrete behavior of spiking cells and the deterministic smooth behavior of their tuning functions. Classical neural network approximation methods and learning algorithms based on continuous variables can thus be implemented within networks of spiking neurons without the need to make numerical estimates of the intermediate cell firing rates.

1 Introduction

As electrophysiological recording data from multiple cortical cells become available, it is increasingly important that neural network models be able to interpret the information and computations performed by the neural code (Sejnowski, 1995). However, an important gap exists between the physiological representation of information and much of the large body of theoretical work on neural networks. Many neural network models are defined in terms of smooth function approximation (Poggio & Girosi, 1990, for example) under the assumption that continuous variables can be represented in the firing rate of cells (Judd & Aihara, 1993). Cell behavior is described by a tuning curve (Snippe, 1996; see Rieke, Warland, van Steveninck, & Bialek, 1997, for discussion) that gives the average firing rate as a function of some external variable. It is assumed that once firing rates are known, they can
be added, multiplied, and otherwise manipulated according to the rules of continuous mathematics. In fact, this assumption is so basic to such models that it is rarely stated explicitly. Simulated implementations use continuous variables as inputs and hidden units and compute continuous functions at the outputs, without attempting to relate these variables to the actual underlying spike data. The problem with this approach is that it can be difficult to implement in neural hardware. In order to extract the continuous rate variable from the spike data, we must count the number of spikes that occur within some fixed time interval. Since a rapidly (and regularly) firing cell might fire at most 100 spikes per second, we would need to count over at least 1 second in order to have an estimated error of less than 1 percent (although Tovee, Rolls, Treves, & Bellis, 1993, suggest that shorter time intervals may be possible). If the cell's firing is described by a Poisson process with an average rate of 100 spikes per second, then we will need to count over a significantly longer interval in order to make an accurate estimate of the rate (Stein, 1967). In either case, the error in rate estimation means that variables can be only approximately represented, or represented with low precision (Softky, 1996). This situation becomes worse for cells with low firing rates or if the average rate is chaotic (van Vreeswijk & Sompolinsky, 1996) or varying with time (August & Levy, 1996; Gabbiani & Koch, 1996). As experimenters, we frequently have the option of observing a biological network over a long period of time under static or repeating conditions, so we can in fact compute accurate average rates. But a behaving animal cannot afford to wait 1 second or longer at each stage of computation. There are several ways in which a neural implementation of a continuous model might address this problem. One is to assume that there are many copies of each cell, so that we can use spatial averaging or a population code to increase the accuracy of rate estimation for short time windows (Shadlen & Newsome, 1994; Maass, 1995; Shadlen, Britten, Newsome, & Movshon, 1996; Theunissen, Roddey, Stufflebeam, Clague, & Miller, 1996). Although this appears to be wasteful of neural resources, it might be feasible given the large number of cortical cells available. Population averaging may not lead to optimal estimates, however, if there is significant correlation between the outputs of different cells (Zohary, Shadlen, & Newsome, 1994). Other methods note that under certain circumstances there is a remarkable predictability of spike firing times (Bair & Koch, 1996), and one can thus use the interspike interval (Judd & Aihara, 1993; Softky, 1994) or the filtered spike train (Bialek & Rieke, 1992; August & Levy, 1996; Gabbiani & Koch, 1996) as an "instantaneous" estimate of the relevant variable. There is considerable theoretical and experimental evidence to support both rate codes (Shadlen & Newsome, 1994, 1995) and temporal codes (Murthy & Fetz, 1994; Hopfield, 1995; Mainen & Sejnowski, 1995; König, Engel, & Singer, 1996), and some comparisons are given in Stein (1967), Gerstner and van Hemmen (1994), and Softky (1995, 1996). A detailed review can be found in Rieke et al. (1997).
The underlying problem for rate codes is that although the firing rates are indeed continuous and often deterministic functions of the cell inputs, it is difficult to calculate these rates explicitly from the pattern of spike data. This article proposes a new method based on rate coding, in which smooth network approximations can be represented and manipulated by populations of spiking neurons. Like other techniques, this method assumes that information is represented in the instantaneous average rate (or firing probability), but it avoids the need to compute this rate explicitly. Neural populations map directly onto other neural populations in such a way that the firing rate automatically computes desired functions. The mathematical formulation assumes that spike rate is a measure of the probability of firing and that data are encoded in the instantaneous probability of firing. I show that it is possible to connect a network of randomly spiking neurons such that the firing probability of the output neurons is a desired smooth function of the firing probability of the input neurons. For example, consider computing the sum of two functions f(x) and g(x), where f and g are the average probabilities of firing of two different neurons s_f and s_g in response to different stimulus values x. In order to do this computation using explicit firing rates, the network could wait for several seconds, count the input neuron spikes, and then use the total spike count to modulate another neuron's rate. Alternatively, many copies of each of the input neurons could be provided, and the total number of spikes over the input population could determine the output neuron's rate. In any finite time period (or with a finite number of copies of the input neurons), these two methods can only approximate the firing rates f(x) and g(x), and thus the calculation of the output rate f(x) + g(x) will not be exact. I propose instead that if the output neuron is connected so that it fires every time either of the input neurons fires, then the average firing rate of the output neuron is the sum of the input neurons' average firing rates (so long as two spikes do not coincide within the refractory period of the output neuron). If the average firing rate of one of the input neurons changes, so does the firing rate of the output neuron. The probability of firing, and hence the output cell tuning curve, is proportional to f(x) + g(x) from the instant of stimulus onset, although it may not be possible to measure this until at least one spike has occurred. The above example illustrates an important and subtle point that is the foundation for the rest of this article: it is possible to compute an output cell's tuning curve deterministically using connections that transmit only the randomly occurring spikes. This is because the probability of a spike is a continuous deterministic variable that is only implicitly linked to whether a spike actually occurred. Since the tuning curve is the average firing rate for each value of a measured variable, it will be proportional to the conditional probability of firing given that variable. Each cell has an implicit probability of firing, and computations on the probabilities can be performed without ever needing to estimate these probabilities from the pattern of spike firing. Only at the final output stage of the system, when the cells must actually
drive the muscles, is it necessary to use time or space averaging to estimate the firing probabilities from the spike data (Stein, 1967). In practice, if a cell fires rarely, then the implicit probability may be irrelevant, since no information is transmitted in the absence of spikes. This would be an even greater problem for a rate-averaging system, since it would take a long time to estimate the average spike rate.

2 Assumptions of the Poisson Spike Model

The mathematical foundations and fundamental assumptions of the computational method proposed here come from the probability density estimation method for the interpretation of experimental results from population codes (Sanger, 1996). Probability density estimation was not intended to model the extraction of signals within the brain, but rather to indicate the extent to which information is implicit within the neural code. This implicit information is essential to understanding the method proposed here. Related theoretical results on maximum likelihood (ML) estimation have been proposed (Snippe, 1996; Seung & Sompolinsky, 1993; Paradiso, 1988). To summarize the results in Sanger (1996), consider a very simple model for neural firing described by
$$P[\text{spike} \mid x] \propto \sigma(x), \tag{2.1}$$
meaning that the probability of firing is proportional to a smooth function of an externally measurable variable x. The function σ(x) is referred to as the tuning curve of the cell with respect to the variable x, and it describes the average firing rate for each value of x. If the probability of firing within any time interval of the same length is the same, then firing is governed by a Poisson distribution,
$$P[n \text{ spikes} \mid x] = \frac{\sigma(x)^n e^{-\sigma(x)}}{n!}. \tag{2.2}$$
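The Poisson model of equation 2.2 is straightforward to simulate. The following sketch (with a hypothetical raised-cosine tuning curve; converting σ to a per-bin mean count via a bin width Δt is an assumption carried through the rest of this article) confirms that long-time averaging recovers the tuning curve:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-np.pi, np.pi, 100)        # external variable
sigma = 50 * (1 + np.cos(x))               # tuning curve, in spikes per second
dt = 0.01                                  # 10 msec bins

# Equation 2.2: spike counts per bin are Poisson with mean sigma(x) * dt.
spikes = rng.poisson(sigma * dt, size=(10_000, x.size))
rate_estimate = spikes.mean(axis=0) / dt   # long-time average
print(np.abs(rate_estimate - sigma).max()) # sampling error, a few spikes/s
```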
This model is quite general, since x can be a vector if the cell depends on multiple variables, and it can include time if the firing rate is time dependent. The model does not include any representation of refractory periods or the possibility of explicitly coding elements of x using the relative spike times or interspike intervals. From Bayes' law, for an arbitrary number of spikes n, we have
$$P[x \mid \text{spike}] = \sigma(x)^n\, e^{-\sigma(x)}\, P[x] / \mathcal{N}, \tag{2.3}$$
where P[x] is the prior probability distribution of the external variable x (usually controlled by the experimenter) and N is a normalization constant calculated to make the total probability equal to 1. For n = 1, this equation states what we learn about the variable x from a single spike. Under
this model, the conditional density P[x | spike] completely describes the behavior of this cell. The probability density estimation method (Sanger, 1996) makes the assumption that the spike-generating units are independent (see also Maass, 1995). In other words, for any given value of x, whether cell i with tuning curve σ_i(x) fires is independent of whether any other cell fired:
$$P[\text{spike}_i \text{ and } \text{spike}_j \mid x] = P[\text{spike}_i \mid x]\, P[\text{spike}_j \mid x], \tag{2.4}$$
for all values of x. If each cell i fires s_i times during some time interval, then we can immediately calculate the conditional density of x given the entire population of cells as (Sanger, 1996)
$$P[x \mid \text{population}] = \frac{1}{\mathcal{N}}\, P[x] \prod_i e^{-\sigma_i(x)}\, \sigma_i(x)^{s_i}. \tag{2.5}$$
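As an illustration, the following sketch evaluates equation 2.5 numerically for a toy population (hypothetical raised-cosine tuning curves and a flat prior are assumptions; the log-space arithmetic is only for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-np.pi, np.pi, 200)                      # stimulus grid
centers = np.linspace(-np.pi, np.pi, 8, endpoint=False)
sigma = 0.5 + 4 * (1 + np.cos(x[None, :] - centers[:, None]))  # 8 tuning curves

true_idx = 120
s = rng.poisson(sigma[:, true_idx])                      # population spike counts

# Equation 2.5 in log space (flat prior P[x] dropped as a constant):
log_post = np.sum(s[:, None] * np.log(sigma) - sigma, axis=0)
post = np.exp(log_post - log_post.max())
post /= post.sum()                                       # the constant 1/N
print("true x:", x[true_idx], "MAP estimate:", x[np.argmax(post)])
```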
The conditional density of x is implicit in the population-firing response, and this is the most information about x that can be extracted from this population. The importance of this result for the following discussion is that the complete information in the population is encapsulated by a polynomial in the tuning curves σ_i. Other methods for estimating x or the conditional density have been proposed based on linear estimation (Georgopoulos, Kettner, & Schwartz, 1988; Salinas & Abbott, 1994), basis function approximation (Anderson, 1995; Zemel, Dayan, & Pouget, 1998), maximum likelihood (Snippe, 1996; Seung & Sompolinsky, 1993), and iterative network convergence (Pouget & Zhang, 1997; Pouget, Zhang, Deneve, & Latham, 1998).

3 Implicit Computation with Spiking Neurons

Under the models given by equations 2.1 and 2.2, a cell's behavior is completely described by its tuning curve, and the tuning curve can (eventually) be extracted as the long-time average firing rate of the cell for each value of x. Therefore, the long-time spike data and the tuning curve are essentially two different ways of describing the same cell. This is the basis of rate coding. In order to create a cell with any particular behavior, we need only choose a desired tuning curve and synthesize a cell with this tuning curve using the spike data from existing cells. For example, a motor neuron controlling a muscle that is intended to fire for movements extending a particular joint needs to be connected to other joint-related neurons (with a variety of different tuning curves) in such a way that its tuning curve is a monotonically increasing function of joint angle. I now describe how simple neural connectivity can be used to synthesize rate-coded cells with almost arbitrary tuning curves. Consider a neuron k that is a coincidence detector for neurons i and j and thus has a nonzero probability of firing only if i and j both fire within
a small time window Δt. In this situation, we can write
$$E[s_k] = \lambda\, s_i s_j, \tag{3.1}$$
where E[s_k] is the expected number of output spikes for neuron k within time Δt, s_i and s_j are the numbers of spikes in Δt for each input neuron, and λ is a scaling factor. Since Δt is small, s_i and s_j will usually be either zero or one, so the probability of firing E[s_k] is usually a boolean function of s_i and s_j. Assume now that cell k is in fact a true multiplier, so that equation 3.1 also holds for s_i and s_j greater than 1 (more than one spike in Δt). Then the tuning curve for fixed s_i and s_j is given by
$$\sigma_k = \frac{E[s_k]}{\Delta t} = \frac{\lambda\, s_i s_j}{\Delta t},$$
where σ_k is the instantaneous average firing rate in spikes per second for neuron k. Note that σ_k is a function of the two random variables s_i and s_j, and thus the event of neuron k firing is a Poisson process modulated by a random variable. The expected firing rate given x is
$$E[\sigma_k(x)] = (\lambda/\Delta t)\, E[s_i s_j \mid x], \tag{3.2}$$
and since P[s_i s_j | x] = P[s_i | x] P[s_j | x] by the independence assumption, we have
$$E[\sigma_k(x)] = (\lambda/\Delta t)\, E[s_i \mid x]\, E[s_j \mid x] = \lambda\, \Delta t\, \sigma_i(x)\, \sigma_j(x), \tag{3.3}$$
where E[s_i | x] is the expected number of spikes in time Δt given x, and σ_i(x) is the average number of spikes per second for each value of x. We need to use the expected value E[σ] here because the firing rate σ is now a random variable that depends on s_i and s_j. The expected tuning curve (and thus the average firing rate) for cell k is proportional to the product of the tuning curves for cells i and j. The Δt term in the product reflects the fact that the firing rate is lower if the spike coincidence (or product) must occur within a shorter time window. The continuous product relationship between the tuning curves was created simply by making cell k compute the integer product of the number of input spikes occurring within a small time interval. For sufficiently small Δt, cell k functions as a coincidence detector. There is no need to compute the firing rates or tuning curves of cells i or j explicitly. Figure 1 shows a simulated example. The tuning curves for neurons A and B are given by shifted raised cosine functions with a nonzero "dc" component. The maximum firing rate is 100 spikes per second, and spikes are generated randomly using Poisson statistics according to equation 2.2. Neuron C is a spike coincidence detector during 10 msec intervals.
Figure 1: Simulation of tuning curve multiplication. The probability that cell C fires is nonzero only when both cells A and B fire simultaneously. Each row of the simulated spike plots shows the simulated spikes over time for one fixed value of x. σA (x) and σB (x) are the tuning curves for cells A and B, and σC (x) is the tuning curve for C estimated from the spike data. On the right side of the figure, σC is compared to the product of σA and σB .
The resulting measured tuning curve σ_C is compared to the product of the input tuning curves σ_A · σ_B. Suppose now that neuron k has a nonzero probability of firing λ only if neuron i fires exactly n times during some time interval Δt. Then, from equation 2.2,
$$E[\sigma_k(x)] = \lambda\, P[i \text{ fires } n \text{ times} \mid x] = \frac{\lambda}{n!}\, \sigma_i^n\, \Delta t^n\, e^{-\sigma_i \Delta t}. \tag{3.4}$$
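Equation 3.4 is easy to check numerically; the following sketch (with an assumed constant rate σ_i and gate count n = 2) compares the empirical firing rate of such a gated cell with the Poisson prediction:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(5)
dt, n, lam, sigma_i = 0.005, 2, 1.0, 40.0      # gate on exactly n = 2 spikes
s = rng.poisson(sigma_i * dt, 200_000)         # input spike counts per bin
gate_rate = lam * (s == n).mean()              # empirical E[sigma_k] per bin
theory = lam * (sigma_i * dt) ** n * np.exp(-sigma_i * dt) / factorial(n)
print(gate_rate, theory)                       # both near 0.016 for these values
```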
For fixed and small Δt, this expression becomes proportional to the nth power of σ_i, which shows how an output neuron can approximately compute powers of the input neuron tuning curves. Now consider a cell whose firing rate is a weighted linear combination of the spike outputs of a set of other cells, so that the expected number of spikes within Δt is
$$E[s_k] = \sum_i w_i s_i, \tag{3.5}$$
and the average firing rate in spikes per second is given by
$$\sigma_k = \sum_i w_i s_i / \Delta t. \tag{3.6}$$
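A short simulation in the spirit of Figures 1 and 2 (a sketch with hypothetical raised-cosine tuning curves, not the code behind the figures) confirms both constructions: the coincidence detector's tuning curve approaches λΔt σ_A(x)σ_B(x), and the spike-merging cell's tuning curve approaches σ_A(x) + σ_B(x):

```python
import numpy as np

rng = np.random.default_rng(0)
dt, trials = 0.01, 2000                        # 10 msec bins, repeats per x
x = np.linspace(-np.pi, np.pi, 60)
sig_a = 50 * (1 + np.cos(x - 0.5))             # hypothetical tuning curves
sig_b = 50 * (1 + np.cos(x + 0.5))             # (peaks near 100 spikes/s)

s_a = rng.poisson(sig_a * dt, (trials, x.size))
s_b = rng.poisson(sig_b * dt, (trials, x.size))

# Coincidence detector (equation 3.1 with lambda = 1): output count s_a * s_b,
# so its tuning curve should approach dt * sig_a * sig_b (equation 3.3).
sig_c_mult = (s_a * s_b).mean(axis=0) / dt
# Spike merging (equation 3.6 with all w_i = 1): output fires s_a + s_b times,
# so its tuning curve should approach sig_a + sig_b (equation 3.7).
sig_c_add = (s_a + s_b).mean(axis=0) / dt

print(np.abs(sig_c_mult - dt * sig_a * sig_b).max())  # small vs. peak near 90
print(np.abs(sig_c_add - (sig_a + sig_b)).max())      # small vs. peak near 190
```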
Figure 2: Simulation of tuning curve addition. The probability that cell C fires is proportional to the sum of the total number of spikes of cells A and B in any time interval. On the right side of the figure, σC (x) is compared to the sum of σA and σB .
The "synaptic weights" w_i are real valued and indicate the incremental change in the average firing rate of neuron k when neuron i fires. Then
$$E[\sigma_k(x)] = \sum_i w_i\, E[s_i \mid x] / \Delta t = \sum_i w_i\, \sigma_i(x), \tag{3.7}$$
so the average firing rate is a linear combination of the firing rates of the input neurons σ_i(x). Note that the tuning curve σ_k was produced as a linear combination of the σ_i's without ever calculating the actual firing rates of the input neurons. In fact, σ_k itself remains implicit and cannot be extracted without time averaging. Figure 2 shows a simulated example. Again, the input neurons A and B have tuning curves that are shifted raised cosines, and they fire Poisson-distributed spikes with a maximum average firing rate of 100 spikes per second. Neuron C fires according to Poisson statistics with an instantaneous firing probability proportional to the sum of the number of spikes s_A + s_B within the previous 10 msec window. The resulting sample tuning curve σ_C is compared to the sum of the input tuning curves σ_A + σ_B. Since we can now calculate linear combinations, powers, and products of input tuning curves, we have the ability to calculate polynomials and thus can approximate any polynomial function of the input probability distributions. (Polynomial computations and multiplicative interactions in neural
networks are discussed in Koch & Poggio, 1992; Mel, 1993; and Tal & Schwartz, 1997.) The output neurons are Poisson distributed by assumption, although they are modulated by time-varying rates σ that depend on the particular input spike firing times. Succeeding neural layers can make use of the same type of computation; however, if the input neurons are Poisson, then the output neurons will continue to satisfy an independent-interval property (the probability of n spikes in any Δt is independent of the number of spikes in any other time interval), and thus they will be Poisson distributed at all time scales. It is important to note some interesting properties of this form of implicit computation. The actual spike rates do not need to be calculated until they are used at the output stage to control muscles. The probability values propagate implicitly. Although spikes are random binary events, the underlying probabilities are continuous variables that can be computed exactly. The computational delay is equal to the synaptic delay; an output cell has the correct probability of firing as soon as an input cell spike could have affected it, even if the input cell has not yet actually fired. This is often a moot point, since the cell cannot communicate information until it fires; but over many trials, the output cell would be predicted to be tuned immediately after stimulus onset. Time or population averaging may be needed to read out the information or to control muscles, but it is not needed for intermediate computations. An alternative model that has very short synaptic delays and is based on relative spike timing can be found in Maass (1994, 1996, 1997).

4 Learning Algorithms

In the previous section, I showed how it is possible to perform smooth function approximation on the implicit firing probability of cells. The question naturally arises as to whether desired functions can be learned from examples. In this section, I show that it is possible to find spike-equivalent forms of common neural network algorithms such that the underlying implicit probabilities converge smoothly according to the network algorithms, while the synaptic weights change in response to cell spiking. The spike algorithms are closely related to neural network models of stochastic convergence, since it turns out that the expected change in weights under the spike-based algorithms is approximately equal to the weight change under the smooth, implicit algorithms. There are many examples in biology of adaptation in response to spike coincidence, although the relationship to current neural network models can be difficult to elucidate (Magee & Johnston, 1997; Markram, Lubke, Frotscher, & Sakmann, 1997). First, I show an example of a supervised learning algorithm whose average behavior approximates the Widrow-Hoff LMS algorithm (Widrow & Hoff, 1960; Widrow, McCool, Larimore, & Johnson, 1976). Let p be a "teaching" neuron with tuning curve σ_p(x), and let k be a neuron that we would like to train so that E[σ_k(x)] = σ_p(x). The firing rate of k is a linear
combination of the spikes of other cells i according to equation 3.6. The training algorithm has access only to the spike data, and it must adapt the synaptic weights w_i from the set of input neurons i to the output neuron k. Let s_p, s_k, and s_i be the numbers of spikes in cells p, k, and i during time Δt, and let γ be the learning rate. Consider the adaptation algorithm
$$\Delta w_i = \gamma\, (s_p - s_k)\, s_i. \tag{4.1}$$
The expected change in weight during time Δt is
$$E[\Delta w_i \mid x] = \gamma\, E[(s_p - s_k)\, s_i \mid x] \tag{4.2}$$
$$= \gamma\, (E[s_p s_i \mid x] - E[s_k s_i \mid x]) \tag{4.3}$$
$$= \gamma\, (E[s_p \mid x]\, E[s_i \mid x] - E[s_k s_i \mid x]), \tag{4.4}$$
since neurons p and i have independent firing by assumption. Neurons k and i are not independent, since the firing rate of k is directly modulated by whether neuron i fired. We have
$$E[s_k s_i \mid x] = E[s_k \mid s_i, x]\, E[s_i \mid x] \tag{4.5}$$
$$= \Bigl( w_i + \sum_{j \neq i} w_j\, E[s_j \mid x] \Bigr) E[s_i \mid x] \tag{4.6}$$
$$= \Bigl( w_i (1 - E[s_i \mid x]) + \sum_j w_j\, E[s_j \mid x] \Bigr) E[s_i \mid x] \tag{4.7}$$
$$= \bigl( w_i (1 - E[s_i \mid x]) + E[s_k \mid x] \bigr) E[s_i \mid x], \tag{4.8}$$
and combining gives
$$E[\Delta w_i \mid x] = \gamma\, (E[s_p \mid x] - E[s_k \mid x])\, E[s_i \mid x] - \gamma\, w_i (1 - E[s_i \mid x])\, E[s_i \mid x] \tag{4.9}$$
$$= \gamma (\Delta t)^2 \Bigl( (\sigma_p(x) - \sigma_k(x))\, \sigma_i(x) - w_i \Bigl( \frac{1}{\Delta t} - \sigma_i(x) \Bigr) \sigma_i(x) \Bigr). \tag{4.10}$$
The first term on the right of equation 4.9 is the Widrow-Hoff supervised linear learning rule. The last term is always opposite in sign to w_i and gives a decay term for w_i that is most significant if the average firing rate for neuron i is close to 1/(2Δt). The factor of (Δt)² shows that the learning rate decreases if spike coincidence must occur in a shorter time window. An illustrative example of the convergence of the tuning curve when synapses are trained according to equation 4.1 is shown in Figure 3.
Figure 3: The normalized mean squared error as a function of time for a neuron trained with the supervised spike learning algorithm. There were 20 input cells with a maximum spike rate of 50/second and randomly chosen smooth tuning curves. The target output cell tuning curve was also chosen randomly. The learning rate is 0.001, and spike coincidence was calculated within 20 msec time windows.
Twenty input cells and a target output cell were simulated with randomly generated smooth tuning curves and a maximum firing rate of 50 spikes per second. Learning occurred in 20 millisecond time bins (Δt = 0.02 second), and convergence is shown for a total of 3000 trials (60 seconds).
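In code, the spike rule of equation 4.1 might look as follows (a sketch under assumed raised-cosine tuning curves; it is not the simulation that produced Figure 3):

```python
import numpy as np

rng = np.random.default_rng(3)
dt, gamma, n_in = 0.02, 0.001, 20
x = rng.uniform(-np.pi, np.pi, 3000)                    # one stimulus per trial
phase = rng.uniform(-np.pi, np.pi, n_in)
sig_in = 25 * (1 + np.cos(x[:, None] - phase[None, :])) # input tuning curves
sig_p = 25 * (1 + np.cos(x))                            # teacher tuning curve

w = np.zeros(n_in)
for t in range(x.size):
    s_i = rng.poisson(sig_in[t] * dt)                   # input spike counts
    s_p = rng.poisson(sig_p[t] * dt)                    # teacher spike count
    # Output cell: rate = sum_i w_i s_i / dt (equation 3.6), so the mean
    # count per bin is w @ s_i (clipped at zero for negative early weights).
    s_k = rng.poisson(max(w @ s_i, 0.0))
    w += gamma * (s_p - s_k) * s_i                      # equation 4.1

# The implied output tuning curve should come to resemble the teacher.
print(np.corrcoef(sig_in[-500:] @ w / dt, sig_p[-500:])[0, 1])
```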
I now show an example of an unsupervised Hebbian learning algorithm that causes the tuning curves to converge to the eigenfunctions of the input probability density. Previous examples of Hebbian learning for spike data are found in Brown, Zador, Mainen, and Claiborne (1992) and Gerstner, Kempter, van Hemmen, and Wagner (1996), and the relationship between principal components analysis and information maximization for spike trains is elegantly derived in Fry (1995). Assume there are multiple outputs σ_k for many values of k. Let
$$\sigma_k = \frac{1}{\Delta t} \sum_i w_{ki}\, s_i,$$
and consider the learning algorithm,
$$\Delta w_{ki} = \gamma \Bigl( s_i - \sum_{j \le k} w_{ji}\, s_j \Bigr) s_k. \tag{4.11}$$
Here s_j is an output neuron. Assume the spike generators for s_j, j ≠ k, are almost independent of s_k in the sense that P[s_j s_k | x] ≈ P[s_j | x] P[s_k | x]. (This is not strictly true, since s_j and s_k may have common inputs, but the
inaccuracy will be small if the number of inputs is large.) Then we can compute the average weight update during Δt as
$$E[\Delta w_{ki} \mid x] = \gamma\, E\Bigl[ \Bigl( s_i - \sum_{j \le k} w_{ji}\, s_j \Bigr) s_k \Bigm| x \Bigr] \tag{4.12}$$
$$= \gamma \Bigl( E[s_i \mid x] - \sum_{j \le k} w_{ji}\, E[s_j \mid x] \Bigr) E[s_k \mid x] + \gamma\, w_{ki} (1 - E[s_i \mid x])\, E[s_i \mid x] \tag{4.13}$$
$$= \gamma (\Delta t)^2 \Bigl( \sigma_i(x) - \sum_{j \le k} w_{ji}\, \sigma_j(x) \Bigr) \sigma_k(x) + \gamma (\Delta t)^2\, w_{ki} \Bigl( \frac{1}{\Delta t} - \sigma_i(x) \Bigr) \sigma_i(x). \tag{4.14}$$
It should be noted that the first term of equation 4.14 describes the generalized Hebbian learning algorithm that finds the principal components of the input distribution (Sanger, 1989). The second term in equation 4.14 is proportional to w_ki and introduces a bias not present in the generalized Hebbian algorithm. Figure 4 shows an example of the convergence of the unsupervised spike algorithm (see equation 4.11) for simulated data. The graph gives the diagonalization score, which is a measure of the extent to which the matrix of cross-products of network outputs E[σ_k(x)σ_m(x)] is diagonal. The score is the sum of the magnitudes of the off-diagonal elements divided by the sum of the magnitudes of the diagonal elements; low numbers indicate a more diagonal matrix and therefore a set of relatively uncorrelated tuning curves. There are 20 input cells with smooth, randomly generated tuning curves and a maximum firing rate of 50 spikes per second. There are 4 output cells, which were trained for 1000 iterations, with each step occurring within a 20 millisecond bin, for a total learning time of 20 seconds. The learning rate was annealed from 0.015 to 0.0055 during the training. Figure 5 shows the results of another simulation of the generalized Hebbian algorithm for 20 input cells (only 8 are shown) and 8 output cells. The other parameters are the same as for Figure 4. The figure shows 1-second spike tracings for the input cells on the left and the output cells on the right, demonstrating that output cells can have tuning curves that are more broadly or more narrowly tuned than the inputs. Because of the nature of the generalized Hebbian algorithm, the tuning curves for the output cells are orthogonal, and they proceed in order from greatest to least average firing rate. Both the supervised and unsupervised algorithms make use of random spike data to perform learning whose long-time average behavior approximates smooth learning algorithms.
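A corresponding sketch of the unsupervised rule of equation 4.11 (again with assumed tuning curves and the annealing schedule described above; it is not the simulation behind Figures 4 and 5):

```python
import numpy as np

rng = np.random.default_rng(4)
dt, n_in, n_out, steps = 0.02, 20, 4, 1000
phase = rng.uniform(-np.pi, np.pi, n_in)
W = 0.01 * rng.standard_normal((n_out, n_in))       # weights w_ki

for t in range(steps):
    gamma = 0.015 + (0.0055 - 0.015) * t / steps    # annealed learning rate
    x = rng.uniform(-np.pi, np.pi)
    s_i = rng.poisson(25 * (1 + np.cos(x - phase)) * dt)  # input spike counts
    s_k = rng.poisson(np.clip(W @ s_i, 0.0, None))  # output counts, mean W s_i
    recon = np.cumsum(s_k[:, None] * W, axis=0)     # sum_{j <= k} w_ji s_j
    W += gamma * (s_i[None, :] - recon) * s_k[:, None]   # equation 4.11

# Diagonalization-style check: cross-products of the implied output curves.
xs = np.linspace(-np.pi, np.pi, 100)
sig_in = 25 * (1 + np.cos(xs[:, None] - phase[None, :]))
sig_out = sig_in @ W.T / dt                         # implied output tuning curves
C = sig_out.T @ sig_out
print((np.abs(C).sum() - np.trace(C)) / np.trace(C))  # lower = more orthogonal
```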
Figure 4: The diagonalization score as a function of time for the spike unsupervised generalized Hebbian algorithm. There are 20 input cells, 4 output cells, and 1000 time steps of 20 msec each. The learning rate decreased from 0.015 to 0.0055 during training.
These algorithms demonstrate the possibility of performing standard learning tasks in the single-spike domain.

Figure 5: Spike tracings for input and output neurons after training with the generalized Hebbian algorithm (see text for details). Input neurons are shown in the column on the left and output neurons on the right. The lines in the middle illustrate the connectivity. Of the 20 input neurons, only the first 8 are shown. The output neurons are in decreasing order of average spike rate. Each row of the spike plots gives 1 second of simulated spike data for a fixed value of movement direction. Different rows of each plot show the neuron's spike response for different movement directions.

5 Unsupervised Learning and Information

After the unsupervised algorithm has been trained, the tuning curves are orthogonal. In this section I show that this implies that the spike trains then become independent. This is important, since it implies that orthogonality of the tuning curves is the most separate that we can make the behavior of two cells. To see this, note that the mean of the conditional density P[s_i | x] is given by ∫ P[s_i | x] P[x] dx = P[s_i], and define the zero-mean function f_i(x) = P[s_i | x] − P[s_i], for which ∫ f_i(x) P[x] dx = 0. Under the independence assumption we can write
$$P[s_i s_j] = \int P[s_i s_j \mid x]\, P[x]\, dx \tag{5.1}$$
$$= \int P[s_i \mid x]\, P[s_j \mid x]\, P[x]\, dx \tag{5.2}$$
$$= \int \bigl( f_i(x) + P[s_i] \bigr) \bigl( f_j(x) + P[s_j] \bigr) P[x]\, dx \tag{5.3}$$
$$= P[s_i]\, P[s_j] + \int f_i(x)\, f_j(x)\, P[x]\, dx, \tag{5.4}$$
Z
(n log σ (x) − σ (x) + log P[x])P[x]dx N = (H[X] − P[s] + H[σ ])/N ,
H[X | S] =
(5.5)
so the mutual information is maximized when the probability of firing P[s] is large and the entropy of the tuning curve H[σ ] is small. For tuning curves that are gaussian, the entropy is minimized when the variance is small. There is thus a parallel between linear properties of the zero-mean tuning curves such as cross-correlation, and nonlinear properties of the spike trains such as independence. The existence of such relationships shows that simple linear algorithms on the underlying probability distributions can lead to optimal nonlinear behavior at the level of the spikes. This fact can considerably simplify the types of calculations that are necessary and gives neural networks an important set of tools for performing nonlinear computations. 6 Duality The results above demonstrate a close link between the behavior of spiking neurons and the underlying implicit tuning curves that describe their conditional probability of firing for different values of x. This link is a natural and unavoidable property of probabilistic models of spiking neurons. It is not specific to Poisson firing statistics, but will arise for any neuron model in which the probability of firing is smoothly modulated by a continuous tuning curve. The spike behavior and the tuning curve properties are two
1582
Terence David Sanger
different yet completely equivalent ways of describing a neuron’s response, and we can take advantage of their different mathematical properties to simplify certain types of computations. The equivalence of these two descriptions leads to a duality relationship between the space of probabilistic events (spikes) and the space of deterministic bounded integrable functions (tuning curves) of a random variable x. (See Zemel et al., 1998, for an alternate discussion of this type of duality.) Independence in the spike space is related to orthogonality in the tuning curve space. Polynomial functions in the two spaces are equivalent (under the assumption of independent spike generators). I hypothesize that for any learning algorithm in the tuning curve space, there is a spike learning algorithm with similar convergence properties, as I have shown above for the LMS and generalized Hebbian algorithms.
7 Conclusion The analysis given above shows the possibility of a direct link between familiar continuous neural network algorithms and learning algorithms performed by spiking neurons. It is not necessary to postulate that unit activity levels in a neural network are somehow formed from average neural firing rates (which require a long time to estimate), since these activity levels can be directly represented by the instantaneous probability of firing. Computations can be performed directly on firing probabilities without the necessity of first estimating these probabilities by estimating the average firing rate. This saves considerable time in synaptic transmission and provides a better model for the rapid transmission of information in large biological neural networks. The model still begs the question of how the data are eventually read out. Stein (1967) gives a detailed discussion of approximation error for reading out rate codes. As for any neuronal processing, the code must necessarily be mapped from the final layer of processing onto the relevant muscles. The significant advantage of the model proposed here is that the rate code needs to be interpreted only at the final step of computation, so that it does not contribute to approximation errors at every step. This model is a form of rate coding that provides a rapid computational method based on average firing rates, without explicitly computing those rates. It is different from codes that base processing on explicit differences in spike firing times or interspike intervals. For example, Softky (1996) argues for the use of a binary pulse code for maximizing the transmission of information. Such a code is definitely superior at the time of read-out, but it has considerable disadvantages in terms of the complexity of intermediate computations and learning. The distinction between “simple” rate codes and “efficient” binary spike codes (Softky, 1995) is blurred by the model proposed here, since we have the advantages of simple computation based
Function Approximation in Spiking Neurons
1583
on average firing rate without the disadvantage of slow and error-prone spike counting. It is important to realize that these results are inherent properties of probabilistic models of spiking neurons. Whether a system chooses to make use of these properties, they are necessarily present. A cell with a tuning curve has an implicit conditional probability of firing, and when connected to other cells will have a direct effect on their probabilities of firing. This effect requires only the synaptic transmission time and is not dependent on temporal integration of spike rates. There is evidence that cell tuning is present within the first few spikes after stimulus onset, thus suggesting that rapid processing of rate information is indeed occurring in biological systems (Celebrini, Thorpe, Trotter, & Imbert, 1993; Tovee et al., 1993). This work has shown probabilistic methods for interpreting the patterns of activity in populations of spiking neurons, computing new neuron activities with desired tuning curves, and learning in both supervised and unsupervised modes. It is hoped that such methods may form the basis for a probabilistic “neural calculus” that can allow interpretation of biological neural systems and explain some of the ways in which those systems perform internal computations. Acknowledgments I thank Emilio Bizzi, Simon Giszter, Alex Pouget, Peter Dayan, Barak Pearlmutter, and the reviewers for helpful comments and discussions. I was supported by a McDonnell-Pew Postdoctoral Fellowship. Dr. Bizzi’s laboratory is supported by grants from the National Institutes of Health (NS09343) and the Office of Naval Research (N00014/90/J/1946). References Anderson, C. H. (1995). Unifying perspectives on neuronal codes and processing. In Proc. 19th Int’l Workshop on Condensed Matter Theories. Caracas, Venezuela. August, D. A., & Levy, W. B. (1996). A simple spike train decoder inspired by the sampling theorem. Neural Computation, 8, 67–84. Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving Macaque monkey. Neural Computation, 8, 1185–1202. Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends in Neuroscience, 15(11), 428–434. Brown, T. H., Zador, A. M., Mainen, Z. F., & Claiborne, B. J. (1992). Hebbian computations in hippocampal dendrites and spines. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 81–116). San Diego: Academic Press. Celebrini, S., Thorpe, S., Trotter, Y., & Imbert, M. (1993). Dynamics of orientation coding in area V1 of the awake primate. Visual Neuroscience, 10, 811–825.
1584
Terence David Sanger
Fry, R. L. (1995). Observer-participant models of neural processing. IEEE Trans. Neural Networks, 6(4), 918–928. Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Computation, 8, 44–66. Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J. Neurosci, 8(8), 2928–2937. Gerstner, W., & van Hemmen, J. L. (1994). How to describe neuronal activity: Spikes, rates, or assembiles? In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processin systems 6 (pp. 463–470). San Mateo, CA: Morgan Kaufmann. Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78. Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36. Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215. Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 315–345). San Diego: Academic Press. Konig, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends in Neuroscience, 19, 130–137. Maass, W. (1994). On the computational complexity of networks of spiking neurons. In G. Tesauro, D. Touretzky, & D. Leen (Eds.), Advances in neural information processing systems 7 (pp. 183–190). Cambridge, MA: MIT Press. Maass, W. (1995). On the computational power of noisy spiking neurons. In G. Tesauro, D. Touretzky, & D. Leen (Eds.), Advances in neural information processing systems 7 (pp. 183–190). Cambridge, MA: MIT Press. Maass, W. (1996). Lower bounds for the computational power of networks of spiking neurons. Neural Computation, 8, 1–40. Maass, W. (1997). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304. Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213. Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506. Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215. Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiology, 70(3), 1086–1101. Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Computation, 6, 1111–1126. Paradiso, M. A. (1988). A theory for the use of visual orientation information
which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Pouget, A., & Zhang, K. (1997). Statistically efficient estimation using cortical lateral connections. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9. Cambridge, MA: MIT Press.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population code. Neural Computation, 10, 373–401.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Computational Neuroscience, 1, 89–107.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473.
Sanger, T. D. (1996). Probability density estimation for the interpretation of neural population codes. J. Neurophysiology, 76(4), 2790–2793.
Sejnowski, T. J. (1995). Time for a new neural code? Nature, 376, 21–22.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.
Shadlen, M. N., Britten, K. H., Newsome, W. T., & Movshon, J. A. (1996). A computational analysis of the relationship between neuronal and behavioral responses to visual motion. J. Neuroscience, 16(4), 1486–1510.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–529.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58(1), 13–41.
Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5, 239–247.
Softky, W. R. (1996). Fine analog coding minimizes information transmission. Neural Networks, 9(1), 15–24.
Stein, R. B. (1967). The information capacity of nerve cells using a frequency code. Biophysical Journal, 7, 797–826.
Tal, D., & Schwartz, E. L. (1997). Computing with the leaky integrate-and-fire neuron: Logarithmic computation and multiplication. Neural Computation, 9.
Theunissen, F., Roddey, J. C., Stufflebeam, S., Clague, H., & Miller, J. P. (1996). Information theoretic analysis of dynamical encoding by four identified primary sensory interneurons in the cricket cercal system. J. Neurophysiology, 75(4), 1345–1364.
Tovee, M. J., Rolls, E. T., Treves, A., & Bellis, R. P. (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. J. Neurophysiology, 70(2), 640–654.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. In IRE WESCON Conv. record, Part 4 (pp. 96–104).
Widrow, B., McCool, J. M., Larimore, M. G., & Johnson, C. R. (1976). Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proc. IEEE, 64(8), 1151–1162.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received March 26, 1997; accepted December 10, 1997.
LETTER
Communicated by C. Lee Giles and Eric Mjolsness
A Potts Neuron Approach to Communication Routing
Jari Häkkinen, Martin Lagerholm, Carsten Peterson, Bo Söderberg
Complex Systems Group, Department of Theoretical Physics, University of Lund, SE-223 62 Lund, Sweden
A feedback neural network approach to communication routing problems is developed, with emphasis on multiple shortest path problems, with several requests for transmissions between distinct start and end nodes. The basic ingredients are a set of Potts neurons for each request, with interactions designed to minimize path lengths and prevent overloading of network arcs. The topological nature of the problem is conveniently handled using a propagator matrix approach. Although the constraints are global, the algorithmic steps are based entirely on local information, facilitating distributed implementations. In the polynomially solvable single-request case, the approach reduces to a fuzzy version of the Bellman-Ford algorithm. The method is evaluated for synthetic problems of varying sizes and load levels, by comparing to exact solutions from a branch-and-bound method, or to approximate solutions from a simple heuristic. With very few exceptions, the Potts approach gives high-quality legal solutions. The computational demand scales merely as the product of the numbers of requests, nodes, and arcs.
Neural Computation 10, 1587–1599 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

Communication-routing resource allocation problems are becoming increasingly relevant given the upsurge in demand for the Internet and other telecommunication services. One such problem amounts to assigning arcs in a connected network to requests from start to end nodes, given capacity constraints on the arcs, such that a total additive cost (path length) is minimized. (For a review of notation and existing routing techniques, see Bertsekas & Gallager, 1987.) A relatively simple routing problem, with only one request at a time, is the shortest path problem (SPP), which can be solved exactly in polynomial time using, for example, the Bellman-Ford (BF) algorithm (Bellman, 1958) (see also Bertsekas & Gallager, 1987). The multiple shortest path problem (MSPP), where arcs are allocated simultaneously to several requests, is more difficult. We strongly suspect MSPP to be NP-hard, although to our knowledge this has not been proved in the literature.

In this article we address the MSPP using feedback Potts neural networks, which have proved to be powerful in other resource allocation problems, with (Lagerholm, Peterson, & Söderberg, 1997) or without (Gislén, Peterson, & Söderberg, 1992) a nontrivial topology. For each request, we assign a Potts network, with units encoding which arcs are to be used by that request. Appropriate energy terms are constructed in terms of the Potts neurons to minimize total path lengths and ensure that capacity constraints are not violated. Mean field (MF) equations are iterated using annealing to minimize the total energy. In contrast to earlier usage of Potts encoding and MF annealing (Peterson & Söderberg, 1989; Gislén et al., 1992; Lagerholm et al., 1997), where global objective functions are minimized, here each node minimizes its own local energy. For the case of a single request, the Potts MF approach reduces in the zero-temperature limit to the BF algorithm; hence our approach contains this standard algorithm as a special case.

For each request, the Potts MF network (Peterson & Söderberg, 1989) defines an (inverted) “fuzzy” spanning tree (a subgraph without loops connecting all nodes) rooted at the end node. In order to project out the part defining the (fuzzy) path from the start node, and to keep track of the paths in general, we utilize a propagator matrix formalism (following Lagerholm et al., 1997). The computation of the propagator requires matrix inversion; fortunately, this can be done using an iterative procedure with a low computational cost.

As in the previously considered airline crew scheduling problem (Lagerholm et al., 1997), proper preprocessing is employed to identify independent subproblems in order to reduce the problem complexity. Despite the existence of global constraints, the implementation of the approach is truly local: when updating the MF equations for a particular node, only information residing at neighboring nodes is needed.

The approach is gauged by an exact branch-and-bound (BB) algorithm (for smaller problems) and by two BF-inspired heuristics (for larger problems), on a set of synthetic but challenging test problems, showing an excellent performance of the Potts MF approach, with a CPU consumption per request scaling merely like $N N_L$, where $N$ is the number of nodes and $N_L$ the number of links in the network. The method is also very robust with respect to parameters.

Another novel method (Boyan & Littman, 1994) has been proposed for the MSPP. In contrast to our approach, it is aimed at dynamical problems. It is also rooted in the BF algorithm, but in the static limit it is unrelated to our Potts algorithm. However, in this limit it reduces to the independent BF approach, which is used for comparisons in this work.
Figure 1: Example of a solution to a three-request problem. Dotted lines represent unused links, and solid lines represent links used by the requests (start nodes a1, a2, a3; end nodes b1, b2, b3).

2 The Multiple Shortest Path Problem

An MSPP is defined by specifying the following:

• A connected network of $N$ nodes and $N_L$ (bidirectional) links, corresponding to $2N_L$ arcs.
• For each arc $ij$, a cost (arc length) $d_{ij}$ and a capacity $C_{ij}$.
• A set of $N_R$ transmission requests $r$, each defined by a start node $a_r$ and an end node $b_r$.

The task is to assign to each request a connected loop-free path of arcs from the start node to the end node. This is to be done such that the total cost for those paths is minimized, without the load on any arc exceeding its capacity, with the load defined as the number of requests sharing it. A three-request problem example is shown in Figure 1.

Not all problems of this kind are solvable. A reliable algorithm should be able to recognize and signal a failure, so that proper measures can be taken.

3 The Bellman-Ford Algorithm in the Mean Field Language

Prior to dealing with the MSPP, we revisit the simpler SPP and demonstrate how the BF algorithm can be recast in a Potts MF language. This formulation will be the starting point for designing a Potts MF approach to MSPP.

In the SPP, there is only a single request, from $a$ to $b$, and the capacity constraints are irrelevant. The task is simply to find the shortest path from $a$ to $b$. In the BF algorithm (Bertsekas & Gallager, 1987) this is achieved by minimizing the path lengths $D_i$ from every node $i$ to $b$, by iterating

\[ D_i \rightarrow \min_j \left( d_{ij} + D_j \right), \qquad i \neq b, \tag{3.1} \]
and keeping track of the involved arcs $ij$. $D_b$ is fixed to zero by definition. The resulting solution defines a spanning tree rooted at $b$. In particular, $D_a$
is determined, and the minimal path from $a$ to $b$ is easily extracted from the spanning tree. If no arc exists from node $i$ to $j$, $d_{ij}$ could formally be defined to be infinite; in practice it is more convenient to restrict $j$ in equation 3.1 to the actual neighbors of $i$, reachable via an arc from $i$. Equation 3.1 can be rewritten as

\[ D_i = \sum_j v_{ij} E_{ij} \equiv \sum_j v_{ij} \left( d_{ij} + D_j \right), \tag{3.2} \]
in terms of a Potts spin $v_i$ for every node $i \neq b$, with components $v_{ij}$ taking the value 1 for the $j$ with the smallest local energy $E_{ij}$, and 0 for the others (winner takes all). Note the distinct philosophy here: each node $i$ minimizes its own local energy $D_i = \min_j E_{ij}$, rather than all nodes striving to minimize some global objective function. An MF version of equation 3.2 is obtained by using for $v_i$ its thermal average in the MF approximation, defined by

\[ v_{ij} = \frac{e^{-E_{ij}/T}}{\sum_k e^{-E_{ik}/T}}, \tag{3.3} \]
where $j$ and $k$ are neighbors of $i$ and $T$ is an artificial temperature. Note that each Potts MF neuron $v_i$ obeys the normalization condition

\[ \sum_j v_{ij} = 1, \qquad i \neq b. \tag{3.4} \]
Thus, $v_{ij}$ can be interpreted as a probability for node $i$ to choose $j$ as a continuation node. At a nonzero temperature, iteration of equations 3.2 and 3.3 can be viewed as a fuzzy implementation of the BF algorithm, while in the $T \rightarrow 0$ limit, the neurons are forced on-shell,¹ that is, $v_{ij} \rightarrow 1$ (for the minimizing $j$) or 0 (for the rest), and proper BF is recovered. Given this obvious neural recast of the BF algorithm in terms of Potts neurons, it is somewhat surprising that nonexact neural approaches based on Ising spins have been advocated in the literature (Thomopolous, Zhang, & Wann, 1991).

¹ This notation originates from physics, where “on-shell” and “off-shell” denote whether particles are real or virtual, respectively.
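To make the recast concrete, the following minimal sketch (our own illustration, not code from the paper) iterates equations 3.2 and 3.3 on a toy graph; the function name, the default parameters, and the example graph are all assumptions.

```python
# Minimal sketch of the fuzzy Bellman-Ford iteration, equations 3.2 and 3.3.
import numpy as np

def fuzzy_bellman_ford(neighbors, d, b, T=0.05, sweeps=200):
    """neighbors[i]: nodes reachable from i via an arc; d[(i, j)]: arc cost.
    Returns soft distances D to the end node b and the Potts marginals v."""
    D = np.zeros(len(neighbors))
    v = {}
    for _ in range(sweeps):
        for i in neighbors:
            if i == b:
                continue  # b is a sink: D[b] is fixed to zero
            E = np.array([d[(i, j)] + D[j] for j in neighbors[i]])  # local energies
            w = np.exp(-(E - E.min()) / T)       # eq. 3.3, numerically stabilized
            v[i] = w / w.sum()                   # normalized Potts neuron (eq. 3.4)
            D[i] = v[i] @ E                      # eq. 3.2
    return D, v

# Toy 4-node graph: as T -> 0, D approaches the true shortest distances.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
d = {(0, 1): 1.0, (1, 0): 1.0, (0, 2): 2.5, (2, 0): 2.5,
     (1, 3): 1.0, (3, 1): 1.0, (2, 3): 0.5, (3, 2): 0.5}
D, v = fuzzy_bellman_ford(neighbors, d, b=3)
print(D)  # D[0] is close to 2.0 (path 0 -> 1 -> 3)
```

In the $T \rightarrow 0$ limit the softmax collapses to the hard minimum of equation 3.1, and the sketch reduces to ordinary Bellman-Ford.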
4 The Potts Mean-Field Approach to MSPP

The Potts MF formulation of the Bellman-Ford algorithm for SPP (equations 3.2 and 3.3) is a suitable starting point for approaching MSPP. We will stick with the philosophy inherited from BF of focusing on independent local energies, in contrast to what has become standard when using feedback neural networks for resource allocation problems. This represents a novel strategy.

Thus, we introduce a separate Potts system, $\{v^r_{ij}\}$, for each request $r$, with basic local energies $E_{ij}$ as before representing distances to the end node $b_r$. In addition, we will need energy terms $E^{\mathrm{load}}_{ij}$ for the load constraints, to be discussed later; this introduces an interaction among the Potts systems.

This formulation introduces the possibility of undesired loop formation, since forming a loop might induce less energy penalty than violating a load constraint. As will be discussed below, such loops can be suppressed by suitable penalty terms and by adding a possibility for each proper node to connect, by means of an artificial escape arc, to an artificial escape node, which for each request connects directly to the end node. This enables a “give-up” state for an unresolvable situation, signaled by some path containing the escape node. The cost for “giving up” must be larger than that for any legal path. Therefore, the cost of each escape arc is set to the sum of the costs of the proper arcs, while the corresponding capacity is chosen large enough to be able to host all the requests.

In order to terminate the path for a request $r$, its end node must be a sink for the corresponding Potts system. Consequently, there will be no Potts neuron $v^r_{b_r}$ associated with it.

In order to construct appropriate penalty terms, a propagator matrix will be used. This technique has proved to be a powerful tool in neural optimization for problems with a nontrivial topology (Lagerholm et al., 1997). In particular, it will be crucial for extracting properties of the fuzzy paths defined by the MF approach at finite $T$.

4.1 Path Extraction and the Propagator. The normalization condition (see equation 3.4) ensures that for each request $r$, precisely one continuation is chosen for each node except $b_r$, although for $T \neq 0$, it is fuzzily distributed over the available neighbors. On shell, the path from start to end node is trivial to extract: one follows the $v^r_{ij} = 1$ path starting from the start node. However, for $T \neq 0$ a more refined path extraction mechanism is needed. This is provided by a propagator matrix (Lagerholm et al., 1997) $P^r$ for each request $r$, defined by

\[ P^r_{ij} = \left[ \left( 1 - v^r \right)^{-1} \right]_{ij} = \delta_{ij} + v^r_{ij} + \sum_k v^r_{ik} v^r_{kj} + \sum_{kl} v^r_{ik} v^r_{kl} v^r_{lj} + \cdots. \tag{4.1} \]
For a graphical representation, see Figure 2. On shell, it is easy to see that $P^r_{ij}$ can be interpreted as the number of paths from $i$ to $j$ defined by the Potts neurons associated with $r$; similarly, the elements of the matrix square $(P^r)^2_{ij}$ are related to the number of arcs used in those paths.² In particular, we have in the absence of loops ($\Rightarrow P^r_{ii} \equiv 1$),

\[ P^r_{a_r i} = \begin{cases} 1, & \text{if node } i \text{ appears in the path } a_r \rightarrow b_r, \\ 0, & \text{otherwise,} \end{cases} \]

identifying the arcs used in the paths. Off shell, these interpretations are still valid in a probabilistic sense. Thus, a probabilistic measure of how much node $i$ participates in the path $a_r \rightarrow b_r$ is given (naively) by $P^r_{a_r i}$ (since $P^r_{i b_r} \equiv 1$).³ Dividing by $P^r_{ii}$ to correct for a possible normalization error due to a small loop contamination yields the path factor

\[ F^r_i \equiv \frac{P^r_{a_r i}}{P^r_{ii}} \leq 1, \tag{4.2} \]

which can be used to filter out the nodes of the fuzzy paths defined by the MF neurons.

² More precisely, the number of arcs is given by $P^2 - P$.
³ The path must eventually end in the sink $b_r$. Proof: Multiply equation 3.4 for $v^r$ by $P^r_{ki}$, sum over $i$, and use the matrix identity $P^r = 1 + P^r v^r$.

Figure 2: Graphical illustration of the expansion, equation 4.1, of the propagator $P^r$ with respect to path length. A dot represents a node $i$, and a line a single node-to-node step $v^r_{ij}$.

4.2 Load and Loop Control. Armed with the propagator formalism, we proceed to set up the penalty terms, to be added to the energies $E_{ij}$ corresponding to equation 3.2. In order to avoid complications from self-interactions in the local energies, independently of $T$, we follow Ohlsson, Peterson, and Söderberg (1993), and define the penalty terms based on analyzing the result of setting one component $v^r_{ij}$ of the neuron $v^r_i$ at a time to one, with the other components set to zero, as compared to a reference state with all components set to zero.

The total load $L_{ij}$ on an arc $ij$ is the sum of contributions from the different requests,

\[ L_{ij} = \sum_r L^r_{ij}. \tag{4.3} \]
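In code, the propagator bookkeeping of equations 4.1–4.3 (anticipating the per-request load contribution $L^r_{ij} = F^r_i v^r_{ij}$ of equation 4.4 just below) might look as follows. This is our own hedged sketch: the dense-matrix layout and all names are assumptions, with v[i, j] standing for $v^r_{ij}$ and the end-node row left as zeros so that $b_r$ acts as a sink.

```python
import numpy as np

def propagator(v):
    """P^r = (1 - v^r)^{-1}, equation 4.1."""
    return np.linalg.inv(np.eye(v.shape[0]) - v)

def path_factor(P, a):
    """F^r_i = P^r_{a_r i} / P^r_{ii} <= 1, equation 4.2."""
    return P[a, :] / np.diag(P)

def total_loads(v_list, a_list):
    """L_ij = sum_r L^r_ij with L^r_ij = F^r_i v^r_ij (equations 4.3, 4.4)."""
    L = np.zeros_like(v_list[0])
    for v, a in zip(v_list, a_list):
        F = path_factor(propagator(v), a)
        L += F[:, None] * v   # broadcast F^r_i over the continuation index j
    return L
```

A distributed implementation would instead maintain $P^r$ with the local row relaxation of equation 4.13; the full matrix inverse appears here only to mirror equation 4.1.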
The contribution from request $r$ can be expressed as

\[ L^r_{ij} = F^r_i v^r_{ij}, \tag{4.4} \]
where we have used the path factor of equation 4.2. The load constraints, $L_{ij} \leq C_{ij}$, define a set of inequality constraints. In the realm of feedback neural networks, such constraints have been successfully handled by means of step functions (Ohlsson et al., 1993). For a given request $r$, the overloading of the arc $ij$ due to the other requests is given by

\[ O(X) \equiv X \, \Theta(X), \tag{4.5} \]

where $\Theta(\cdot)$ is the Heaviside step function, and $X$ is given by

\[ X \equiv L_{ij} - L^r_{ij} - C_{ij}. \tag{4.6} \]

If the arc were also to be used by $r$, the overloading would increase to $O(X+1)$, and a suitable overloading penalty can be defined as the difference,

\[ E^{\mathrm{load}}_{ij} = O(X + 1) - O(X). \tag{4.7} \]
The number of loops introduced by connecting $i \rightarrow j$ can be expressed as

\[ Y \equiv \frac{P^r_{ji}}{P^r_{ii}} \leq 1, \tag{4.8} \]

and we choose as a loop-suppression term,

\[ E^{\mathrm{loop}}_{ij} = \frac{Y}{1 - Y}. \tag{4.9} \]
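The two penalties translate directly into code; the sketch below is our own paraphrase of equations 4.5–4.9, with hypothetical function and variable names.

```python
import numpy as np

def overload_penalty(L_ij, Lr_ij, C_ij):
    """E^load_ij = O(X + 1) - O(X), equations 4.5-4.7."""
    O = lambda X: X * (X > 0)       # O(X) = X * Theta(X), eq. 4.5
    X = L_ij - Lr_ij - C_ij         # overloading due to the other requests, eq. 4.6
    return O(X + 1.0) - O(X)        # extra overloading if r also uses the arc

def loop_penalty(P, i, j):
    """E^loop_ij = Y / (1 - Y) with Y = P^r_ji / P^r_ii, equations 4.8-4.9."""
    Y = P[j, i] / P[i, i]
    return Y / (1.0 - Y) if Y < 1.0 else np.inf   # diverges as a loop closes
```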
The generalization of the local energy in equation 3.2 to the multiple-request case now reads, for a particular request $r$,

\[ E_{ij} = d_{ij} + D^r_j + \alpha E^{\mathrm{load}}_{ij} + \gamma E^{\mathrm{loop}}_{ij}, \tag{4.10} \]

with the added terms based on equations 4.7 and 4.9. The resulting algorithm allows for a wide range of choices of the coefficients $\alpha$ and $\gamma$ without severely changing the performance.

4.3 Updating Equations. All neurons are repeatedly updated, with a slow annealing in $T$. For each request $r$ and each node $i$, the corresponding neuron $v^r_i$ is updated according to

\[ v^r_{ij} = \frac{e^{-E_{ij}/T}}{\sum_k e^{-E_{ik}/T}}, \tag{4.11} \]
with $E_{ij}$ given by equation 4.10. The corresponding cost $D^r_i$ from node $i$ to the end node is then updated, in the BF spirit, as

\[ D^r_i \rightarrow \sum_j v^r_{ij} E_{ij}. \tag{4.12} \]

In principle, the corresponding update of the propagator could be done using an exact incremental matrix inversion scheme like Sherman-Morrison (Press, Flannery, Teukolsky, & Vetterling, 1986). We prefer, though, to let local changes propagate through the network, in analogy to the update of $D^r_i$. Thus, only the $i$th row of $P^r$ is updated:

\[ P^r_{im} \rightarrow \delta_{im} + \sum_j v^r_{ij} P^r_{jm}, \quad \text{for all } m. \tag{4.13} \]
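One node update of the annealed sweep (equations 4.10–4.13) might look as follows; this is a hedged sketch in our own notation, not the authors' implementation, and it assumes the penalty terms have been precomputed into E_extra[j] = alpha·E^load_ij + gamma·E^loop_ij.

```python
import numpy as np

def update_node(i, neighbors, d, D, P, v, E_extra, T):
    """Mutates v, D, and row i of the propagator P in place."""
    E = np.array([d[(i, j)] + D[j] + E_extra[j] for j in neighbors[i]])  # eq. 4.10
    w = np.exp(-(E - E.min()) / T)                 # stabilized Boltzmann weights
    v[i] = w / w.sum()                             # eq. 4.11
    D[i] = v[i] @ E                                # eq. 4.12
    row = np.zeros(P.shape[0])                     # eq. 4.13: P_im -> delta_im + ...
    row[i] = 1.0
    for vij, j in zip(v[i], neighbors[i]):
        row += vij * P[j, :]
    P[i, :] = row
```

The appendix wraps such sweeps in an annealing loop, $T \rightarrow kT$ with $k = 0.9$, stopping when the saturation $\Sigma$ of equation A.1 approaches 1 or $T$ falls below a final temperature.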
This row update (equation 4.13) gives a convergence toward the exact inverse, which turns out to be good enough. The advantage of this method is twofold: it is faster, and all information needed is local to the relevant node $i$ and its neighbors $j$ (assuming each node to keep track of its own row of $P^r$). Details of the algorithmic steps and the initialization can be found in the appendix.

4.4 Test Problems and Explorations. In order to test the Potts MF method, we have generated a set of challenging synthetic problems. The most important parameters governing the difficulty of a problem, apart from network size and connectivity, are the number of requests and the average arc capacity. In cases where all the arcs are able to host all the requests, $C_{ij} \geq N_R$, the problem is separable into an independent SPP for each request, which can be solved using the BF algorithm. We have therefore chosen to work with rather tight arc capacities.

For each problem, we generate a random connected network, where every node has at least one path to all other nodes. To that end, all nodes are first connected in a random spanning tree. Additional links (creating loops) are then randomly placed. Every arc is given a random integer capacity between 1 and 6 and a random cost in the interval (0, 1). The desired number of random requests is generated, in terms of start and end nodes. An example of a generated test problem is shown in Figure 3.

This procedure does not automatically yield a solvable problem, where all requests can be fulfilled simultaneously without violating any constraint. In principle, solvability could be built into the problem generator, but here we adopt another strategy. We attempt to solve small problems (see Table 1) exactly with a BB algorithm; those not solved within a certain amount of CPU time are disregarded. In principle, this method could introduce a bias toward simple problems. However, for the smaller problem sizes considered, only a tiny fraction of the problem candidates are “timed out,” except for those defining the last row in Table 1, where about one-third are disregarded.
Figure 3: An example network with 13 nodes and 17 links. The solid lines define the initial spanning tree, and the dotted lines represent the additional links.
Table 1: Results for Small Problems.

Nodes  Links  Requests  ⟨S⟩  Legal_MF  ⟨Δ_BB⟩  ⟨CPU time⟩
  5     10       5      14   100.0%    0.003      0.1
  5     10      10      28    99.9%    0.002      0.2
 10     15      10      28   100.0%    0.004      0.4
 10     20      10      48   100.0%    0.003      0.5
 15     20      10      27    99.8%    0.03       0.5
 15     20      15      40    99.9%    0.06       0.7

Note: 1000 problems of each size are probed. Only legal entries are used when calculating the averages ⟨Δ_BB⟩ and ⟨CPU time⟩. ⟨CPU time⟩ refers to the MF Potts approach using a DEC Alpha 2000, and is given in seconds. Typical times for the BB method are around 600 seconds.
For larger problems (see Table 2), BB cannot be used, and we do not know in general whether they are solvable.
Table 2: Results for Large Problems.

  N   N_L  N_R  Reduction  Legal_IBF  ⟨Δ_IBF⟩  Legal_SBF  ⟨Δ_SBF⟩  Legal_MF
 50   200   20    1.00        4%       0.000     100%     −0.019     100%
 80   200   20    0.97        8%       0.000      98%     −0.021      98%
100   200   20    0.91        7%       0.000      96%     −0.015      96%
100   200   50    0.90        0%        —         82%     −0.043      82%
100   200  100    0.91        0%        —         22%     −0.092      30%
100   300  100    0.99        0%        —         83%     −0.066      84%
100   400  100    1.00        0%        —         99%     −0.063     100%

Note: 100 problems of each kind are probed. The averages ⟨Δ_X⟩ are for legal entries. The reduction is defined as $N_R N N_L$ after the decomposition divided by its value before the decomposition. Typical CPU times for MF are O(10) minutes on a DEC Alpha 2000.
Prior to attacking a problem, a decomposition into independent subproblems is attempted, to reduce complexity. The decomposition uses the fact that a network typically consists of subnetworks, connected to each other by single nodes. The complete problem can then be solved by independently solving the implied subproblem for each subnetwork. The CPU demand of the nonreduced MF approach scales like $N_R N N_L$, while the required computer time for a decomposed problem is dominated by the largest subproblem. In Table 2 we indicate the effect of the decomposition on the CPU demand for the large test problems.

The performance of the MF Potts approach is probed by measuring the relative excess path length as compared to a reference algorithm $X$,

\[ \Delta_X = \frac{D_{\mathrm{MF}} - D_X}{D_X}, \tag{4.14} \]
where $D_{\mathrm{MF}}$ and $D_X$ are the total path costs resulting from MF and the reference algorithm, respectively.

4.5 Comparison with an Exact Method. For the smaller problems, the BB algorithm is used as a reference algorithm. The results are displayed in Table 1, together with characteristics of the generated test problems. As a measure of the complexity of a problem, we use the entropy, $S$, defined as the logarithm of the total number of possible configurations, disregarding load constraints. Table 1 indicates excellent performance of the MF Potts approach, with respect to giving rise to legal solutions of good quality, with a very modest computational demand.

4.6 Comparing with Other Approximative Methods. The computational demand of the BB algorithm grows rapidly. For problems larger than those of Table 1, we have chosen as reference algorithms two BF-inspired heuristics:

• Independent Bellman-Ford (IBF). Each request is taken to define an independent SPP, which is solved using BF. Thus, the load constraints are neglected, and the resulting solution is not necessarily legal. If legal, it gives the global energy minimum.
• Sequential Bellman-Ford (SBF). The requests are served in a sequential fashion using the BF algorithm. When the maximum capacity of an arc is reached, its use is prohibited for the remaining requests. It does not always find a solution, even to a solvable problem, and when it does, it is not necessarily the global minimum. A sketch of this heuristic appears below.

The early ARPANET (Bertsekas & Gallager, 1987) (Internet) protocol was based on an algorithm similar to IBF (but with a dynamically updated arc cost based on traffic congestion). The results are displayed in Table 2.
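For concreteness, here is our own reconstruction of the SBF heuristic just described; it uses a Dijkstra-style single-source step in place of BF relaxation, which is equivalent here because arc costs are positive, and all names are assumptions.

```python
import heapq

def sbf(arcs, requests):
    """arcs: {(i, j): (cost, capacity)}; requests: list of (a, b) pairs.
    Returns one path per request, or None if some request cannot be routed."""
    load = {ij: 0 for ij in arcs}
    paths = []
    for a, b in requests:
        dist, prev, heap = {a: 0.0}, {}, [(0.0, a)]
        while heap:  # shortest path restricted to non-saturated arcs
            du, u = heapq.heappop(heap)
            if du > dist.get(u, float("inf")):
                continue
            for (i, j), (c, cap) in arcs.items():
                if i == u and load[(i, j)] < cap and du + c < dist.get(j, float("inf")):
                    dist[j], prev[j] = du + c, i
                    heapq.heappush(heap, (du + c, j))
        if b not in dist:
            return None          # SBF can fail even on a solvable problem
        path, u = [b], b
        while u != a:
            u = prev[u]
            path.append(u)
        path.reverse()
        for ij in zip(path, path[1:]):
            load[ij] += 1        # saturated arcs are barred for later requests
        paths.append(path)
    return paths
```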
In the few cases where the global minimum is known (from IBF), it is found by the MF algorithm. In the other cases, we conclude from Table 2 that MF on average performs better than SBF. When MF cannot find a solution, this is signaled by requests routed via the escape node.⁴ In all such cases, IBF and SBF have failed as well.

5 Summary and Outlook

We have developed a Potts MF neural network algorithm for finding approximate solutions to the MSPP. The starting point is a Potts MF recast of the exact Bellman-Ford algorithm for the simpler single SPP. This approach is then extended to the MSPP by using several Potts networks, one for each request. Complications of a topological nature are successfully handled by means of a convenient propagator approach, which is crucial for the following issues:

• The MF approach yields at $T \neq 0$ fuzzy spanning trees, from which the propagator is used to extract the loads corresponding to the fuzzy paths, needed for the interaction between different requests.
• Loops are suppressed by energy terms based on the propagator.

To open up escape paths for unresolvable situations, an auxiliary arc to an escape node is introduced for each proper node. The method is local in that only information available from neighboring nodes is required for the updates. This attractive feature, inherited from the Bellman-Ford algorithm, facilitates a distributed implementation.

The computational demand of the method is modest. The CPU time scales as $N_R N N_L$. With fixed connectivity, this corresponds to $N_R N^2$, whereas for the worst case of full connectivity, it yields $N_R N^3$. The performance of the algorithm is tested on a set of challenging synthetic problems, by comparing to exact results from a BB method for smaller problems and to results from independent and sequential Bellman-Ford approaches for larger problems. The comparisons show that the Potts MF approach with very few exceptions yields very good approximate solutions at a modest computer time consumption. The method is now being generalized to other routing problems (Häkkinen, Lagerholm, Peterson, & Söderberg, 1998).
⁴ This could be used in dynamical cases as a decision tool for which requests should be put on hold.
Appendix: The Potts MF Algorithm

A.1 Initialization. The initial temperature $T_0$ is first set to $T_0 = 50$. If the saturation $\Sigma$,

\[ \Sigma \equiv \frac{1}{N_R (N - 1)} \sum_{r,\, i \neq b_r} \mathbf{v}_i^2, \tag{A.1} \]
has changed more than 10% after all neurons have been updated once, then the system is reinitialized, with $T_0 \rightarrow 2T_0$. For all nodes (except the end and escape nodes), the corresponding Potts neurons are initialized in accordance with the high-temperature limit, that is,

\[ v^r_{i,j} = 1/n_i, \tag{A.2} \]

for all $n_i$ neighbors $j$ (including the escape node) of $i$. $P^r_{ij}$ and $D^r_i$ are initialized consistently with equation A.2.

A.2 Iteration. Until $T \leq T_f$ or $\Sigma \geq \Sigma_f$, do:

• For every request $r$ do:
  1. For every node $i$ except $b_r$ and the escape node:
     (a) Update $v^r_i$ (see equations 3.3 and 4.10).
     (b) Update $D^r_i$ (see equation 4.12).
     (c) Update $P^r_i$ (see equation 4.13).
  2. Update $L_{ij}$.
• Decrease the temperature: $T = kT$.

We have used the parameter values $k = 0.9$, $T_f = 0.0001$, and $\Sigma_f = 0.99999$. For the energy coefficients in equation 4.10, we have consistently used $\alpha = 1$ and $\gamma = 5$.

References

Bellman, R. (1958). On a routing problem. Quarterly of Applied Mathematics, 16, 87.
Bertsekas, D., & Gallager, R. (1987). Data networks. Englewood Cliffs, NJ: Prentice-Hall.
Boyan, J. A., & Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.
Gislén, L., Peterson, C., & Söderberg, B. (1992). Complex scheduling with Potts neural networks. Neural Computation, 4, 805.
Häkkinen, J., Lagerholm, M., Peterson, C., & Söderberg, B. (1998). In preparation.
Lagerholm, M., Peterson, C., & Söderberg, B. (1997). Airline crew scheduling with Potts neurons. Neural Computation, 9, 1627.
Ohlsson, M., Peterson, C., & Söderberg, B. (1993). Neural networks for optimization problems with inequality constraints—the knapsack problem. Neural Computation, 5, 331.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Thomopolous, S. C. A., Zhang, L., & Wann, C. D. (1991). Neural network implementation of the shortest path algorithm for traffic routing in communication networks. In Proceedings of the IEEE International Joint Conference on Neural Networks. Singapore.

Received March 14, 1997; accepted December 10, 1997.
VIEW
Communicated by Joshua Alspector and John Wyatt
Analog Versus Digital: Extrapolating from Electronics to Neurobiology
Rahul Sarpeshkar
Department of Biological Computation, Bell Laboratories, Murray Hill, NJ 07974, U.S.A.
We review the pros and cons of analog and digital computation. We propose that computation that is most efficient in its use of resources is neither analog computation nor digital computation but, rather, a mixture of the two forms. For maximum efficiency, the information and information-processing resources of the hybrid form must be distributed over many wires, with an optimal signal-to-noise ratio per wire. Our results suggest that it is likely that the brain computes in a hybrid fashion and that an underappreciated and important reason for the efficiency of the human brain, which consumes only 12 W, is the hybrid and distributed nature of its architecture.

1 Introduction

We estimate that the human brain performs on the order of $3.6 \times 10^{15}$ synaptic operations per second (appendix A.1 in Sarpeshkar, 1997). From measurements of cerebral blood flow and oxygen consumption, it is known that the brain consumes only 12 W (appendix A.2 in Sarpeshkar, 1997). Its efficiency of computation is thus about $3 \times 10^{14}$ operations per joule. The human brain is capable of doing tremendously complex computation in real time despite the slow and noisy components in our heads and bodies. An extremely fast microprocessor such as the DEC Alpha 21164 performs about $255 \times 10^6$ floating-point operations per second and consumes 40 W.¹ Its efficiency is thus about $6.25 \times 10^6$ operations per joule. It is incapable of solving even relatively simple behavioral tasks in real time in spite of its blazingly fast and precise transistors.

If we compare the computing efficiency of the human brain with that of a digital microprocessor, we observe that the brain is at least seven
¹ On the specfp92 Ear Program, which performs auditory computations similar to those in the human ear, the DEC 21164 running on an Alpha Server 8200 5/300 is 1275 times as fast as a VAX 11/780, which would run at about 0.2 MFLOPS for our computation. Thus, we estimate that it is equivalent to about 1275 × 0.2 = 255 MFLOPS. These numbers are for 1995.
Neural Computation 10, 1601–1638 (1998) © 1998 Massachusetts Institute of Technology
orders of magnitude more efficient.² Mead was the first scientist to point out the great discrepancy in the computational efficiencies of neurobiology and electronics (Mead, 1990). He also pioneered the field of neuromorphic computation—electronic computation inspired by and similar to that performed by neural systems (Mead, 1989).

² It may be argued that our comparisons have not been fair, since the floating-point computations that a microprocessor performs are more complex than are those that a synapse performs, and they are also more precise. However, in addition to multiplication, synaptic computations involve temporal filtering and adaptation, which are fairly complex operations in digital computation. We have also neglected several complex spatiotemporal correlations and additions that are performed in the dendrite of a neuron. Thus, for simplicity, we have chosen to compare just the efficiency of an “elementary operation” in digital computation and in neurobiology. There are so many orders of magnitude of discrepancy between neurobiology and electronics that such concerns will not alter our conclusions.

How is efficient and complex computation with noisy components achieved in neurobiological systems? Mead attributed the enormous efficiency of neurobiological systems to their clever exploitation of the physics of the medium that they were built in, to their local wiring strategies, and to their enormous capabilities to adapt and learn. In this article we will focus on the trade-offs involved in using physics to do computation.

The three physical resources that a machine uses to perform its computation are time, space, and energy. Computer scientists have traditionally treated energy as a free resource and have focused mostly on time (the number of clock cycles required for the computation to terminate) and space (the amount of memory needed or the number of devices needed to perform the computation). However, energy cannot be treated as a free resource when we are interested in systems of vast complexity, such as the brain. With the current efficiencies of digital computation, it would take tens of megawatts to build a system like the brain, assuming we could do so at all. If we wanted to make this system portable as well, energy constraints would be very important indeed. Energy has clearly been an extremely important resource in natural evolution. (For an interesting discussion on energy constraints in biology and evolution, see Allman, 1990, and Aiello & Wheeler, 1995.) On a smaller scale, energy constraints are important in all portable applications, such as radio telephony, laptop computing, and hearing aids.

Biological systems typically compute constantly, rather than episodically, with the resource of time fixed by the computational requirements of the task. For example, for a sensorimotor task, we may need to respond within a few hundred milliseconds, whereas for the task of hearing a 1 kHz tone, we will need to respond to cycle-by-cycle variations on a 1 msec time scale. Thus, throughout this article, we will assume that the bandwidth of the computational task is fixed and that the resource of time is not a degree of freedom (it will be a parameter in our equations but not a variable). The other two resources (energy and space) will be degrees of freedom; we shall
use the more natural resources of power (energy per unit time) and area (the spatial resource in a two-dimensional substrate such as nerve membrane or in VLSI) as our degrees of freedom.

Suppose that we are given two systems, A and B, that do a computation at the same bandwidth (in Hz), at the same output information rate (in bits/sec), and with the same input. A is more efficient than B if it consumes less power (and/or area) in doing this computation. In this article, we shall be interested in understanding the reasons for the efficiency of one system over another. In particular, we will study the reasons for differences in efficiency between analog and digital systems.

Electronic systems are far simpler to understand and analyze than are biological systems. So in sections 2 and 3, we begin by analyzing the differences between analog and digital electronic systems. In section 4, we use the insights gained by this analysis to outline how efficient, precise computation can be achieved by hybrid and distributed electronic architectures. In section 5 we extrapolate our ideas for electronic systems to neurobiological systems. Section 6 summarizes the article.

2 Analog Versus Digital: The Intuitive Picture

Electronic systems operate with continuous signals (CS) or discrete signals (DS), and in continuous time (CT) or discrete time (DT). Thus, there are four classes of systems: CSCT, CSDT, DSCT, DSDT (Hosticka, 1985). Figure 1 shows examples of systems, either electronic or biological, in each class. Typically, CS systems are referred to as analog, and DS systems are referred to as digital, irrespective of their representation in the time domain. In this article, we first concentrate on analog systems that are continuous in both the signal and time domains (CSCT), and on digital systems that are discrete in both the signal and time domains (DSDT). Such systems are the most common examples of analog and digital systems, respectively, and are also the most disparate from each other. Later, in section 4, we discuss why an alternation between the CSCT and DSCT domains can be advantageous over operation in the DSDT or CSCT domain alone. We shall ignore the CSDT domain in this article because its relevance to neurobiology is generally believed to be small.

Following is a comparison of CSCT and DSDT systems from a signal-processing viewpoint, emphasizing topics of importance in this article. It is by no means a comprehensive and exhaustive list of all the differences between analog and digital systems. For example, we completely omit all discussion of programmability and learning in these systems, although these issues are very important; also, we omit all discussion of temporal aliasing, which is an important source of distortion in discrete systems.
Figure 1: The four types of systems. The figure shows examples of electronic and biological systems that operate with continuous or discrete signals (CS or DS) and in continuous or discrete time (CT or DT). Analog systems that are continuous in both the signal and time domains (CSCT) and digital systems that are discrete in both the signal and time domains (DSDT) have been boxed in the figure. SCF stands for switched capacitor filter; CCD stands for charge coupled device.
1. Analog: Compute with continuous values of physical variables in some range, typically voltages between the lower and upper power-supply voltages.
   Digital: Compute with discrete values of physical variables, typically the lower and upper power-supply voltages, denoted by 0 and 1, respectively.

2. Analog: Primitives of computation arise from the physics of the computing devices: physical relations of transistors, capacitors, resistors, floating-gate devices, Kirchhoff's current and voltage laws, and so forth. The use of these primitives is an art form and does not lend itself easily to automation. The amount of computation squeezed out of a single transistor is high.
   Digital: Primitives of computation arise from the mathematics of boolean logic: logical relations like AND, OR, NOT, NAND, and XOR. The use of these primitives is a science and lends itself easily to automation. The transistor is used as a switch, and the amount of computation squeezed out of a single transistor is low.

3. Analog: One wire represents many bits of information at a given time.
   Digital: One wire represents 1 bit of information at a given time.

4. Analog: Computation is offset prone since it is sensitive to mismatches in the parameters of the physical devices. The degradation in performance is graceful.
   Digital: Computation is not offset prone since it is insensitive to mismatches in the parameters of the physical devices. However, a single bit error can result in catastrophic failure.

5. Analog: Noise is due to thermal fluctuations in physical devices.
   Digital: Noise is due to round-off error.

6. Analog: Signal is not restored at each stage of the computation.
   Digital: Signal is restored to 1 or 0 at each stage of the computation.

7. Analog: In a cascade of analog stages, noise starts to accumulate. Thus, complex systems with many stages are difficult to build.
   Digital: Round-off error does not accumulate significantly for many computations. Thus, complex systems with many stages are easy to build.
2.1 Physicality: Advantage Analog. Items 1 through 3 show that analog computation can be far more efficient than digital computation because of analog computation's repertoire of rich primitives. For example, addition of two parallel 8-bit numbers takes one wire in analog circuits (using Kirchhoff's current law), whereas it takes about 240 transistors in static CMOS digital circuits. The latter number is for a cascade of 8 full adders. Similarly, an 8-bit multiplication of two currents in analog computation takes 4 to 8 transistors, whereas a parallel 8-bit multiply in digital computation takes approximately 3000 transistors. Although other digital implementations could make the comparisons seem less stark, the point here is simply that exploiting physics to do computation can be powerful. The advantage of an analog machine over a digital machine is especially great when there is a straightforward mapping between the operations needed in the computation and the primitives of the technology.
For large-scale systems, as in the implementation of silicon cochleas (Sarpeshkar, Lyon, & Mead, 1998), depending on the nature of the digital implementation, the advantage can range from a factor of 300 to $10^5$ in power consumption.

Because the number of devices required to perform a computation is greater in digital systems, there is more wiring and communication overhead. The presence of more devices and more communication overhead causes digital circuits to have typically higher area consumption than that of analog circuits. The switching energy dissipation due to the large number of devices and the communication overhead also causes the power consumption to be higher in digital circuits. If the number of devices switching per clock cycle is $N$, the clock frequency is $f$, the average load capacitance that a device has to drive is $C$, and the power-supply voltage is $V_{DD}$, then the power consumption $P_D$ of digital circuits is given by the simple formula (Rabaey, 1996),

\[ P_D = N f C V_{DD}^2. \tag{2.1} \]
Unlike digital CMOS circuits, whose power dissipation occurs only during switching and is entirely dynamic, many analog circuits have standby or static power dissipation and little or no dynamic power dissipation.³ Thus their power dissipation is given by the simple formula,

\[ P_A = N V_{DD} I, \tag{2.2} \]
where $N$ is the number of computational stages, $V_{DD}$ is the power-supply voltage, and $I$ is the average bias current flowing through each computational stage.

We can make digital computation more power efficient by using architectures that operate on a slow-and-parallel paradigm. Such architectures conserve power by allowing the use of lower-clock-frequency and lower-supply-voltage operation, although they require increased area consumption (Chandrakasan, Sheng, & Brodersen, 1992). Bit-serial digital implementations are area efficient because they use time multiplexing to perform several computations on the same circuit (Denyer & Renshaw, 1985). The rapid evolution of digital technology has shrunk the efficiency gap between analog and digital computation. However, the inefficiency of ignoring the physical computational primitives inherent in the technology and the inefficiency of encoding only 1 bit per wire are always present in digital computation. Consequently, analog computation still retains its advantage.
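As a back-of-the-envelope illustration of equations 2.1 and 2.2, the sketch below plugs in some component values; these numbers are arbitrary assumptions for illustration, not figures from the text.

```python
def digital_power(N, f, C, Vdd):
    """P_D = N f C Vdd^2 (eq. 2.1): dynamic switching power."""
    return N * f * C * Vdd**2

def analog_power(N, Vdd, I):
    """P_A = N Vdd I (eq. 2.2): static bias power."""
    return N * Vdd * I

# 10^5 devices switching at 100 MHz into 50 fF loads at 3.3 V, versus
# 100 analog stages biased at 1 uA from the same supply:
print(digital_power(1e5, 100e6, 50e-15, 3.3))  # about 5.4 W
print(analog_power(100, 3.3, 1e-6))            # about 0.33 mW
```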
³ Of course, class AB analog systems have dynamic power dissipation, but we are focusing only on general trends.
2.2 Noise and Offset: Advantage Digital. Although the use of physics made analog systems much more efficient than digital systems, items 4 through 7 reveal that the very same physics causes analog systems to be much more sensitive to noise and offset than digital systems. The use of continuous signal variables precludes analog systems from having any discrete attractor state to which they can be restored. Thus, for a sufficiently complex computation, the noise accumulation in analog systems becomes severe, not enough precision can be maintained at the output of the system, and analog systems clearly emerge as the losers.

Adaptation can help to compensate for offset in analog systems. However, performance is still ultimately limited by residual offsets due to the finite loop gains of the compensating circuits and by the offsets introduced by the compensating circuits themselves. If the compensation of offset is done periodically or continuously, such that the offset remains bounded throughout the computation, then the problem of offsets may be alleviated in analog systems. However, offset compensation is achieved at the cost of increased complexity, area, or power consumption; also, care must be taken to ensure that the feedback loops do not cause unwanted dynamics due to interactions with the rest of the analog system.

We can attenuate noise if we are willing to spend a large amount of power (and/or area) resources. However, as we shall show in section 3, by this point a digital solution would be more efficient than an analog solution. Parasitic capacitances and resistances in physical devices set a lower bound on the achievable noise floor in practical analog systems.

3 Analog Versus Digital: The Quantitative Picture

In this section we quantify the intuitive picture of section 2. We need to have an understanding of what causes noise in the devices with which we compute and how the noise accumulation from the various devices in a system degrades the output signal-to-noise ratio of an analog system.

3.1 Noise in MOS Transistors. We usually treat current as though it is the flow of a continuous fluid, although it is the flow of discrete charged electrons. Due to thermal fluctuations, these electrons have random, diffusive motions that are uncoordinated with one another. These incoherent motions give rise to shot-noise currents and cause white noise in the device. The noise is called white because its power spectrum is flat. Intuitively, by simple $\sqrt{N}$ law-of-large-numbers arguments, we might expect that shot noise would be less important at larger current levels because we average over the motions of more electrons per unit time. This intuition is indeed borne out. (For further details of noise in transistors, see Sarpeshkar, Delbrück, & Mead, 1993, and Sarpeshkar, 1997.) White noise is fundamental and is present in all physical devices at room temperature. The input-referred white noise of
an MOS transistor is given by

\[ v_n^2 = \frac{K_w(p)}{I^p} \, \Delta f, \tag{3.1} \]
where $p = 1.0$ in the subthreshold region of operation of the MOS transistor, and $p = 0.5$ in the above-threshold region of operation of the MOS transistor; $I$ is the DC current through the transistor; $v_n^2$ is the expected value of the square of a band-limited white-noise voltage signal, applied between the transistor gate and source; $\Delta f = f_h - f_l$ is the bandwidth of operation, with $f_h$ and $f_l$ being, respectively, the highest and lowest frequencies of operation; the technology-dependent parameter $K_w(p)$ increases with temperature and with thick gate oxides, and is given by

\[ K_w(1.0) = \frac{4kT U_T}{2\kappa^2} \tag{3.2} \]
in the subthreshold regime, and by

\[ K_w(0.5) = \frac{4kT \, (2/3)}{\sqrt{2 \mu C_{ox} \frac{W}{L}}} \tag{3.3} \]
in the above-threshold regime. The parameter $\kappa$ is the subthreshold exponential coefficient; $kT$ is a unit of thermal energy; $U_T = kT/q$ is the thermal voltage, where $q$ is the charge on the electron; $\mu$ is the mobility of the electron; $C_{ox}$ is the oxide capacitance per unit area; and $W$ and $L$ are the width and length of the transistor, respectively. Note that $K_w$ is independent of transistor geometry in the subthreshold regime, but is dependent on the transistor geometry in the above-threshold regime. The parameter $K_w(p)$ is an important parameter of MOS technology.

Another kind of noise in the transistor is called $1/f$ noise because its power spectrum varies inversely with the frequency. It is widely believed that this form of noise arises from electrons in the channel going into and out of surface states, and into and out of impurities or defect traps in the gate oxide of the transistor. It is known that the mean-square $1/f$ noise voltage at the gate input of the transistor, $v_{nf}^2$, scales inversely with the area of the transistor $A = WL$. The noise is approximately independent of the current flowing through the transistor,

\[ v_{nf}^2 = \frac{K_f}{A} \int_{f_l}^{f_h} \frac{df}{f} = \frac{K_f}{A} \ln\left( \frac{f_h}{f_l} \right). \tag{3.4} \]
The parameter $K_f$ is given by

\[ K_f = \frac{B}{C_{ox}}, \tag{3.5} \]
where $B$ is a measure of the number of surface states, impurities, or defects in the gate oxide of the transistor. The electronic fluctuations just described dynamically modulate the surface potential and thus the threshold voltage of the transistor. Hence, $1/f$ noise can be viewed as noise due to a dynamically varying threshold voltage. Since the current in a transistor depends on the difference between the gate voltage and its threshold voltage, independent of where the transistor is operating, the input-referred $1/f$ noise is independent of current. The larger the area of the transistor, the greater the oxide capacitance of the transistor, and the smaller the effect of any one fluctuating electronic charge on the transistor's threshold voltage. However, since the trap and defect densities are approximately constant, the larger the area of the transistor, the greater the number of fluctuating charges. The increased capacitance effect reduces the noise power like $1/A^2$, and the increased total-charge effect increases the noise power like $A$, such that the input-referred noise scales like $1/A$.

The parameter $B$ also determines the magnitude of typical offsets in MOS technology. Offsets between transistors are mainly due to mismatches in threshold voltage caused by charges in impurities, surface states, defect traps, and so on. By applying the reasoning of the previous paragraph, we can show that offsets scale inversely with the area of the transistor as well. Thus, the R.H.S. of equation 3.4, which models the magnitude of $1/f$ noise in MOS technology, also models the magnitude of the typical offsets in this technology. Actually, the total $1/f$ noise would be affected by $f_l$ and $f_h$, but the offsets would not be. So to model offsets, we should add another term proportional to $K_f$ but independent of $f_l$ and $f_h$. However, this added complication neither affects nor adds to our conclusions. Thus, we model $1/f$ noise and offsets with one term. For similar reasons, we do not discuss other less important sources of offset, such as geometric mismatches, which scale like $1/L$ or $1/W$ or some function of $L$ and $W$. Adaptation may help lower the effective value of $K_f$ in a circuit, but it cannot make it zero. Area is expended in the adaptation circuitry and in improving the residual offsets after adaptation.

Thus, we observe that the noise and offset (or $1/f$ noise) in an MOS transistor decrease with an expenditure of power and area resources in the transistor, respectively. The total input-referred noise is given by

\[ v_n^2 = \frac{K_w(p)}{I^p} \, \Delta f + \frac{K_f}{A} \ln( f_h / f_l ). \tag{3.6} \]
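As a sketch, equation 3.6 translates directly into code; the parameter values below are placeholder assumptions, not measured technology constants.

```python
import numpy as np

def input_referred_noise(I, A, Kw, Kf, fl, fh, p=1.0):
    """v_n^2 = Kw(p)/I^p * (fh - fl) + Kf/A * ln(fh/fl), equation 3.6.
    p = 1.0 in subthreshold, p = 0.5 above threshold."""
    white = Kw / I**p * (fh - fl)        # thermal part: shrinks with current
    flicker = Kf / A * np.log(fh / fl)   # 1/f (offset-like) part: shrinks with area
    return white + flicker

# Doubling the bias current halves the subthreshold white-noise term;
# doubling the gate area halves the 1/f term.
v2 = input_referred_noise(I=1e-7, A=4e-12, Kw=1e-25, Kf=1e-24, fl=1.0, fh=1e4)
print(np.sqrt(v2))  # rms input-referred noise (toy units)
```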
We call such an equation a noise resource equation for the transistor. In any technology, each device will have its own noise resource equation that
illustrates how the noise in the device decreases with an increase in the resources consumed by the device. In this case, we consume the resource of power (current) to reduce thermal noise, and the resource of area to reduce $1/f$ noise (or offset). In general, in any technology, by similar law-of-large-numbers arguments, the thermal noise reduces with power consumption, and the offsets reduce with the increased consumption of some spatial resource like length, area, or volume.

The resource of time is implicitly represented in equation 3.6 as the $\Delta f$ and $\ln(f_h/f_l)$ variables. A small bandwidth ($f_l$ and $f_h$ near each other) implies that we have a lot of time at our disposal, which we may trade off for lower noise or lower power/area consumption. Thus, equation 3.6 also captures trade-offs between maintaining low noise with few resources and maintaining bandwidth. Averaging is an example of a technique that reduces the bandwidth of a computation while lowering noise.

3.2 Noise in Analog Systems. Figure 2 shows a cascade of M analog computational stages with an input $V_{in}$ and output $V_{out}$. Each stage $i$ has a certain number of devices $n_i$, has a gain $g_i$, consumes a current $I_i$, consumes an area $A_i$, and adds a certain amount of noise $v_{ni}$. The cascade is representative of many analog computations that involve distributed gain amplification. In neurobiology, distributed gain amplification occurs in the dendrites of neurons or in the traveling-wave amplifier architecture of the cochlea.

The complexity of the computation sets a lower bound on the number of devices in each stage $n_i$ and on the total number of stages M. The ingenuity of the analog designer determines how close to the bound a realization of this system is. Depending on the details of the computation, the bound may be on $M \times \sum n_i$, on all the $n_i$, on $n_1$ and $M \times \sum n_i$, and so on.

We assume that the power-supply voltage $V_{DD}$ is fixed and is equal to or slightly greater than the linear voltage range of the system; otherwise power is unnecessarily wasted, with no increase in output signal-to-noise ratio. So, we choose not to operate the system in this nonoptimal situation. We also make two simplifying assumptions. We assume that the current $I_i$ and area $A_i$ of stage $i$ are divided equally among all the $n_i$ devices in the stage. We also assume that each of the $n_i$ devices contributes equally to the noise of the stage $v_{ni}$ and is amplified by the full gain $g_i$ of that stage. In practice, the circuit topology of a stage determines the amount of current through a device. The circuit topology also determines the noise contribution of that device to the noise of the stage. In spite of our simplifying assumptions, our model captures the general trend of the noise in each stage to increase with increasing $n_i$, and the noise at the output of the cascade to increase with increasing M.

The total mean-square noise at the output of the cascade, $v_{no}^2$, is made up of noise from each of the computational stages. The noise at the first stage is amplified by the cascaded gain of all the stages, whereas noise at the output of the $i$th stage is amplified by the cascaded gain of all stages from $i$ to M.
Figure 2: Noise accumulation in an analog system. The figure shows a cascade of M analog computational stages, each of which contributes some noise to the output. The common power supply is represented by VDD . If we want to minimize noise at the final output, Vout , subject to fixed constraints on total current consumption (sum of the Ii ’s) and total area consumption (sum of the Ai ’s), then equations 3.7, 3.11, and 3.13 show that the complex stages (stages with large values of ni ) and the early stages (stages with large amounts of accumulated gain) should get most of the system’s resources of current and area.
Therefore, the early computational stages typically contribute more noise than do the later stages (Haus & Adler, 1959). We define the noise gain from stage $i$ to the output as $G_i$, with

\[ G_i = \prod_{k=i}^{M} g_k. \tag{3.7} \]
Then, from equation 3.6, the assumptions of the previous paragraph, and Figure 2, we have the total noise at the output given by⁴

\[ v_{no}^2 = \sum_{i=1}^{M} v_{ni}^2 G_i^2 = \sum_{i=1}^{M} n_i \left( \frac{K_w(p)}{(I_i/n_i)^p} \, \Delta f + \frac{K_f}{(A_i/n_i)} \ln( f_h / f_l ) \right) G_i^2. \tag{3.8} \]
The nature of the computational task determines the requirements on fl and fh. The bandwidth of the system, Δf = fh − fl, is the overall bandwidth of the system at the output. Any individual computational stage may have a bandwidth higher than this, but that is not the bandwidth that is relevant for noise calculations at the final output. Suppose that we have a total amount of current IT, or equivalently power PT = VDD IT, at our disposal; suppose that we also have a total amount of area AT; that is,

$$\sum_{i=1}^{M} I_i = I_T, \qquad \sum_{i=1}^{M} A_i = A_T. \tag{3.9}$$
We now ask how we should distribute our current and area resources among the various stages to minimize the output noise given by equation 3.8. The answer to this question is a simple exercise in multivariable minimization through a Lagrange-multiplier technique. We find that the currents Ii and areas Ai should be distributed such that

$$I_i = \frac{w_i}{\sum_j w_j} I_T, \tag{3.10}$$

$$w_i = G_i^{2/(1+p)}\, n_i, \tag{3.11}$$

$$A_i = \frac{z_i}{\sum_j z_j} A_T, \tag{3.12}$$

$$z_i = G_i\, n_i. \tag{3.13}$$
With the optimal allocation of resources, the total noise at the output is given by

$$v_{no}^2 = \frac{\left(\sum_{i=1}^{M} w_i\right)^{1+p} K_w(p)\,\Delta f}{I_T^{\,p}} + \frac{\left(\sum_{i=1}^{M} z_i\right)^2 K_f \ln(f_h/f_l)}{A_T}. \tag{3.14}$$
This equation is the noise resource equation for our system. We find that the noise resource equation for the device (equation 3.6) and the noise resource equation for the system (equation 3.14) are very similar. The noise resource equation for the device modeled the technology with the p, Kw(p), and Kf parameters. The noise resource equation for the system added the effects of the complexity of the task and the ingenuity of the analog designer in the Σwi and Σzi terms. Both equations reveal that power and area resources lower thermal noise and 1/f noise (or offset), respectively. (Further subtleties of noise in analog systems are discussed in Sarpeshkar, 1997.)
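To make these relations concrete, the following sketch (a minimal one; the gains, complexities, bandwidths, and technology constants are invented for illustration) allocates current and area per equations 3.10 through 3.13 and verifies that direct evaluation of equation 3.8 reproduces the closed form of equation 3.14:

```python
import numpy as np

# Illustrative 3-stage cascade; all numbers below are made up.
p, Kw, Kf = 1.0, 4.4e-22, 1e-25          # subthreshold; placeholder constants
delta_f, fh, fl = 1e3, 1e4, 10.0         # bandwidth terms (Hz)
g = np.array([10.0, 10.0, 1.0])          # per-stage gains g_i
n = np.array([20, 5, 5])                 # per-stage device counts n_i
IT, AT = 1e-8, 1e-9                      # total current (A) and area (m^2)

# Noise gain from stage i to the output (equation 3.7): product of g_k, k >= i.
G = np.array([np.prod(g[i:]) for i in range(len(g))])

w = G**(2.0 / (1.0 + p)) * n             # equation 3.11
z = G * n                                # equation 3.13
I = w / w.sum() * IT                     # equation 3.10: current allocation
A = z / z.sum() * AT                     # equation 3.12: area allocation

# Total output noise by direct evaluation of equation 3.8 ...
v2_direct = np.sum(n * (Kw * delta_f / (I / n)**p
                        + Kf * np.log(fh / fl) / (A / n)) * G**2)
# ... and by the closed form of equation 3.14; the two agree at the optimum.
v2_closed = (w.sum()**(1 + p) * Kw * delta_f / IT**p
             + z.sum()**2 * Kf * np.log(fh / fl) / AT)
print(v2_direct, v2_closed)
```

Printing the allocation vectors I and A shows that the early, high-accumulated-gain, high-complexity stages receive most of the resources, in line with the discussion above.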
To first order, equation 3.14 quantitatively captures all the intuitive ideas about noise and offset that we expressed in items 4 through 7 of our analog-versus-digital list. Equation 3.14 reveals how noise accumulates in analog systems; if M and/or the ni are large, as would be the case for a complex computation, then the output noise can be large indeed. Equations 3.11, 3.13, and 3.14 show that if noise is to be minimized, more resources should be distributed to the parts of a system that affect all other parts of it (the initial stages) and to those parts of it that are complex (high ni). Above threshold, the weighting of power resources toward the early stages is more severe than it is for subthreshold ($G_i^{4/3}$ versus $G_i$). It is convenient to rewrite equation 3.14 as

$$v_{no}^2 = \frac{C_w}{P_T^{\,p}} + \frac{C_f}{A_T}, \tag{3.15}$$
where PT = VDD IT. The parameter Cw is simply the numerator of the first term of equation 3.14 multiplied by $V_{DD}^p$, and the parameter Cf is the numerator of the second term of equation 3.14.

3.3 The Costs of Analog Precision. In an analog system, the maximum possible amplitude of an output sinusoidal signal Y is VDD/2. The power of this signal is $V_{DD}^2/8$. For such a signal, the maximum possible signal-to-noise ratio is given by

$$S_N = \frac{V_{DD}^2}{8\, v_{no}^2}, \tag{3.16}$$
where $v_{no}^2$ is the noise power at the output. The parameter SN is important because the information H(Y) that we can observe at the output of our system is a monotonically increasing function of SN. The larger the value of SN, the more finely can we distinguish among states at the output, and the greater is the output precision. The exact form of the function depends on the amplitude distribution of the output signal and the output noise. For many practical situations, $H(Y) \approx (\log_2(1 + S_N))/2$ is a good approximation to the number of bits of information present at the output; this formula is exact if the amplitude distributions of the signal and noise are gaussian. The information at the output is an upper bound on the mutual information between the function of the input implemented by the computation and the output (Cover & Thomas, 1991). By using the expression for system-level noise from equation 3.15 in equation 3.16, solving for PT at constant SN and AT, and solving for AT at constant SN and PT, we get

$$P_T = \left( \frac{C_w S_N}{V_{DD}^2/8 - (C_f/A_T)\, S_N} \right)^{1/p}, \tag{3.17}$$

$$A_T = \frac{C_f S_N}{V_{DD}^2/8 - (C_w/P_T^{\,p})\, S_N}. \tag{3.18}$$
We refer to these equations as the resource precision equations for analog computation; they tell us how the resource utilization is a function of SN, the variable that determines the output precision. For small values of SN, the denominator is constant in both expressions, and $P_T \propto S_N^{1/p}$, while $A_T \propto S_N$. Since p = 1.0 in the subthreshold regime and p = 0.5 in the above-threshold regime, the scaling laws of power versus SN are $P_T \propto S_N$ in the subthreshold regime and $P_T \propto S_N^2$ in the above-threshold regime. The scaling laws for area, $A_T \propto S_N$, are the same in both regimes. The power cost PT diverges when SN is limited by 1/f noise (or offset); we must spend area in this situation to reduce 1/f noise (or offset). Similarly, the area cost AT diverges when SN is limited by thermal noise; we must spend power in this situation to reduce the thermal noise. Actually, these conclusions of divergence are true only for the subthreshold regime, where we cannot trade the power and area resources of a transistor to obtain a certain value of SN. Sarpeshkar (1997) shows how to trade between power and area in the above-threshold regime.

3.4 The Costs of Digital Precision. In many digital systems, the power and area costs are proportional to the number of bits b used in the computation. In such cases, a 12-bit computation consumes one-half as much area and one-half as much power as does a 24-bit computation if all parameters (such as clock frequency f, average switching capacitance C, and power supply voltage) remain fixed. If we do allow the clock frequency and power supply voltage to scale with the number of bits, as in a bit-serial implementation, then the power costs scale as a polynomial function of the number of bits. Some computations like multiplication have power and area costs that scale like the square of the number of bits. In general, most tractable computations scale as a polynomial function of the number of bits. For simplicity, we assume that the power and area costs are proportional to the number of bits. It is straightforward to extend the arguments that follow to the polynomial-scaling case, although a quantitative solution may not be possible for any general polynomial. Thus, the resource precision equations for digital computation are given by

$$P_T = L_p \log_2(1 + S_N), \tag{3.19}$$

$$A_T = L_a \log_2(1 + S_N), \tag{3.20}$$
where b is defined from the relationship $b \approx (\log_2(1 + S_N))/2$. The parameter La would scale like NWL, where W and L are the widths and lengths of a small transistor, and N represents the complexity of the task and the ingenuity of the digital designer. The parameter Lp would scale like $N f C V_{DD}^2$.
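The following sketch evaluates the analog power cost of equation 3.17 (for p = 1, subthreshold) against the digital power cost of equation 3.19. The constants Cw, Cf, AT, Lp, and VDD are placeholders, chosen only so that the curves cross near 10 bits, in the spirit of Figure 3 below:

```python
import numpy as np

VDD = 1.0
Cw, Cf, AT = 5e-13, 1e-14, 1e-6     # placeholder analog constants
Lp = 2e-7                           # placeholder digital cost per output bit-pair

def analog_PT(SN, p=1.0):
    # Equation 3.17; the power cost diverges as SN approaches the 1/f limit.
    denom = VDD**2 / 8.0 - (Cf / AT) * SN
    if denom <= 0:
        return np.inf               # SN unreachable: limited by 1/f noise
    return (Cw * SN / denom)**(1.0 / p)

def digital_PT(SN):
    return Lp * np.log2(1.0 + SN)   # equation 3.19

for bits in (2, 4, 6, 8, 10, 12):
    SN = 2.0**(2 * bits) - 1.0      # from bits = log2(1 + SN)/2
    print(bits, analog_PT(SN), digital_PT(SN))
# Analog is cheaper at low precision; digital wins at high precision, and the
# analog cost diverges (inf) once 1/f noise limits SN.
```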
3.5 Precision Costs: Analog Versus Digital. Figure 3 shows power and area resource precision curves for subthreshold analog computation (equations 3.17 and 3.18 with p = 1) and for digital computation (equations 3.19 and 3.20). We see that analog computation is cheaper than digital computation at low values of SN and more expensive than digital computation at high values of SN. Note also that the power cost diverges when SN is limited by 1/f noise (so that area, not power, must be spent) and that the area cost diverges when SN is limited by thermal noise (so that power must be spent). The exact location of the crossover point will depend on the task, technology, and ingenuity of the analog and digital designers. We have chosen values for Cw, Cf, Lp, and La such that the crossover happens near 10 bits (60 dB in SN). For many common computations in today's CMOS technology, the crossover happens near 8 bits.

Curves such as the ones in Figure 3 were first proposed for comparisons of delay operations in a seminal paper (Hosticka, 1985). Recently, there has been additional work on comparing analog and digital systems for delay operations (Furth & Andreou, 1996). Vittoz (1990) compared filtering operations in analog versus digital systems, and Kearns (1993) compared analog and digital systems for their performance on the tasks of comparing two N-bit numbers and of constructing delay lines. To our knowledge, the comparison presented in this article is the first to generalize the prior results to a broad class of analog and digital systems and to include the effects of 1/f noise and offset along with the effects of thermal noise.

3.6 Caveats. A/Ds and D/As are analog systems, and the costs of operating these systems at high precision (high SN) are high. In a digital system with analog inputs and outputs, the precision costs of the A/D and D/A are paid once at the front end and once at the back end, respectively. The cost of the high-precision processing between the front end and back end is determined by the digital system in between. Thus, the total cost of the overall system is made up of an analog part for the A/D and D/A, and a digital part for the rest of the processing. Our comparisons between analog and digital computation ignored the additional A/D and D/A costs of a digital system. In a sufficiently complex computation, the A/Ds and D/As represent a small fraction of the total cost of the computation. In an analog system doing the same high-precision processing, the high-precision analog costs are paid throughout all parts of the system, rather than only at the front and back ends; that is why, for a sufficiently complex task, a digital system with an A/D and D/A would still be more efficient than an analog system.

In practice, there is a minimum area or power consumption for both technologies that is independent of the value of SN; for example, the minimum feature size of a technology determines the minimum possible area that may be expended. Thus, both analog and digital curves flatten out to constant values at low SN. We have ignored such overhead costs in our simple analysis.
Figure 3: Resource precision curves—analog versus digital. Plots of the resource precision equations for analog computation (equations 3.17 and 3.18) and digital computation (equations 3.19 and 3.20) for subthreshold technology (p = 1). The plots show how the resource utilization—power in (a) and area in (b)—is a function of SN , the output signal-to-noise ratio (a measure of precision).
3.7 Summary of the Analog-Versus-Digital Analysis. Before we begin our discussion of hybrid systems in section 4, it is worth recapitulating the lessons learned from our analysis. Physical primitives are more efficient at computing than are logical primitives as long as we do not attempt to compute with low noise on one wire. Thus, the analog constants Cw and C f and the digital constants Lp and La are such that the analog curves lie below the digital curves at low SN . At high SN , however, the multiwire representation of information by digital systems divides the information processing into independent bit parts that many simple processing stages can collectively handle more efficiently than can one precise single-wire analog processing stage. This intuition is mathematically expressed by a logarithmic scaling of digital computation with SN , and a power law–like scaling of analog computation with SN . Furthermore, the lack of signal restoration in analog systems causes the noise accumulation for complex analog systems to be much more severe than that for complex digital systems. Thus, we have large values of Cw and C f for complex analog computations (large M, wi , or zi in equation 3.14), whereas Lp and La remain of reasonable size for the equivalent complex digital computation.
4 The Best of Both Worlds It is attractive to combine the best of both computing paradigms to make a hybrid paradigm that is better than either one. In this section, we suggest a framework for such a paradigm. In section 4.1 we show that analog computation that distributes its precision and processing resources over many wires is maximally efficient at a certain signal-to-noise ratio per wire. In section 4.2, we propose a hybrid architecture that combines the advantages of discrete-signal restoration with the advantages of continuous-signal continuous-time analog computation. In section 4.3 we describe a computing architecture that illustrates the simultaneous workings of distributed and hybrid computation.
4.1 Distributed Analog Computation. Figure 4a shows an example that illustrates the idea behind distributed analog computation. Instead of the usual analog paradigm that represents 8 bits of information on one wire, or the usual digital paradigm that represents 8 bits of information on 8 wires, in distributed analog computation, we represent 8 bits of information on two wires that carry analog signals; instead of one analog processor maintaining 8 bits of precision on its output wire, we now have two processors that interact with each other and maintain 4 bits of precision on their respective output wires. The analog signals each have a signal-to-noise ratio of 24 dB in order to encode 4 bits of information.
Figure 4: Distributed analog computation. (a) The idea behind distributed analog computation is illustrated by contrasting it with purely analog and purely digital computation. In distributed analog computation, analog processors interact with one another and maintain only a moderate amount of precision on their respective output wires. (b) Plots of the total cost of computation and communication as a function of SN in each wire, for c = 1, and for various w/c ratios in equation 4.1 are shown.
For example, we could encode the four most significant bits of a digital number as an analog signal on one wire by doing a 4-bit D/A operation on those bits. Similarly, we could encode the four least significant bits of the number as an analog signal on another wire. If the original signal was an analog signal present on a single wire, then
an 8-bit A/D encoder must first convert the single-wire analog representation into a digital number. The precision of this front-end A/D operation will be at a resolution of 8 bits. However, once we have a distributed representation (2 wires × 4 bits), all subsequent analog operations may be done at a resolution of 4 bits. As in digital computation, where overflows in one channel are handled via carry propagation to an adjacent channel, the analog processors must interact with each other appropriately to preserve their distributed representation. The interaction between analog processors necessarily involves interaction between their signal-restoration circuitry as well (signal-restoration circuitry is described in section 4.2).

Because each analog processor operates at a low precision, its power consumption and area consumption requirements are low. We are interested in knowing whether the total costs in power consumption and area consumption are lower for two 4-bit processors than for one 8-bit processor. We therefore ask the following question: Suppose we want to output N bits of information by outputting b bits of information from N/b analog processors on N/b wires. What is the optimal number of bits b on each wire such that the total power or area consumption of all circuitry is minimized?

To answer the question, we will have to take the costs of wiring (communication) and computation into account. Wires cost area and add capacitance. In order to keep the bandwidth of the system constant as capacitance is added, the power consumption in the system rises. The wiring costs for area increase in linear proportion to the number of wires.⁵ If bandwidth is to be maintained, the power consumption must rise in linear proportion to the total capacitance in the analog processor. Thus, the power costs of wiring also increase in linear proportion to the number of wires. In neurobiological systems, the power costs of wiring include the costs of active restoring circuitry in axons as well. Thus, wiring costs are a function of the technology.

5 The linear proportionality of area cost with the number of wires accounts for only the area occupied by the wires themselves. In practice, area costs for wiring will involve the area between wires and the area between computational elements as well. Such considerations cause the area cost function to be supralinear in the number of wires. For simplicity, we assume a linear function, as the supralinear case will not alter the basic nature of our conclusions.

From equations 3.17 and 3.18, for relatively small SN, where analog computation is more effective than digital computation, the power consumption and area consumption are power law functions of SN in the subthreshold and above-threshold regimes. Thus, the analog cost function for computation per processor is well described by $c S_N^l$, where l = 2 for above-threshold power consumption and l = 1 in all other cases of interest; here, c is a computation cost constant that accounts for all computation costs at each channel, including those necessary for interactions with adjacent channels and the cost of signal restoration circuitry in the channel. We will discuss only the case for l = 1, since the l = 2 case follows by straightforward extension. The cost function for wiring is given by a constant cost of w per wire. The number of bits per wire is $b = (\log_2(1 + S_N))/2$. Thus, the total cost function for computation and communication is given by

$$\mathrm{Cost} = (c S_N + w)\left(\frac{N}{b}\right) = (c S_N + w)\left(\frac{N}{0.5 \log_2(1 + S_N)}\right). \tag{4.1}$$
Figure 4b shows plots of the total cost of computation and communication as a function of SN in each wire, for c = 1 and for various w/c ratios. We see that when wiring is expensive (w/c = 10), the optimal signal-to-noise ratio is high, b is high, and we have few wires. When wiring is cheap (w/c = 0.1), the optimal signal-to-noise ratio is low, b is low, and we have many wires. By simple calculus, we can show that the optimal SN occurs when

$$\ln(1 + S_N) = \frac{S_N + w/c}{1 + S_N}. \tag{4.2}$$
The optimal value $S_N^o$ has the following limiting solutions:

$$S_N^o = \sqrt{w/c} \quad \text{if } w/c \ll 1, \tag{4.3}$$

$$S_N^o \ln S_N^o = w/c \quad \text{if } w/c \gg 1. \tag{4.4}$$
At the optimal value, the total cost of computation and communication is $2Nc \ln 2\,(1 + S_N)$. For the case where w/c ≪ 1, the cost is $2Nc \ln 2$. The cost of outputting all N bits from one single analog processor is $c\,2^N$. Thus, if N is sufficiently big, $2Nc \ln 2 \ll c\,2^N$. Therefore, if the amount of output information is large, it is better to distribute the information and information processing on many wires.

4.1.1 Caveats. In general, there may be overlap in the information distributed among the channels; for example, one wire may encode the six least significant bits of an 8-bit digital number, and the other wire may encode the six most significant bits of the 8-bit number. In that case, we have a redundant and correlated representation of amplitude information between the two wires. We do not analyze such cases here, for they are technically harder and do not illustrate the point any better.

In our analysis, we have ignored the front-end costs of distributing the information from a single wire onto many wires. As we described in section 3.6, this operation is analogous to an A/D encoding cost that we pay once at the front end. For a sufficiently complex computation where we do a lot of distributed computation, this cost is negligible. Similarly, if we must
eventually collapse distributed information back onto a single wire (e.g., at the output end of the system), then we will have to pay a high-precision decoding cost, as in an output D/A. If the encoding and decoding costs are a significant part of the computation, then we have another trade-off in having our representation be highly distributed. An excessively distributed representation may require very complex encoding and decoding operations (such as A/Ds and D/As) that grow in an exponential fashion with the number of wires. The optimization of resources must then include the costs of encoding and decoding in addition to those of computation and communication.

4.2 Hybrid Computation. Noise always accumulates in a cascade of analog processing stages. If a computation is sufficiently complex, then at some point, an analog system simply cannot maintain enough precision at its output to do anything useful. Even if we require the system to maintain only 1 bit at its output, it will be unable to do so. We now show how to use a building block called the A/D/A, and an architecture that uses A/D/As, for solving the noise accumulation problem in analog systems.

The A/D/A is an A/D converter that is immediately followed by a D/A converter. However, its most efficient circuit implementation does not involve explicit implementation of an A/D converter and a D/A converter. The A/D/A has been proposed as a useful building block for various analog and digital storage and processing applications (Cauwenberghs, 1995). The basic ideas are illustrated in Figure 5. A hybrid link is a set of analog processing stages (denoted Ai in the figure) followed by an A/D/A that restores the analog signal to one of M discrete attractor states. A hybrid chain is composed of a sequence of hybrid links. Each chain can maintain analog information to a precision of N = log2(M) bits with a low probability of error, provided that we meet the following constraint: The net input-referred noise of the A/D/A, due to all processing stages in a link and the restoration circuits in the A/D/A, must be significantly lower than the minimum distance between attractor states. In section 4.2.1, we show that an error probability of $10^{-12}$ can be achieved in an N-bit hybrid link if the input-referred noise is low enough that we operate with a precision of N + 4 bits. Thus, in order to restore signals reliably, we need four redundant bits of precision. To keep the error probability low in a hybrid chain composed of many links, the requisite precision before restoration needs to grow only very slowly with the number of links in the chain (like log(log(chain length))). Thus, a hybrid chain can do an extremely large amount of analog processing and still maintain a precision of log2(M) bits at its output. Effectively, we can operate with the precision and complexity characteristic of digital systems, while doing efficient analog processing. If we assume that we do not want to have more than 8 bits of precision at the input to the A/D/A, then the best A/D/A that we can build would restore a signal to 4 bits of precision.
Figure 5: Hybrid computation. In this form of computation, analog processing is followed by restoration of the analog signal to a set of M discrete attractor states. As discussed in section 4.2, hybrid chains allow us to operate with the precision and complexity characteristic of digital systems, while doing efficient analog processing.
Using A/D/As is probably not a good technique for maintaining anything more than 4 bits of precision on an analog input. As we shall discuss in section 4.3, the main use for A/D/As is in distributed analog computation, where it is unnecessary to maintain too much precision on one wire.

To maximize the efficiency of information processing in a hybrid chain, there is an optimal amount of analog processing that must occur before signal restoration in a hybrid link; that is, hybrid links should not be too long or too short. If the link is too long, we expend too much power (or area, or both) in each analog stage to maintain the requisite precision at the input of the A/D/A. If the link is too short, we expend too much power (or area, or both) in frequent signal restorations. In section 4.2.2, we analyze the optimal length of a hybrid link quantitatively. Needless to say, if we are unconcerned about efficiency, then the link can be as long or as short as we like, as long as we meet the A/D/A constraint.
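As a toy illustration of why such links work, the following sketch (all numbers invented) pushes a signal through a chain of noisy unity-gain stages, with and without periodic A/D/A restoration to four attractor levels:

```python
import numpy as np

# Toy hybrid chain: unity-gain analog stages each add gaussian noise, and an
# A/D/A after every M_link stages rounds the signal back to the nearest of
# four attractor levels. Noise level, link length, and chain length are made up.
rng = np.random.default_rng(1)
levels = np.array([0.0, 1.0, 2.0, 3.0])      # M = 4 discrete attractor states
sigma_stage, M_link, n_links = 0.05, 10, 50  # 500 analog stages in total

def run_chain(x, restore=True):
    for _ in range(n_links):
        for _ in range(M_link):
            x = x + rng.normal(0.0, sigma_stage)     # analog stage adds noise
        if restore:
            x = levels[np.abs(levels - x).argmin()]  # A/D/A restoration
    return x

print(run_chain(2.0, restore=True))   # stays at 2.0 with high probability
print(run_chain(2.0, restore=False))  # drifts: accumulated noise ~ sqrt(500)*0.05
```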
circuit. We can generalize this two-state restoration to an M-state restoration by having M − 1 input threshold levels and M output state levels. The input signal is compared with M − 1 threshold levels and is rounded off to that attractor state level that it is closest to. Systems like these have been proposed for multistate logic systems. Figure 6a shows the threshold levels VTi and restoration levels VLi for a four-state or 2-bit system. The arrows converge on restoration levels and diverge from threshold levels. The A/D/A modifies the digital restoration scheme for M states to an analog restoration scheme for M states. In the analog restoration scheme, M can be arbitrary and does not have to be 1, 2, 4, 8, 16, 32, and so on. It can be any arbitrary number that we choose because, unlike multistate logic, we do not do any digital computation with our inputs or outputs. The input Vin is an analog signal that may have been processed by many analog stages. The output Vout is a restored and filtered analog signal that can serve as an input to future analog-processing stages. Figure 6b shows a circuit for one possible implementation of a four-state A/D/A.6 The analog signal is compared with three thresholds, and zero, one, two, or three currents are switched onto a resistor, whose voltage then equilibrates at VL1 , VL1 + IR, VL1 +2IR, or VL1 +3IR, respectively. The RC circuit acts as a filter and removes sharp edges in the signal. The capacitance is chosen such that 1/RC is at or near the desired bandwidth of the input. Figure 6a shows that if an input analog signal happens to be exactly at a threshold level VTi , then it will be constantly restored at random to the attractor state above or below it. However, since we are always within half a bit of the analog input, this random restoration still preserves the input information to within 1 bit, as desired. All other analog inputs are restored to within a half-bit of their input values as well. Thus, we preserve information in the analog signal to a precision of log2 M bits. Now we analyze how large the input noise and offset of the A/D/A can be if we need to preserve a precision of log2 M bits in the output analog signal. Suppose that because of noise and offsets, the input signal is described by a gaussian probability distribution with variance σ 2 , as shown in Figure 6c.7 If the analog input is situated at a threshold level VTi , then it needs to deviate by a full 1-bit distance from this level for a bit error to occur. If, on the other hand, the analog input is situated at a restoring level VLi that is not at the extremes such as VL1 or VL4 , but rather is midway such as VL1 and VL2 , then a deviation from this level by half-bit distance is sufficient for 6 There are vastly more efficient circuit representations that we can use to construct an A/D/A. However, we do not discuss these here because they are of a more technical nature and require a background in analog circuit design. 7 The gaussian assumption is not essential to the qualitative nature of our arguments, although it does affect our quantitative answers. If the probability distribution was not gaussian, we may still perform the calculations outlined below, although closed-form answers may not be possible.
Figure 6: The A/D/A. (a) The threshold levels VTi and restoration levels VLi for a four-state or 2-bit A/D/A system. (b) A circuit for one possible implementation of a four-state A/D/A. (c) The probability of a bit error for a worst-case situation when the input is at VL2 is given by the area under the gaussian tails—to the left of VT1 and to the right of VT2 . Section 4.2.1 provides further details.
Thus, we analyze this worst-case situation for the input situated at VL2. Let the variance of the noise be σ². The distance between a threshold level and a restoration level is bd/2, where bd is a bit distance given by (VLM − VL1)/(M − 1) in an M-state A/D/A. The probability of a bit error, Pe, is then given by the area under the gaussian tails in Figure 6c, that is, to the left of VT1 and to the right of VT2. Thus, Pe is given by

$$P_e = \operatorname{erfc}\!\left(\frac{b_d/2}{\sigma\sqrt{2}}\right), \tag{4.5}$$
where erfc(x) is defined by

$$\operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-u^2}\, du \approx \frac{e^{-x^2}}{\sqrt{\pi}\, x}. \tag{4.6}$$
Now $P_e = 1 \times 10^{-12}$ if $b_d/(2\sigma\sqrt{2}) = 5.04$. Thus, $b_d = 2\sqrt{2} \times 5.04\,\sigma = 14.3\,\sigma$. Hence, to restore the signal faithfully, with a low bit-error rate, an N-bit A/D/A requires that the precision at its input be approximately N + 4 bits ($\log_2(14.3) \approx 4$).

4.2.2 The Optimal Length of a Hybrid Link. For simplicity, assume that our computation is a cascade of N identical analog processing stages, as in a many-pole filter. By the reasoning of the last paragraph of section 4.2, if the stages are not identical, we can show that an optimal length still exists; however, the closed-form solution is hard to obtain. The simplest case with identical gains for which we may obtain a closed-form solution corresponds to all the stages having unity gain. Thus, we shall discuss only the case with identical unity-gain stages, to avoid complexity that does not add much insight. For similar reasons, we shall analyze only the simple case of current (power) optimization, assuming that the 1/f (or offset) terms in the noise resource equation 3.6 are negligible. Other simplifying assumptions include that p = 1 (subthreshold) and that we pay a fixed cost in power per A/D/A restoration stage.⁸

8 There is a possible variant of the problem, where we simultaneously optimize the power allocated between the A/D/A stages and the computation stages, as well as the number of stages per link.

Suppose that there are M computational stages and 1 A/D/A in every hybrid link. Then there will be N/M links, with a total of N computational stages and N/M A/D/As in the chain. Suppose that the complexities of the A/D/A stage and of each computational stage correspond to nr and nc devices, respectively. By equation 4.5, corresponding to whatever error criterion we pick, the input-referred noise σ at every A/D/A must be less than or equal to some value σt. The value of σt depends only on bd, the distance between attractor states in the A/D/A, which is fixed by the precision desired for a given hybrid chain. Thus, from equations 3.14, 3.11, and 3.13, with ni = nc for all i and Gi = 1 for all i, the noise due to the computational stages in a link is given by

$$v_c^2 = \frac{(M n_c)^2\, K_w(1)\,\Delta f}{I_C}, \tag{4.7}$$
where IC is the total power consumption in the computational stages. Similarly, the noise due to an A/D/A stage in a link is given by

$$v_r^2 = \frac{n_r^2\, K_w(1)\,\Delta f}{I_R}, \tag{4.8}$$
where IR is the fixed current consumption of the restoration stage. The A/D/A constraint gives us

$$v_c^2 + v_r^2 = \sigma_t^2. \tag{4.9}$$
Algebraic manipulation of equations 4.7, 4.8, and 4.9 then yields
$$I_C = \frac{M^2\, n_c^2\, K_w(1)\,\Delta f}{\sigma_t^2 - \dfrac{n_r^2\, K_w(1)\,\Delta f}{I_R}} \tag{4.10}$$

$$\phantom{I_C} = M^2 C_c, \tag{4.11}$$
where Cc is defined by the preceding equations. The total current consumption due to the N/M links in the entire chain is then given by

$$I_{CH} = \left(\frac{N}{M}\right)(I_C + I_R) = N\left(C_c M + \frac{I_R}{M}\right). \tag{4.12}$$
Figure 7 shows a plot of the current consumption for differing values of Cc = 2.06 pA, 5.11 pA, 9.52 pA, and 15.34 pA; IR is fixed at 100 pA. The parameter Cc was changed by varying σt in equation 4.10. Thus, as we increase the precision of the hybrid link, the costs of computation rise with respect to the costs of signal restoration, and the optimal length of the link decreases. The mathematics is in accord with the intuition expressed in the last paragraph of section 4.2. The curves in Figure 7 were drawn for Δf = 100 Hz, $K_w(1) = 4.38 \times 10^{-22}$, nr = 3, and nc = 150. It is easy to show that the location of the optimum in equation 4.12 is given by

$$M = \sqrt{\frac{I_R}{C_c}}. \tag{4.13}$$
4.3 Distributed and Hybrid Computation. Figure 8 combines the ideas of sections 4.1 and 4.2. The information from a single-wire analog input is encoded onto many wires by an analog encoder. Typically, the encoder might be more redundant and thus might distribute the information over many more wires, but for simplicity, we have shown a nonredundant encoder.
Figure 7: Optimal length of a hybrid link. (a) A hybrid chain with M stages of computation per link, N/M links, and N total stages of analog computation. (b) A plot of the current consumption (obtained from equation 4.12) versus link length (M) for differing values of precision, parameterized by σt , the inputreferred noise at the A/D/A. As the precision increases, the optimal length of the hybrid link is shortened. Section 4.2.2 provides further details.
A cochlea, retina, and A/D are all good examples of encoders that distribute information from one wire onto many wires. In this example, we have an analog encoder, so if we used an A/D, we would have to follow it with a D/A. In the example of Figure 8, the distributed information is preserved in the first stage of processing by 2-bit A/D/As. In the next stage of processing, the analog processors or the A/D/As, or both, make decisions based on the information and reduce the output information to 1 bit. Thus, the analog circuits in the second half can afford to be noisier, since the A/D/A restoration has a precision of only 1 bit. The use of distributed analog computation and low-precision A/D/A signal restoration makes this architecture ideal for efficient precise computation.
Figure 8: Distributed and hybrid computation. The information from a singlewire analog input is encoded onto many wires by an analog encoder such as a cochlea, retina, or A/D. Interacting hybrid chains process the information on these wires. Section 4.3 provides further details.
Mixed-signal circuits that involve analog and digital techniques have been proposed for efficient low-precision sensory data processing (Martin, 1996). Distributed-and-hybrid schemes, such as ours, illustrate how mixed-signal circuits can be architected to be suited for high-precision processing as well. For example, it is possible to implement efficient high-precision arithmetic circuits using distributed-and-hybrid architectures. The results from several low-precision analog addition and multiplication operations are appropriately combined via carry and A/D/A interactions. A more
detailed description of such arithmetic architectures is outside the scope of this article and a topic of our research; these architectures may represent the first practical applications of the ideas described in this article.

5 Extrapolating to Neurobiology

Our analysis for electronic systems suggests why neuronal information processing is distributed, why information processing in the brain is likely to be hybrid, and how signal restoration in neurons may be implemented. In sections 5.1 through 5.3, we discuss these suggestions in more detail. In sections 5.4 and 5.5, we discuss how our arguments about noise in electronic systems can be extrapolated to neurobiology.

5.1 Why Neuronal Information Processing Is Distributed. Information processing in networks of neurons is accomplished in a tremendously distributed fashion. It has often been pointed out that this distribution results in fault-tolerant behavior, since the destruction of any one neuron or synapse hardly affects the operation of the overall network. However, we suggest that the primary reason for the distributed nature of neuronal information processing is not fault tolerance but efficiency. We showed in section 4.1 that if the costs of computation are to be cheap, then the information and information processing must be distributed across as many wires as possible. However, if the costs of communication are to be cheap, then the information and information processing must be localized among as few wires as possible. The trade-off between these two constraints, as revealed in equation 4.1, results in an optimal number of wires and an optimal signal-to-noise ratio per wire, as revealed in Figure 4. In neurobiological systems, where communication costs are relatively low compared with communication costs in silicon, the optimal signal-to-noise ratio is lower than that in silicon.⁹ Thus, we believe that nature was smart to distribute computational resources over many noisy neurons (dendrites and somas) and to communicate that information between neurons over many noisy fibers (axons). The noisiness of the brain is due to the wisdom of millions of years of evolution and is not a reflection of the incompetence of biology. We believe that the "use" of neuronal noise in phenomena such as stochastic resonance, or in phenomena that prevent trapping in local minima, may be valuable in certain special cases, but the primary reason for the noisy nature of the brain is efficiency.

9 In today's electronic technology, it would be unthinkable even to dream of wiring on the scale of neurobiology. For example, the million fibers of the optic nerve or the 35,000 fibers of the auditory nerve would simply be too expensive to implement.

Experimentally based estimates of the energy required to transmit a bit of information in various stages of the blowfly retina are rather large (Laughlin, van Stevenick, & Anderson, 1998a; Laughlin, Anderson, O'Carroll, & van Stevenick, 1998b). Therefore, these authors have independently arrived at conclusions very similar to ours: Distributed coding of information among multiple pathways is important for energy efficiency in noise-limited systems.

5.2 Information Processing in the Brain Is Likely to Be Hybrid. Action potentials are all-or-none discrete events that usually occur at or near the soma or axon hillock. In contrast, dendritic processing usually involves graded synaptic computation and graded nonlinear spatiotemporal processing. The inputs to the dendrites are caused by discrete events. Thus, in neuronal information processing, there is a constant alternation between spiking and nonspiking representations of information. This alternation is reminiscent of the constant alternation between discrete and continuous representations of information in Figure 5. Thus, it is tempting to view a single neuron as a D/A/D. However, although the firing of a spike is a discrete event, it does not imply that it encodes information about a discrete state. The information encoded by a spike is meaningful only in relation to spikes in different neurons, or in relation to earlier or later spikes in the same neuron. If these relationships are analog, then all-or-none events do not imply the encoding of discrete states.

So how do we know whether the brain is analog (continuous signal) or digital (discrete signal) or hybrid (both)? Almost everybody accepts that the brain does a tremendous amount of analog processing. The controversy lies in whether there is anything digital about it. We know, from the arguments of this article, that the noise accumulation in complex systems is simply too high for purely analog processing to be efficient in such systems. Given that the brain is made up of a large number of physical devices that exhibit noise at room temperature and is yet extremely efficient (12 W power consumption and 300 ms response time for complex tasks), we may hypothesize that it must be mixing continuous-signal and discrete-signal processing to compute in a hybrid fashion. In section 5.4 we review noise in biological devices, and in Sarpeshkar (1997) we review numbers on the interconnectivity and complexity of the brain's architecture. These reviews suggest that although it is theoretically possible that the brain's complexity is small enough that a purely analog brain could be efficient, a purely analog brain seems unlikely. However, more quantitative studies need to be done on noise in biological devices and on the architecture of the brain before we can conclusively rule out the possibility of a purely analog brain. Thus, the suggestion that the brain is hybrid is only a hypothesis supported by our quantitative arguments from electronics and by some qualitative facts from our current knowledge of neurobiology.

5.3 How Signal Restoration May Be Implemented. To implement signal restoration, there must be a set of discrete states that the continuous
signal is periodically restored to. How are the discrete restorative states of neurons encoded in the firing of action potentials? Conceptually, at the level of a single neuron, the discrete states of a spike train may be encoded in the number of spikes that occur in a given window of time (the mean-firing-rate code), or in a discrete set of firing patterns that occur within that same window of time (the timing-pattern code). Such codes are scalar codes since they involve only one neuron. As experimental (Abeles, Bergman, & Gat, 1995) and theoretical (Hopfield & Herz, 1995) work indicates, it is more likely that discrete states involve a vector code that is implemented in a collective fashion across many neurons. The window of time over which we count spikes or detect temporal patterns within the spikes is determined by the integration time constants of the neurons.

The mean-firing-rate and timing-pattern scalar codes have direct analogies in vector codes. In the mean-firing-rate case, instead of counting the number of spikes in one neuron within a window of time, we count the number of spikes across many neurons that are present within some time window. In the timing-pattern case, instead of a discrete set of firing patterns of one neuron that occur within some time window, we have a discrete set of cross-correlated firing patterns of many neurons within some time window. For simplicity, we shall assume that our time window is short enough that each neuron contributes at most one spike within that time window. It is easy to generalize our ideas to multiple spikes within one time window.

The key building block of our electronic signal restoration schemes was the A/D/A, which was basically an A/D followed by a D/A. In the signal-representation scheme of neurons, how might we build an A/D/A? We shall discuss only signal restoration for vector codes.
5.3.1 Von Neumann Restoration for Spike Counting. Suppose we have 3N neurons. We group them into N sets of three each. For each of the N sets, we perform a simple majority vote and regenerate three signals, each of which encodes the result of the majority vote. Thus, if we have (spike, spike, no spike) across the three neurons, we restore this signal to (spike, spike, spike). If we have (no spike, spike, no spike) across the three neurons, then we restore this signal to (no spike, no spike, no spike). Thus, we restore the original 3N + 1 possible states (ordering of neurons does not matter) into N + 1 possible states. Just as in the A/D/A, if we want to have low rates of error, we must compute with more redundancy (20N instead of 3N). The majority-vote scheme was first proposed by John Von Neumann (1952) as a way of doing signal restoration.

Note that in this scheme, we are really restoring a fine-grained discrete quantity to a coarse-grained discrete quantity. In the A/D/A, we restore a continuous analog quantity, with a fine-grain size determined by analog noise, into a coarse-grained discrete quantity.
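A behavioral sketch of this majority-vote restoration (the spike vectors below are invented):

```python
# Triple-redundancy majority-vote restoration of section 5.3.1: 3N binary
# 'spike' lines are grouped into N triples, and each triple is restored to
# all-spike or all-silent by majority vote.
def restore_majority(spikes):
    restored = []
    for i in range(0, len(spikes), 3):
        vote = 1 if sum(spikes[i:i + 3]) >= 2 else 0
        restored.extend([vote] * 3)
    return restored

clean = [1, 1, 1, 0, 0, 0, 1, 1, 1]          # N = 3 triples
noisy = [1, 1, 0, 0, 1, 0, 1, 1, 1]          # one line flipped in two triples
print(restore_majority(noisy))               # -> [1, 1, 1, 0, 0, 0, 1, 1, 1]
# One flipped line per triple is corrected; two simultaneous flips are not,
# which is why low error rates demand more redundancy (20N instead of 3N).
```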
5.3.2 Restoration for Spike Timing. Here, we detect the presence of a discrete timing pattern by building suitable delays in the dendrites or synapses or input axons of a "matched-filter" neuron such that the inputs from the N neurons that encode the timing pattern arrive in synchrony at the axon hillock (Abeles, 1991). The matched-filter neuron regenerates the timing pattern by fanning out collaterals to a set of N output neurons with appropriate axonal or synaptic delays such that the timing pattern is regenerated. The restoration in pattern timing will occur if the matched-filter neuron is configured to respond to inputs with somewhat skewed timing patterns; this is accomplished by setting its threshold so that it is not too high. If we want to restore M timing patterns that are encoded on N input axons, then we need M matched-filter neurons and N output neurons. Each of the N output neurons could receive inputs in parallel from the M matched-filter neurons, as, in a good design, only one of the M matched-filter neurons would be active at any given time. As in the A/D/A, if we want to ensure low error rates, M should be significantly less than the possible number of timing patterns encoded among the N neurons. It is also crucial that the delays involved in regeneration be precise enough to maintain a precision that is a few bits above log2(M) bits.

In digital electronic circuits, an inverter performs restoration and computation at the same time. It inverts its input (1 goes to 0, and 0 goes to 1), but it is also restorative, since a "bad 1" is restored to a "good 0." Similarly, in a time-delay restoration scheme, we could have the regenerated pattern be a different timing pattern, such that a somewhat skewed input temporal pattern is restored to a clean output temporal pattern. In a pattern-recognition computation, such as that performed by an associative memory, computation and restoration are intermingled because the nature of the computation inherently requires a discrete set of outputs.

5.3.3 Caveats. We have made many simplifying assumptions, such as treating computation and restoration as distinct entities, and similarly treating computation and communication as separate entities. It is likely that such entities are more deeply intertwined in the brain and that the rather sharp digital restorations that we propose are really soft restorations in the brain, such that a more accurate description would need to involve the language of complex nonlinear dynamical systems. The processing of information in a single dendrite, let alone the whole brain, is enormously complex. Such processing could be very useful in performing signal restoration within the level of the dendrite itself. Thus, we do not, by any means, claim that the brain is implementing the particular architectures and the particular restorative schemes that we have proposed. We have merely offered our schemes as possibilities, in the hope that they will stimulate further discussion and work on the subject.

The importance of action potentials for avoiding temporal dispersion and signal attenuation over long communication distances is well known.
That is not the issue under discussion in this article, since we take that issue to be resolved. Rather, the issue under discussion revolves around the importance of action potentials for signal restoration in complex local networks of neurons. In the latter case, the complexity of computation degrades the signal-to-noise ratio due to the large number of processing steps, and restorative action-potential codes serve to preserve the signal-to-noise ratio.

In this article, we have emphasized that the hybrid and distributed nature of the brain's signal processing is likely to be an important and underappreciated reason for its efficiency. Other reasons for the efficiency of the brain are discussed in detail in Sarpeshkar (1997). They include the marvelous technology of devices and interconnect available to the brain, its nonlinear and adaptive signal processing strategies, and its strategies for processing only the information that is useful for solving a given computational task.

5.4 Noise in Biological Devices. In any technology, the starting point for an analysis of the information costs of computing is the noise resource equation of that technology. It was the noise resource equation for MOS technology (see equation 3.6) that enabled us to construct a set of resource precision equations (equations 3.17 and 3.18). The resource precision equations evaluated the costs of a computation as a function of the output information or precision. What might the noise resource equations for neurobiological devices look like? Due to the great diversity of biological devices and the incomplete knowledge that we have about their functioning, a quantitative theory for the technology of neurobiology seems premature. However, we can make qualitative statements that reveal how the noise can be decreased with an increase in resource consumption.

The limiting form of noise in biological devices is typically the randomness in ion channel openings and closings (DeFelice, 1981) and the unreliability of synaptic vesicle release (Allen & Stevens, 1994). Channels transition between discrete closed and open states with certain finite probabilities per unit time. The transition probabilities depend on the membrane voltage or the chemical concentration of a substance. For a good discussion of the kinetics of ion channels, see Weiss (1996). The noise can be reduced by √N law-of-large-numbers averaging over several ionic channels (i.e., through the increase of ion channel densities). Similarly, the noise of synaptic transmission may be reduced through the use of averaging over many vesicles, many synaptic contacts, and so on. Such averaging costs area and also increases power consumption, since the power per unit channel, vesicle, or contact is approximately constant. It is intuitive to expect that averaging over large areas of membrane would improve offsets and 1/f noise, but we are unaware of any actual experimental measurements that address whether it does. Interestingly, as in electronics, the magnitude of the 1/f noise in biology is highly unpredictable. It is dependent on the concentrations in the
cellular environment of substances that alter transport properties of nerve membrane (DeFelice, 1981). In electronics, 1/f noise is also strongly dependent on the concentrations of impurities in an insulating membrane, the gate oxide.

Averaging strategies were at the root of a reduction in noise in electronic systems as well. In electronics, we averaged over more electrons per unit time (to reduce thermal noise by increasing power), or over more traps and impurity defects (to reduce 1/f noise and offset by increasing area). In biology, we average over more ion channels or over larger stretches of membrane. Indeed, a simple √N averaging would yield noise resource equations that are similar to equation 3.6, with p = 1. However, there are suggestions that neurobiological systems may be even smarter and may attain noise reductions that scale like 1/N rather than like 1/√N (Salman, Soen, & Braun, 1996); such scaling laws require the use of interactions between channel kinetics and membrane kinetics through the use of membrane voltage feedback.

In situations where it is important to maintain reliability and precision, such as at a neuromuscular junction, there is a lot of averaging over numerous synaptic connections and synaptic vesicles. In situations where it is not that important to be very reliable, such as in the highly distributed architecture of cortex, there is little averaging over synapses or vesicles (Koch, 1997). When timing must be precise, synapses are typically large (Zhang & Trussell, 1994). From numerous examples, it is qualitatively clear that the reduction of noise is accomplished through resource consumption in neurobiology, as it is in electronics. Neurobiology and electronics behave similarly because physical and mathematical laws like the laws of thermodynamics and the law of large numbers do not change with technologies.

5.5 Noise in Neurobiological Systems. In section 3.2, we abstracted the mapping from computational task to circuit topology in the parameters M and ni. The ingenuity of the designer lies in mapping the task to the primitives and architecture of the technology, such that M and ni are as small as possible. Consequently, when function is well mapped to structure, noise is minimized; the wiring of the architecture also is more efficient. Computational architectures where function and structure are well matched amplify the computational information above the noise in the components. This amplification is analogous to the way a matched filter amplifies the signal above the background noise.

Two topologies that may be completely equivalent functionally may have markedly different noise properties. For example, suppose that in topology A we take the difference between a large, positive current and a large, negative current to output a small differential current; in topology B we just output a small differential current. The noise of topology A will be much higher than that of topology B, even though the two topologies may be
indistinguishable as far as outputs go. Thus, the mapping from function to structure must be done with care.

It is clear that natural structures have evolved to match structure and function. The architectures of the cochlea, the retina, the hippocampus, the cerebellum, the neocortex, and various other regions have patterns of connectivity, cellular organization, and cell differentiation that indicate a close relationship between structure and function. Cells of various anatomical types are specialized to have certain functional characteristics. (For a good review, see Shepherd, 1990.) For noise minimization, resource allocation should be increased in the initial and complex stages of a computation, as discussed in section 3.2.

6 Summary

We conclude by reviewing the main points of the article:

1. Analog computation is efficient at low-precision processing, and digital computation is efficient at high-precision processing. The resource precision equations for analog computation (equations 3.17 and 3.18) and the resource precision equations for digital computation (equations 3.19 and 3.20) quantify the costs of computing at a certain precision in MOS technology. Figure 3 shows a plot of the costs of analog and digital computation at different levels of precision. The noise resource equation of a technology (equation 3.6 for MOS technology) determines the form of the resource precision curves for that technology.

2. The advantages of analog computation arise from its exploitation of physical primitives for computation. The advantages of digital computation arise from its multiwire representation of information and information processing, and from its signal restoration properties.

3. Analog computation that distributes its precision and processing resources over many wires is maximally efficient at a certain signal-to-noise ratio per wire, due to the trade-offs between computation and communication. Equation 4.1 and Figure 4 illustrate this fact in more detail.

4. We proposed a hybrid architecture that combines the advantages of discrete-signal restoration with the advantages of continuous-signal, continuous-time analog computation. The key building block of such a hybrid scheme is a restoration circuit called an A/D/A, which is described in section 4.2.1. Figures 5 and 6 illustrate the workings of the hybrid scheme. For maximum efficiency in a computation, there is an optimal amount of continuous analog processing that must be done before a discrete signal restoration; Figure 7 and equation 4.12 illustrate how this optimum can be determined.
5. We described a computing architecture that illustrates the simultaneous workings of distributed and hybrid computation in section 4.3 and Figure 8. Distributed and hybrid computation combines the best of the analog and digital worlds to create a world that is more efficient than either.

6. In neurobiological systems, where communication costs are relatively low compared with communication costs in silicon, the optimal signal-to-noise ratio per wire is lower than that in silicon. Thus, we believe that nature was smart to distribute computational resources over many noisy neurons (dendrites and somas) and communicate information between neurons over many noisy wires (axons).

7. Since the brain appears to be extremely efficient in its information processing, and hybrid representations are the most efficient representations in massively complex systems, it is likely that the brain uses hybrid representations.

8. Experiments suggest that discrete states in the brain are encoded in the cross-correlated firing patterns of neurons in a network (Abeles et al., 1995). Neuronal information processing is thus most likely to involve vector signal restoration. In section 5.3, we discussed how signal restoration in networks of neurons may be implemented using A/D/A-like schemes.

9. From numerous examples, it is qualitatively clear that in neurobiology, the reduction of noise is accomplished through resource consumption, as it is in electronics. Neurobiology and electronics behave similarly because physical and mathematical laws, such as the laws of thermodynamics and the law of large numbers, do not change with technologies. It is such laws that, with a few technology-dependent parameters, determine noise resource equations. Since our conclusions depend only on general properties of noise resource equations, such as a polynomial reduction in noise with resource consumption, we suggest that our extrapolation from electronics to neurobiology is correct to leading order.

Acknowledgments

I thank Carver A. Mead, John Allman, Christof Koch, Yaser Abu Mostafa, Kwabena Boahen, Sanjoy Mahajan, Mike Levene, David Kewley, and a couple of anonymous reviewers for useful discussions. As well, I thank Lyn Dupre for her detailed editing of the manuscript, and Shih-chii Liu for a thorough review of this work.
References

Abeles, M. (1991). Corticonics (pp. 227–259). Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., & Gat, I. (1995). Cortical activity flips among quasi-stationary states. Proceedings of the National Academy of Sciences of the United States of America, 92, 8616–8620.
Aiello, L. C., & Wheeler, P. (1995). The expensive-tissue hypothesis: The brain and digestive system in human and primate evolution. Current Anthropology, 36, 199–221.
Allen, C., & Stevens, C. F. (1994). An evaluation of causes for unreliability of synaptic transmission. Proceedings of the National Academy of Sciences, 91, 10380–10383.
Allman, J. (1990). Evolution of neocortex. Cerebral Cortex, 8A, 269–283.
Cauwenberghs, G. (1995). A micropower CMOS algorithmic A/D/A converter. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 42, 913–919.
Chandrakasan, A., Sheng, S., & Brodersen, R. W. (1992). Low-power CMOS digital design. IEEE Journal of Solid-State Circuits, 27, 473–484.
Cover, T., & Thomas, J. (1991). Elements of information theory (p. 20). New York: Wiley.
DeFelice, L. J. (1981). Introduction to membrane noise. New York: Plenum Press.
Denyer, P., & Renshaw, D. (1985). VLSI signal processing: A bit serial approach. Reading, MA: Addison-Wesley.
Furth, P. M., & Andreou, A. G. (1996). Bit-energy comparison of discrete and continuous signal representations at the circuit level. Proceedings of the 4th Physics of Computation Conference, Boston.
Haus, H., & Adler, R. (1959). Circuit theory of linear noisy networks.
Hopfield, J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Toward computation with coupled integrate-and-fire neurons. Proceedings of the National Academy of Sciences, 92, 6655–6662.
Hosticka, B. J. (1985). Performance comparison of analog and digital circuits. Proceedings of the IEEE, 73, 25–29.
Kearns, D. A. (1993). Experiments in very large-scale analog computation. Unpublished doctoral dissertation, California Institute of Technology.
Koch, C. (1997). Biophysics of computation: Information processing in single neurons. Unpublished manuscript.
Laughlin, S., de Ruyter van Steveninck, R., & Anderson, J. C. (1998a). The metabolic cost of neural information. Nature Neuroscience, 1.
Laughlin, S., Anderson, J. C., O'Carroll, D., & de Ruyter van Steveninck, R. (1998b). Coding efficiency and the metabolic cost of sensory and neural information. In R. Baddeley, P. Hancock, & P. Foldiak (Eds.), Information theory and the brain. Cambridge: Cambridge University Press.
Martin, D. A. (1996). ADAP: A mixed-signal array processor with early vision applications. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Mead, C. A. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Mead, C. A. (1990). Neuromorphic electronic systems. Proceedings of the IEEE, 78, 1629–1636.
Mead, C. A., & Conway, L. (1980). Introduction to VLSI systems. Reading, MA: Addison-Wesley.
Rabaey, J. (1996). Digital integrated circuits. Englewood Cliffs, NJ: Prentice Hall.
Salman, H., Soen, Y., & Braun, E. (1996). Voltage fluctuations and collective effects in ion-channel protein ensembles. Physical Review Letters, 77, 4458–4461.
Sarpeshkar, R. (1997). Efficient precise computation with noisy components: Extrapolating from an electronic cochlea to the brain. Unpublished doctoral dissertation, California Institute of Technology. Chapter 5: Section 6, Appendix A, Appendix B, and Appendix C; Postscript copy available on http://www.pcmp.caltech.edu/anaprose/rahul/thesis/.
Sarpeshkar, R., Delbrück, T., & Mead, C. (1993). White noise in MOS transistors and resistors. IEEE Circuits and Devices, 9, 23–29.
Sarpeshkar, R., Lyon, R. F., & Mead, C. A. (1998). A low-power wide-dynamic-range analog VLSI cochlea. Analog Integrated Circuits and Signal Processing, 16, 3; Postscript copy available on http://www.pcmp.caltech.edu/anaprose/rahul/cochlea/.
Shepherd, G. M. (1990). The synaptic organization of the brain. Oxford: Oxford University Press.
Vittoz, E. A. (1990). Future of analog in the VLSI environment. Proceedings of the International Symposium on Circuits and Systems, 2, 1347–1350.
Von Neumann, J. (1952). Probabilistic logics and the synthesis of reliable organisms from unreliable components. Lecture delivered at the California Institute of Technology, Pasadena, CA, January 4–15.
Weiss, T. F. (1996). Cellular biophysics (2 vols.). Cambridge, MA: MIT Press.
Zhang, S., & Trussell, L. O. (1994). A characterization of excitatory postsynaptic potentials in the avian nucleus magnocellularis. Journal of Neurophysiology, 72, 705–718.

Received February 3, 1997; accepted March 4, 1998.
NOTE
Communicated by Alain Destexhe
Employing the Z-Transform to Optimize the Calculation of the Synaptic Conductance of NMDA and Other Synaptic Channels in Network Simulations

J. Köhn and F. Wörgötter
Department of Neurophysiology, Ruhr-Universität, 44780 Bochum, Germany
Calculation of the total conductance change induced by multiple synapses at a given membrane compartment remains one of the most time-consuming processes in biophysically realistic neural network simulations. Here we show that this calculation can be achieved in a highly efficient way even for multiply converging synapses with different delays by means of the Z-transform. Using the example of an NMDA synapse, we show that every update of the total conductance is achieved by an iterative process requiring at most three multiplications, which together need only the history values from the two most recent iterations. A major advantage is that this small computational load is independent of the number of synapses simulated. A benchmark comparison to other techniques demonstrates superior performance of the Z-transform. Nonvoltage-dependent synaptic channels can be treated similarly (Olshausen, 1990; Brettle & Niebur, 1994), and the technique can also be generalized to other synaptic channels.
1 Introduction

In most biophysically realistic network simulations, the conductance changes at every synapse have to be updated for every simulated time step. The divergence in the connection structure thereby determines how many synapses actually exist. Usually, however, this number exceeds the number of neurons by far. Thus, synaptic conductance update is the most time-consuming computational process in almost all biophysically realistic network simulations. This process has to be repeated for every membrane compartment, which could be a small dendritic cylinder or the total cell, depending on the network model's level of complexity. The basic problem is how to compute most efficiently, for every postsynaptic membrane compartment, the total conductance $g_{\text{total}}$, which is given by the convolution of all input spike trains $s_i$ with the characteristic synaptic functions $g_i$ of the (possibly different) synapses at the current time $t_x$:

Neural Computation 10, 1639–1651 (1998) © 1998 Massachusetts Institute of Technology
In continuous time:
$$g_{\text{total}}(t_x) = \sum_{i=0}^{N} \omega_i \int_{0}^{t_x} s_i(\tau)\, g_i(t_x - \tau)\, d\tau, \qquad (1.1)$$
where $N$ is the total number of synapses and $s_i$ denotes the input spike train consisting of delta pulses at synapse $i$. The synaptic weight is given by $\omega_i$. In discrete time, the integral will be replaced by a second sum. The problem that hides beneath this equation is the fact that a rather long history of past incoming spikes has to be saved and used in order to calculate $g_{\text{total}}$ at the current time $t_x$. These "history values" have to be stored in an extended memory queue in order to make them available to the next iteration step, and every history value enters the calculation, leading to a rather high number of numerical operations. Several algorithms have been proposed to accelerate this process based on different basic synapse models (Srinivasan & Chiel, 1993; Bernard, Ge, Stockley, Willis, & Wheal, 1994; Lytton, 1996). Of primary relevance here are the most commonly used models, based on so-called α functions. These functions are described by
$$g(t) = \hat{g}\, \frac{t e}{\tau}\, e^{-t/\tau}, \qquad (1.2)$$
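To make this cost concrete, the following sketch (not part of the original article; all identifiers are illustrative) evaluates the discrete counterpart of equation 1.1 for one membrane compartment by scanning the whole history queue at every step; this is exactly the per-step expense that the Z-transform removes.

/* Sketch (not from the article): direct evaluation of the discrete
   counterpart of equation 1.1 for one compartment. Every update scans the
   entire spike-history queue of length H, so both storage and cost grow
   with H. All identifiers are illustrative. */
#include <stddef.h>

double g_total_direct(const double *s,  /* accumulated weighted spikes; s[H-1] is the newest entry */
                      const double *g,  /* sampled synaptic kernel; g[j] = g(j*T) */
                      size_t H,         /* history length kept in the queue */
                      double T)         /* integration time step */
{
    double acc = 0.0;
    for (size_t j = 0; j < H; ++j)      /* convolution: sum over all past inputs */
        acc += s[H - 1 - j] * g[j];
    return T * acc;                     /* discrete version of the integral */
}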
where $\hat{g}$ is the peak conductance and $\tau$ is the time constant (typically around 1 ms). They represent the characteristic synaptic function of a regular nonvoltage-dependent synapse, which can also be inhibitory. In 1990 Olshausen showed that the Z-transform (Doetsch, 1967; Oppenheim & Schafer, 1975) can be employed to accelerate tremendously the calculation of the convolution between exponential functions and impulse inputs. Synaptic computations at regular nonvoltage-dependent synapses are the immediately obvious application of this technique. In a first step, all synapses modeled by the same α function are combined, and all individual inputs $s_i$ are accumulated into one weighted input function $s$ for every simulated membrane compartment. We call this function the accumulated spike train:
$$s(t) = \sum_{i=0}^{N} \omega_i s_i(t). \qquad (1.3)$$
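Before the update rule itself is derived, here is a minimal sketch (not the authors' distributed code) of how such an accumulated spike train can be maintained with a circular buffer; as noted in section 3, a per-synapse delay can be folded in by booking each spike at the slot corresponding to its arrival time plus its delay. The buffer length and all identifiers are illustrative.

#define QLEN 1024   /* illustrative buffer length; must exceed the longest delay in steps */

typedef struct {
    double q[QLEN]; /* accumulated weighted input booked per time step */
    unsigned now;   /* index of the current time step */
} SpikeQueue;

/* Register a spike of weight w that becomes effective delay_steps steps from now. */
void enqueue_spike(SpikeQueue *sq, double w, unsigned delay_steps)
{
    sq->q[(sq->now + delay_steps) % QLEN] += w;
}

/* Fetch the accumulated input s(nT) for the current step, clear the slot, advance. */
double dequeue_accumulated(SpikeQueue *sq)
{
    double s = sq->q[sq->now];
    sq->q[sq->now] = 0.0;
    sq->now = (sq->now + 1) % QLEN;
    return s;
}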
Then the convolution is solved in discrete time $t_n = nT$ using the Z-transform, arriving at
$$g_{\text{total}}(nT) = T \hat{g} A\, e^{-T/\tau}\, s((n-1)T) + 2 e^{-T/\tau}\, g_{\text{total}}((n-1)T) - e^{-2T/\tau}\, g_{\text{total}}((n-2)T). \qquad (1.4)$$

This final form of the conductance update (Brettle & Niebur, 1994) shows that only two values of the history of $g_{\text{total}}$ need to be taken into account in order to arrive at the exact solution, which is spelled out for regular exponential functions in Olshausen (1990).¹ Here we will use the Z-transform to derive the solution for $g_{\text{total}}$ of the voltage-dependent NMDA channel, and we will show that the final result requires only two history values and a few multiplications. It should be noted that this article summarizes a known technique (Olshausen, 1990; Brettle & Niebur, 1994), extending it to NMDA channels and comparing it to other methods.
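As a concrete illustration (a sketch, not code from the article), equation 1.4 amounts to the following two-history recursion; A stands for the constant prefactor appearing in equation 1.4, and all identifiers are illustrative.

typedef struct {
    double g1, g2;  /* g_total((n-1)T) and g_total((n-2)T) */
    double s_prev;  /* accumulated weighted input s((n-1)T) */
    double d;       /* e^{-T/tau}, precomputed once */
    double k;       /* T * g_hat * A * e^{-T/tau}, precomputed once */
} AlphaSynapse;

/* One iteration of equation 1.4: returns g_total(nT). */
double alpha_step(AlphaSynapse *syn, double s_now)
{
    double g = syn->k * syn->s_prev
             + 2.0 * syn->d * syn->g1
             - syn->d * syn->d * syn->g2;  /* e^{-2T/tau} = d*d */
    syn->g2 = syn->g1;                     /* shift the two-value history */
    syn->g1 = g;
    syn->s_prev = s_now;                   /* s(nT) takes effect next step */
    return g;
}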
2 The Algorithm

The NMDA channel is characterized by a nonlinear current-voltage relationship in which the actual conductance $g$ depends on the currently existing membrane potential $V_m$. It is modeled by
$$g(t) = \hat{g}\, \frac{e^{-t/\tau_1} - e^{-t/\tau_2}}{1 + \eta\,[\mathrm{Mg}^{2+}]\, e^{-\gamma V_m}}, \qquad (2.1)$$
where $\hat{g}$ is the peak conductance of the NMDA channel, $\tau_1$ and $\tau_2$ are the first and second time constants of the NMDA channel ($\tau_1 = 40$ ms, $\tau_2 = 0.33$ ms), $[\mathrm{Mg}^{2+}]$ is the magnesium concentration (mM), $\eta$ is 0.33 mM$^{-1}$, $\gamma$ is 0.06/mV, and $V_m$ is the membrane potential in mV. We will not discuss the validity of this equation, which has been used by several authors (Jahr & Stevens, 1990a, 1990b; Mel, 1992a) to model the NMDA channel. Instead, we assume that it represents an NMDA channel with sufficient accuracy. This equation is split into two parts that can be treated independently, as shown by Bernard et al. (1994), because $c$ does not depend on time but only on the membrane potential:
$$g(t) = c \cdot y(t) \qquad (2.2)$$
$$y(t) = \hat{g}\left(e^{-t/\tau_1} - e^{-t/\tau_2}\right) \qquad (2.3)$$
$$c = \left(1 + \eta\,[\mathrm{Mg}^{2+}]\, e^{-\gamma V_m}\right)^{-1}. \qquad (2.4)$$
This method was also used by Bernard et al. (1994); experimental results show that the rise time is independent of the Mg2+ concentration (Lester & Jahr, 1992; Stern, Edwards, & Sakmann, 1992).

¹ To our knowledge, the derivation for general exponential functions is explicitly spelled out only in this technical report. The report may be hard to obtain, but it should be easy to derive equation 1.4 in a way similar to the results shown here.
Equation 2.3 has to be Z-transformed in discrete time $t_n = nT$:
$$Y(z) = \mathcal{Z}\{y(nT)\} = \mathcal{Z}\left\{\hat{g}\left(e^{-nT/\tau_1} - e^{-nT/\tau_2}\right)\right\}, \qquad (2.5)$$
with
$$\mathcal{Z}\left\{e^{-nT/\tau_1}\right\} = \frac{1}{1 - e^{-T/\tau_1}\, z^{-1}}; \qquad \mathcal{Z}\left\{e^{-nT/\tau_2}\right\} = \frac{1}{1 - e^{-T/\tau_2}\, z^{-1}}. \qquad (2.6)$$
These terms can be added in the Z-domain just as in the time domain:
$$Y(z) = \hat{g}\, \frac{a_1 z^{-1} - a_2 z^{-1}}{(1 - a_1 z^{-1})(1 - a_2 z^{-1})}, \qquad (2.7)$$
where we have set $a_1 = e^{-T/\tau_1}$ and $a_2 = e^{-T/\tau_2}$. A convolution of the weighted input function $s(nT)$ with the numerator of the conductance function $y(nT)$ is equivalent to a multiplication of both functions in the Z-domain:
$$h(nT) = y(nT) \star s(nT) \iff H(z) = Y(z) \cdot S(z). \qquad (2.8)$$
After some conversions we arrive at
$$H(z) = \hat{g}\,(a_1 - a_2)\, S(z)\, z^{-1} + (a_1 + a_2)\, H(z)\, z^{-1} - a_1 a_2\, H(z)\, z^{-2}. \qquad (2.9)$$
To get the inverse transform, we use the following feature of the $\mathcal{Z}^{-1}$ transform,
$$\mathcal{Z}^{-1}\left\{F(z) \cdot z^{-j}\right\} = f[(n-j)T], \qquad (2.10)$$
and get as the final solution
$$h[nT] = \hat{g}\,(a_1 - a_2)\, s[(n-1)T] + (a_1 + a_2)\, h[(n-1)T] - a_1 a_2\, h[(n-2)T], \qquad (2.11)$$
where $s[(n-1)T]$ is the summed total input at time step $(n-1)T$,
$$s[(n-1)T] = \sum_i \omega_i s_i[(n-1)T]. \qquad (2.12)$$
It should be noted that $s[nT]$ does not contribute to $h[nT]$, because spikes arriving at the moment $nT$ will affect the conductance only afterward. $h[(n-1)T]$ and $h[(n-2)T]$ represent the last and the last-but-one iteration of the function $h$, which are now reutilized to compute the actual value of $h[nT]$. Thus, very little computational effort is required to perform the ongoing iteration of $h$, and the final conductance is computed as
$$g_{\text{total}}[nT] = c \cdot h[nT]. \qquad (2.13)$$
This shows that the term $c$ does not enter into the iterations. The validity of this has been demonstrated by Bernard et al. (1994). It should be noted that the final result is not an approximation; it is exact in the sense that the Z-transform does not alter the accuracy of the computation of the convolution in discrete time. Figures 1A and 1B show the behavior of such an NMDA synapse for a single input spike in comparison to a non-NMDA synapse modeled with equations 1.2 and 1.4. In Figures 1C and 1D, the response of the NMDA synapse to a 100 Hz spike train $s$ is depicted. The resulting curves are in accordance with those found in other simulation studies (Mel, 1992b; Trävén et al., 1993). Curves determined by a conventional calculation of the convolution with single-precision floating-point variables are identical up to a numerical accuracy of $10^{-8}$ if we assume memory queue sizes of about 800 values in the conventional calculation (iteration time step 0.05 ms). Note that in order to cover the complete response for high-frequency input spike trains, the required memory queue size is particularly large for the slow NMDA synapse. Thus, the increase in computational speed and the reduction of memory queue size in this case exceed two orders of magnitude. Both estimates obviously depend on the accuracy (e.g., the size of the time step) required for the conventional calculation.
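For concreteness, here is a sketch (not part of the original article) of the complete NMDA update of equations 2.11 through 2.13. The structure follows the equations above; all identifiers are illustrative, and the coefficients on the history values could be precomputed to reach the stated three multiplications per step.

#include <math.h>

typedef struct {
    double h1, h2;   /* h[(n-1)T] and h[(n-2)T] */
    double s_prev;   /* summed weighted input s[(n-1)T], cf. equation 2.12 */
    double a1, a2;   /* exp(-T/tau1) and exp(-T/tau2), precomputed once */
    double g_hat;    /* peak conductance */
    double eta;      /* 0.33 /mM */
    double mg;       /* magnesium concentration [Mg2+] in mM */
    double gamma;    /* 0.06 /mV */
} NmdaSynapse;

/* One iteration: equation 2.11 updates h; equations 2.4 and 2.13 apply the
   voltage-dependent factor c outside the iteration, so it never enters the
   history. */
double nmda_step(NmdaSynapse *n, double s_now, double Vm)
{
    double h = n->g_hat * (n->a1 - n->a2) * n->s_prev
             + (n->a1 + n->a2) * n->h1
             - n->a1 * n->a2 * n->h2;
    n->h2 = n->h1;
    n->h1 = h;
    n->s_prev = s_now;

    double c = 1.0 / (1.0 + n->eta * n->mg * exp(-n->gamma * Vm));
    return c * h;   /* g_total[nT], equation 2.13 */
}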
Figure 1: (A) Time course of the conductance of a simulated NMDA channel and a simulated non-NMDA channel for one incoming spike. (B) Corresponding membrane potentials. (C) Time course of the conductance of an NMDA channel during input of a 100 Hz spike train. (D) Corresponding membrane potential.

3 Benchmark Comparison to Other Techniques

Several techniques have been described in order to speed up synaptic conductance calculation. Srinivasan and Chiel (1993) showed how multiple α functions could be consolidated by representing their summation in an iterated closed form. Lytton (1996) used a different type of synapse based on the work of Destexhe, Mainen, and Sejnowski (1994a, 1994b), which can be seen as a concave rising exponential ($1 - e^{-t}$) pieced together with one that is convex and falling ($e^{-t}$). The major advantage of the latter approach lies in the fact that actual transmitter application durations can be included in this model. Destexhe et al.'s set of equations is even more compact than that of Srinivasan and Chiel, and they can, with some additional effort, combine several synapses into one closed algorithmic form. Bernard et al. (1994) finally found a simplified description of the conductance changes occurring at an NMDA channel based on differential equations, but their algorithm is much more complicated than the one proposed here. Therefore, we have compared our approach with that of Srinivasan and Chiel as well as with that of Lytton, but not with that of Bernard et al. The three flow diagrams in Figure 2 show the algorithmic complexity of
the approaches when modeling a single synapse. The Z-transform has the simplest and Lytton's algorithm the most complex structure. Since spikes are rare events, the thin pathways are seldom followed. The thick arrows, on the other hand, represent the passive decay case. Therefore, these command sequences occur much more often and are mainly responsible for the computational time needed. Table 1 compares how many components and operations are required to perform the calculations if a single synapse is modeled. All techniques are fairly similar, and the complexity of Lytton's algorithm pays off by reducing the number of operations to a minimum. The similarity of the three techniques is also reflected in the actual benchmark runs (see Figure 3 and, for the simulation parameters, Table 2). We simulated a single synapse with different input spike frequencies between 5 and 500 Hz. A total of 50 million time steps was simulated, each with a duration of 0.05 ms, leading to a total simulated time of 2500 seconds.
Figure 2: Flow diagrams of the different algorithms for an AMPA synapse (see equation 1.2). (A) The algorithm of Srinivasan and Chiel. (B) The Z-transform. (C) Lytton’s algorithm.
All simulations were run on a SUN SPARC 5 client machine in multiuser mode. No disk access was required during the simulations. We did not use any unusual computer operation mode (like "single user mode") in order to make sure that our benchmarks will also apply to an everyday laboratory situation.
Table 1: Resources Needed for the Different Techniques in the Case of a Single Synapse.

                        Srinivasan and Chiel    Lytton    Z-Transform
Memory
  Flags                 0                       1         0
  Variables             2                       3         2
  Constants             2                       4         3
Operations
  Spike                 4* 4+                   1* 2+     2* 2+
  No spike              4* 2+                   1*        2* 1+

Note: * = multiplication; + = addition, as shown in the flow diagrams in Figure 2.
Table 2: Parameters Used for the Benchmark Simulations.

                Srinivasan and Chiel    Lytton       Z-Transform
Non-NMDA
  τ1            1 ms                    —            1 ms
  α0            —                       600 s⁻¹      —
  β             —                       550 s⁻¹      —
  Cdur          —                       1 ms         —
NMDA
  τ1            40 ms                   —            40 ms
  τ2            0.33 ms                 —            0.33 ms

Note: The difference between Lytton's technique and the other techniques means that the parameters are not immediately comparable.
This can, however, lead to tiny differences when running the same simulation twice, due to the differing state of activity of the system's background daemons. In the diagrams we plot the CPU time required for one simulated time step. As expected from the flow diagrams, the computational effort remains constant for Srinivasan and Chiel and also for the Z-transform. Lytton's technique shows a linear increase in computational time. It is the fastest up to an input frequency of about 210 Hz, which looks very promising because few neurons fire with such a high frequency for an extended period of time. The problem of computational efficiency becomes truly relevant only in the context of simulating the convergence of multiple inputs onto a single cell (or compartment). This is the case with which Lytton and we are mainly concerned. Thus, the main part of Lytton's algorithm is the optimization of the calculation for many synapses with a shared time constant at a given neuron. He was able to simplify the update to two variables only, and he had to maintain only a single two-valued queue of spike arrival times and synaptic states (instead of N queues in the case of the regular convolution).
Figure 3: Benchmark comparison of the three techniques for a single AMPA and a single NMDA synapse. Lytton’s technique (AMPA only) is the fastest for the single synapse case and reasonable input firing frequencies, but see Figure 4.
In addition, he required a few bookkeeping commands in his algorithm to keep track of the synaptic state changes. Thus, the algorithmic complexity increases in Lytton's algorithm. On the other hand, when using the Z-transform, only a single one-valued queue of the spike arrival times has to be maintained when more than one synapse is simulated. The linearity of the transform allows one to define an accumulated spike train (see equation 1.3), which does not affect the computational effort regardless of the number of synapses (the value of $N$ in equation 1.3). This also holds for synapses with different delays, because a synaptic delay $\Delta t$ can be implicitly included in the accumulated spike train by reassigning the time variable for this synapse to $t_{\text{resultant}} = t_{\text{arrival}} + t_{\text{delay}}$. Consequently, although Lytton found quite an ingenious way to do the bookkeeping and the update, it is rather obvious that the Z-transform will be faster, since it remains basically unchanged² (see Figure 4).

² Obviously both algorithms require a queue for the inputs. Thus, the Z-transform will be slightly slower. But apart from this, the Z-transform does not need any other algorithmic addition, as opposed to Lytton's algorithm, which requires bookkeeping.

It can be seen in Figure 4A that the Z-transform is significantly faster than Lytton's algorithm when simulating 50 synapses. The difference between the curves, however, is largely independent of the number of simulated synapses (see Figure
4B). As expected, it is due only to the additionally introduced bookkeeping in Lytton's algorithm and produces a constant computational overhead. This shifts the curves with respect to each other such that they do not cross anymore (compare Figure 3, where the bookkeeping was not implemented because only a single synapse was simulated). Note that in (probably) all real simulation problems, Lytton's algorithm can never be implemented without the bookkeeping, because a convergence of more than one synapse onto any compartment must always be allowed. Thus, the performance advantage for the single-synapse case (see Figure 3) may actually never be attainable in a real simulation problem where a situation similar to Figure 4 applies. The additional small linear increase in Lytton's curve results from the total increase in incoming spikes, since all simulated synapses receive 100 Hz input (compare again Figure 3). Srinivasan and Chiel did not optimize their algorithm for more than one synapse. Therefore, their algorithm is much slower than the other two (not shown).

4 Discussion

Based on the technical report of Olshausen (1990), Brettle and Niebur (1994) made use of the Z-transform in order to calculate synaptic conductance more efficiently, but in their article, they did not give a full account of this technique. It may be for this reason that this technique is still not widely appreciated. Therefore, we have tried to give an account of how the Z-transform can be used even in the context of the more complicated NMDA synapse in order to arrive at a rather simple iterative procedure that reduces the computational complexity and improves speed by at least two orders of magnitude as compared to a straightforward calculation of the convolutions. The comparison of the Z-transform with other algorithms shows that its algorithmic complexity is minimal and that its performance exceeds that of the other techniques significantly as soon as multiple inputs are considered. Lytton's approach, on the other hand, has a more solid physiological foundation, because actual transmitter application durations can also be implemented in his model. It should be pointed out that, as opposed to many other approaches, the Z-transform can be applied to more than one type of synapse. The solutions for a regular non-NMDA synapse and the NMDA synapse were given here, but it is easy to see that, for example, the equation that describes the calcium-activated current through the respective synapse in the mollusc Tritonia and other species can also be reduced by means of the Z-transform. The original equation is given by Getting (1989):
$$g_c(t) = \frac{\tau_d}{\tau_d - \tau_o}\left(e^{-t/\tau_d} - e^{-t/\tau_o}\right), \qquad (4.1)$$
and the Z-transform follows the same steps as above.
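Concretely (our own working, following exactly the steps of section 2 and not spelled out in the original article), setting $a_d = e^{-T/\tau_d}$ and $a_o = e^{-T/\tau_o}$ and comparing with equation 2.11, the corresponding update reads
$$g_c[nT] = \frac{\tau_d}{\tau_d - \tau_o}\,(a_d - a_o)\, s[(n-1)T] + (a_d + a_o)\, g_c[(n-1)T] - a_d a_o\, g_c[(n-2)T].$$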
Figure 4: Benchmark for more than one synapse comparing the algorithm of Lytton with the Z-transform. (A) Number of synapses fixed at 50 and variable average input frequency. (B) Input frequency fixed at 100 Hz and variable number of synapses.
It is immediately obvious that the Z-transform can be applied to all synaptic functions that consist of multiplicative or additive combinations of exponential terms. The largest gain in computational efficiency of the Z-transform is achieved when many synapses converge onto a single compartment. But even in a cell model with a high degree of compartmentation, the Z-transform may be preferable, because any applicable version of Lytton's algorithm will
always require the bookkeeping procedure, which reduces its computational efficiency to below that of the Z-transform in almost all situations. Performance differences in a model with a high degree of compartmentation may be marginal, though. The actual choice of an algorithm for modeling a synapse should therefore be guided by several factors:

1. Do I need highly realistic synapses plus multiple voltage-dependent channels? Then probably none of the discussed algorithms can be used, and the differential equations need to be implemented explicitly.

2. Can I live with less realism and use a fast pooling approach (little compartmentation) but still with an explicitly modeled transmitter release? Then use Lytton's algorithm.

3. Do I need maximal speed and a high degree of pooling? Then use the Z-transform.

In particular, when modeling a large or very large network, the performance gain factor of about 2 between the Z-transform and Lytton's algorithm will certainly be beneficial, because individual simulations often take more than 10 hours. In some special cases, even more abstract connections can be defined, for example, by following the approach of Amit and Tsodyks (1991), which models the membrane potential in a nonstandard way and is basically confined to rates and currents. Due to its intrinsic assumptions, however, this approach cannot be directly compared with the other algorithms studied here.

Acknowledgments

We acknowledge the support of the Deutsche Forschungsgemeinschaft and the Human Frontier Science Program RG0035/1997-B. C functions for all three techniques for AMPA and NMDA synapses can be obtained via ftp from ftp://www.neurop.ruhr-uni-bochum.de/pub/z-trans/.

References

Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates: I. Substrate-spikes, rates and neuronal gain. Network, 2, 259–273.
Bernard, C., Ge, Y. C., Stockley, E., Willis, J. B., & Wheal, H. V. (1994). Synaptic integration of NMDA and non-NMDA receptors in large neuronal network models solved by means of differential equations. Biol. Cybernet., 70, 267–273.
Brettle, D., & Niebur, E. (1994). Detailed parallel simulation of a biological neuronal network. IEEE Comp. Sci. Eng., 1, 31–43.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994a). An efficient method for
computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp., 6, 14–16.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994b). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic mechanism. J. Comp. Neurosci., 1, 195–230.
Doetsch, G. (1967). Anleitung zum praktischen Gebrauch der Laplace-Transformation und der Z-Transformation. Munich: R. Oldenbourg Verlag.
Getting, P. (1989). Reconstruction of small neural networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 171–195). Cambridge, MA: MIT Press.
Jahr, C. E., & Stevens, C. F. (1990a). A quantitative description of NMDA receptor-channel kinetic behavior. J. Neurosci., 10, 1830–1837.
Jahr, C. E., & Stevens, C. F. (1990b). Voltage dependence of NMDA-activated macroscopic conductances predicted by single-channel kinetics. J. Neurosci., 10, 3176–3182.
Lester, R. A., & Jahr, C. E. (1992). NMDA behavior depends on agonist affinity. J. Neurosci., 12, 635–643.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comp., 8, 501–509.
Mel, B. (1992a). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp., 4, 502–517.
Mel, B. (1992b). Information processing in an excitable dendritic tree (CNS Memo No. 17). Pasadena, CA: California Institute of Technology.
Olshausen, B. (1990). Discrete-time difference equations for simulating convolutions (Tech. Memo). Pasadena, CA: California Institute of Technology.
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. London: Prentice Hall International.
Srinivasan, R., & Chiel, H. J. (1993). Fast calculation of synaptic conductances. Neural Comp., 5, 200–204.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. J. Physiol. (Lond.), 428, 707–722.
Trävén, H. G. C., Brodin, L., Lansner, A., Ekeberg, Ö., Wallén, P., & Grillner, S. (1993). Computer simulations of NMDA and non-NMDA receptor-mediated synaptic drive: Sensory and supraspinal modulation of neurons and small networks. J. Neurophysiol., 70, 695–709.

Received June 6, 1997; accepted March 2, 1998.
LETTER
Communicated by John Lisman
Site-Selective Autophosphorylation of Ca2+/Calmodulin-Dependent Protein Kinase II as a Synaptic Encoding Mechanism

C. J. Coomber
School of Computing and Mathematics, Deakin University, Geelong, Victoria, Australia
A detailed kinetic model of the Ca2+/calmodulin-dependent protein kinase II (CaMKII) is presented in which subunits undergo autophosphorylation at several sites in a manner that depends on the frequency and duration of Ca2+ spikes. It is shown that high-frequency stimulation causes autophosphorylation of the autonomy site (Thr286) and promotes persistent catalytic activity. On the other hand, low-frequency stimulation is shown to cause autophosphorylation of an inhibitory site (Thr306), which prevents subunit activation. This site-selective autophosphorylation provides the basis for a molecular switch. When activated by a strong stimulus, the switch remains on for many minutes, even in the presence of a CaMKII-specific phosphatase. However, prolonged low-frequency stimulation disables the switch and influences the response to subsequent stimulation. It is conceivable that a regulatory mechanism such as this may permit CaMKII to mediate synaptic frequency encoding and thereby direct an appropriate change in synaptic efficacy. It is indicated how the behavior of the model may relate to the induction of long-term potentiation.

Neural Computation 10, 1653–1678 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

The Ca2+/calmodulin-dependent protein kinase II (CaMKII) is concentrated in forebrain neurons and is particularly abundant in postsynaptic densities, where it comprises 20% to 50% of all proteins (Colbran & Soderling, 1990). CaMKII moderates the effects of fluctuations in intracellular [Ca2+] coordinated with synaptic activity by phosphorylating membrane-bound proteins (McGlade-McCulloch, Yamamoto, Tan, Brickey, & Soderling, 1993), regulating gene expression (Antoine, Gaiddon, & Loeffler, 1996), and controlling synaptic transmission (Benfenati et al., 1992). The induction of long-term potentiation (LTP), which plays a part in spatial memory (Silva, Stevens, Tonegawa, & Wang, 1992), may require persistent activation of CaMKII (Malinow, Madison, & Tsien, 1988; Otmakhov, Griffith, & Lisman, 1997). CaMKII undergoes autophosphorylation at multiple sites following exposure to Ca2+/calmodulin (Ca2+/CaM). There is a one-to-one correspondence
between Ca2+/CaM binding and subunit activation (Ikeda, Okuno, & Fujisawa, 1991). Hanson, Meyer, Stryer, and Schulman (1994) show that a subunit activated by exposure to Ca2+/CaM is autophosphorylated on Thr286 in an intersubunit reaction, thereby trapping the bound Ca2+/CaM for several seconds. The probability of a subunit's becoming autophosphorylated on Thr286 increases with the fraction of autonomous subunits in the oligomeric enzyme, and therefore CaMKII activation exhibits the features of a cooperative process. Other studies indicate that catalytic activity is inhibited when certain sites are autophosphorylated, namely, Thr305 and Thr306 in the α subunit. In the nonautonomous enzyme, autophosphorylation at Thr306 occurs as a slow intrasubunit reaction (Colbran, 1993). However, following conversion of the enzyme to a Ca2+/CaM-independent form, autophosphorylation of the same site may occur as an intrasubunit (Hanson et al., 1994) and intersubunit reaction (Mukherji & Soderling, 1994). Other sites are also autophosphorylated, such as Thr253 in the postsynaptic density-associated CaMKII (Dosemeci, Gollop, & Jaffe, 1994), the regulatory role of which is unknown, and Ser314 in the α subunit, which decreases CaM binding and is resistant to phosphatase activity. Autophosphorylation of CaMKII has attracted interest for its possible role in synaptic information storage (Lisman, 1985, 1994). In particular, autophosphorylation may enable the kinase to act as a molecular switch (Miller & Kennedy, 1986). The form of the switch is still an open question. For example, Lisman and Goldring (1988) propose a simple on-off switch in which the on state is relatively permanent. More recent work (Matsushita, Moriyama, & Fukai, 1995) proposes a dynamic switch in which the on state may be transient (minutes) or protracted (hours), depending on the level of phosphatase activity. Other work treats CaMKII as a Ca2+ frequency detector (Hanson et al., 1994), and it was recently suggested that if certain sites are autophosphorylated preferentially under different forms of synaptic activity, the permutation of autophosphorylated sites might record the interval between Ca2+ transients (Dosemeci & Albers, 1996). In this article, we investigate a detailed model of CaMKII that is based on experimental findings. The model draws together concepts from earlier modeling studies (Hanson et al., 1994; Matsushita et al., 1995; Dosemeci & Albers, 1996; Coomber, in press). Specifically, we consolidate findings concerning CaM trapping, persistent Ca2+-independent autophosphorylation, and site-selective autophosphorylation, and we explore the unified model for possible roles in synaptic information storage. Further, we demonstrate how the kinase may act as a dynamic switch that depends on the level of local phosphatase activity. Subsequent use of the word switch therefore refers to protracted kinase activity that may last for minutes or hours, and not necessarily to the binary switch described in earlier studies.
2 Materials and Methods

2.1 Model Design. The following model is proposed for regulation of CaMKII:

Reaction 1: $S_0 + C \underset{k_2}{\overset{k_1}{\rightleftharpoons}} S_0C$

Reaction 2: $S_0C{-}S_0C \xrightarrow{k_3,\,A} S_0C{-}S_1C$

Reaction 3: $S_1C \underset{k_5}{\overset{k_4}{\rightleftharpoons}} S_1 + C$

Reaction 4: $S_1 \xrightarrow{k_6,\,A} S_2 \xrightarrow{k_7,\,A} S_3$

Reaction 5: $S^*{-}S_0 \xrightarrow{k_8,\,A} S^*{-}S_x$

Reaction 6: $S^*{-}S_0C \xrightarrow{k_9,\,A} S^*{-}S_1C$

Reaction 7: $S^*{-}S_i \xrightarrow{k_{10},\,A} S^*{-}S_{i+1}, \quad i = 1, 2$

Reaction 8: $S_1C + PP \underset{k_{12}}{\overset{k_{11}}{\rightleftharpoons}} S_1CPP \xrightarrow{k_{13}} S_0C + PP$

Reaction 9: $S_i + PP \underset{k_{15}}{\overset{k_{14}}{\rightleftharpoons}} S_iPP \xrightarrow{k_{16}} S_{i-1} + PP, \quad i = 1, 2, 3$

Reaction 10: $S_x + PP \underset{k_{18}}{\overset{k_{17}}{\rightleftharpoons}} S_xPP \xrightarrow{k_{19}} S_0 + PP$
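As an illustration (not part of the original article) of how rate equations follow from such a scheme, the law of mass action applied to Reaction 1 in isolation gives
$$\frac{d[S_0C]}{dt} = k_1\,[S_0][C] - k_2\,[S_0C],$$
with matching loss and gain terms in $d[S_0]/dt$ and $d[C]/dt$; in the full model, described below, such terms are written per enzyme configuration rather than per free subunit.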
It is assumed that subunits are arranged in a circular structure (Kanaseki, Ikeuchi, Sugiura, & Yamauchi, 1991), thereby permitting nearest-neighbor interactions, and that each subunit possesses three autophosphorylation sites: Thr286, Thr305, and Thr306. These are the principal sites involved in regulating catalytic activity in the α subunit, which predominates in forebrain neurons. Although other sites, such as Ser314, are autophosphorylated, they do not seem to play a critical role in enzyme regulation. The ten reactions describe cofactor binding, intrasubunit and intersubunit autophosphorylation, and dephosphorylation by a CaMKII-specific phosphatase, such as phosphatase 1. Reaction 1 describes the binding of Ca2+/CaM, designated as $C$, to a nonautonomous subunit $S_0$. The resultant complex $S_0C$ is catalytically active and may phosphorylate not only substrate proteins but also Thr286 of a similarly active neighboring subunit, as in Reaction 2. This reaction involves transfer of a phosphate group from ATP, shown as $A$, to Thr286, giving $S_1C$. This autophosphorylated complex has been shown to exist for some seconds (Hanson & Schulman, 1992; Hanson et al., 1994; Meyer, Hanson, Stryer, & Schulman, 1992) before
dissociating into Ca2+/CaM and $S_1$ (Reaction 3). At this point, the activated subunit $S_1$ may still bind Ca2+/CaM or proceed through a series of intrasubunit autophosphorylation steps at Thr305 and Thr306 (Hanson et al., 1994), as shown in Reaction 4. The order in which these events occur is not considered important. A catalytically active subunit $S^*$ (one of $S_0C$, $S_1C$, $S_1$, $S_2$, or $S_3$) may also autophosphorylate its neighbor on Thr306 (Mukherji & Soderling, 1994). This is shown by Reaction 5, in which $S_x$ represents a subunit autophosphorylated only at Thr306. Such a subunit is prevented from binding Ca2+/CaM, and its catalytic activity is therefore inhibited. Reaction 6 extends Reaction 2 by permitting any catalytically active subunit to autophosphorylate $S_0C$. Furthermore, the possibility of intersubunit autophosphorylation of catalytically active subunits at Thr305 and Thr306 is allowed by Reaction 7. The final three reactions (8–10) describe dephosphorylation of sites by a CaMKII-specific phosphatase $PP$. Moreover, according to Reaction 9, a subunit can be deactivated, that is, dephosphorylated at Thr286, only after Thr305 and Thr306 have been dephosphorylated. In other words, when either of the inhibitory sites is phosphorylated, the autonomy site Thr286 is protected from dephosphorylation. Note that in this study we have assumed, unless otherwise stated, that a given subunit may autophosphorylate only one of its nearest neighbors (clockwise). In the multimeric enzyme, each subunit or pair of neighboring subunits may participate in one or more of the reactions. Therefore, a particular subunit in an enzyme of 8 to 10 subunits may exist in one of 12 possible states: $S_0$, $S_0C$, $S_1C$, $S_1$, $S_2$, $S_3$, $S_1CPP$, $S_1PP$, $S_2PP$, $S_3PP$, $S_x$, or $S_xPP$. In other words, there are $12^8$ to $12^{10}$ possible enzyme configurations. As individual subunits change state by reacting with Ca2+/CaM, adenosine triphosphate (ATP), phosphatase 1, and each other, new configurations result. This process is quite easily represented by a type of sequential ligand-binding model (Kirtley & Koshland, 1967), from which the same number of differential equations as configurations is derived using the law of mass action. However, because of the large number of configurations in the proposed model, a challenging combinatorial problem is presented when implementing such a model on a computer. To reduce the overall number of differential equations, the number of subunits in the multimeric enzyme has been limited to four. In addition, configurations that are rotations of each other (that is, they have equivalent subunit arrangements) are grouped together to form a collective configuration. Statistical factors are then multiplied by certain rate constants to indicate the number of equivalent ways in which a configuration may change as it participates in a reaction. The first few reactions of the reduced system are shown in Figure 1. Initially, all subunits in the enzyme are assumed to be inactive (designated as configuration $E_0$). If any one of the four inactive subunits binds Ca2+/CaM, an identical configuration ($E_1$) results (since rotations are isomorphic). Therefore, four equivalent ways exist for binding Ca2+/CaM to the multimeric enzyme, and hence this number multiplies the forward rate constant.
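A small sketch (not part of the original article) of the bookkeeping that this grouping implies: each four-subunit arrangement is reduced to a canonical representative under rotation, so that rotationally equivalent configurations are stored and counted only once. State codes and all names are illustrative.

#define NSUB 4  /* subunits per enzyme in the reduced model */

/* Return the rotation offset (0..NSUB-1) whose reading of the ring is
   lexicographically smallest; configurations with the same canonical
   reading are rotations of one another and form one collective
   configuration. */
int canonical_rotation(const int state[NSUB])
{
    int best = 0;
    for (int r = 1; r < NSUB; ++r) {
        for (int i = 0; i < NSUB; ++i) {
            int a = state[(r + i) % NSUB];
            int b = state[(best + i) % NSUB];
            if (a != b) {
                if (a < b) best = r;
                break;
            }
        }
    }
    return best;
}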
Figure 1: A fragment of the sequential ligand-binding model of CaMKII. A CaMKII oligomer is assumed to consist of four subunits arranged in a circular structure. A subunit has a Ca2+/CaM binding site and three autophosphorylation sites (Thr286, Thr305, and Thr306). The autonomy site (Thr286) is autophosphorylated in an intersubunit reaction. The other sites may be autophosphorylated in both inter- and intrasubunit reactions. All autophosphorylation sites are dephosphorylated by a CaMKII-specific phosphatase. See the text for subunit state symbols.
Similarly, there is one way for Ca2+/CaM to dissociate from $E_1$, which returns the enzyme to $E_0$. The same reasoning is applied to develop a complete enzymatic system based on the above equations.

2.2 Rate Constant Estimation. Rate constants have, where possible, been taken from recent studies on CaMKII. Where doubt may exist, ranges of values have been chosen based on contemporary enzymatic studies (Fersht, 1985). The forward and backward rate constants associated with the binding of Ca2+ to CaM are taken to be $5.0 \times 10^7$ /M/s and 500.0 /s, respectively (Zador, Koch, & Brown, 1990). In Reaction 1, we assume $k_1$ is between $1.0 \times 10^6$ /M/s and $1.0 \times 10^7$ /M/s, and that $k_2$ is between 100.0 /s and 500.0 /s, since inactive CaMKII has a weak affinity for Ca2+/CaM. We assume $k_3$ and $k_9$, the rates of intersubunit autophosphorylation at Thr286, are in the range $5.0 \times 10^3$ to $1.0 \times 10^4$ /M/s. The dissociation rate of trapped Ca2+/CaM, $k_4$, is assumed to be in the range $10^{-4}$ to 1.0 /s, although recent reports suggest a slower rate (Putkey & Waxham, 1996). We assume $k_5$, the reassociation rate of Ca2+/CaM, is equal to $k_1$. In Reactions 4 and 7, we assume the rates of intra- and intersubunit autophosphorylation ($k_6$, $k_7$, and $k_{10}$) at Thr305 and Thr306 of an autonomous subunit are equal and lie in the range $1.0 \times 10^3$ to $5.0 \times 10^3$ /M/s. These residues are autophosphorylated more slowly than Thr286 (Hanson et al., 1994). In Reaction 5, autophosphorylation at Thr306
of a nonautonomous subunit is in the range $1.0 \times 10^3$ to $5.0 \times 10^3$ /M/s. Since CaMKII-specific phosphatases, such as phosphatase 1, are potent opposers of phosphorylation, we have allowed phosphatase binding to occur rapidly. That is, the rates $k_{11}$, $k_{14}$, and $k_{17}$ are assumed greater than $1.0 \times 10^8$ /M/s. In addition, the dissociation rates $k_{12}$, $k_{15}$, and $k_{18}$ are assumed equal and approximately 50.0 /s. Similarly, $k_{13}$, $k_{16}$, and $k_{19}$ are assumed equal and approximately 1.0 /s, since the CaMKII-specific phosphatase binding is known to be strong. The basal concentration of Ca2+ is around 10.0 nM in dendritic spines (Koch & Zador, 1993); however, we have assumed a reference concentration of zero for convenience of interpretation. The concentration of free CaM is assumed to be 0.1 mM (Molloy & Kennedy, 1991), and the average concentration of ATP is 1.0 mM (Fersht, 1985), which is much greater than the concentrations of CaMKII and the CaMKII-specific phosphatase. Moreover, we assume that the concentration of ATP is constant; its time evolution is therefore omitted from the model equations. The concentration of CaMKII is assumed to be between 0.1 µM and 0.5 µM, and its ratio to a CaMKII-specific phosphatase is between 0.1 and 1.0. These concentrations concur with related modeling work (Matsushita et al., 1995). Although care has been taken in constraining the choice of rate constants, some uncertainty still remains in particular cases. Therefore, the reader should take care in interpreting some quantitative results.
2.3 Numerical Computation. The system of differential equations modeling CaMKII was generated automatically by software, which may be obtained from the author. Numerical integration was performed using either the fourth-order Runge-Kutta or the fourth- or fifth-order Runge-Kutta-Fehlberg method with an integration time step of 10.0 µs. Simulations were run on either a SUN ULTRA II or a GATEWAY Pentium Pro 200 computer system.
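For illustration only (the article's equation system was generated automatically by the software mentioned above), a fixed-step fourth-order Runge-Kutta update for a generic system of NEQ concentration variables might look as follows; NEQ and all identifiers are placeholders.

#define NEQ 64  /* placeholder system size */

typedef void (*Deriv)(double t, const double y[NEQ], double dydt[NEQ]);

/* One fourth-order Runge-Kutta step for dy/dt = f(t, y);
   dt = 1.0e-5 corresponds to the 10.0-microsecond step stated in the text. */
void rk4_step(Deriv f, double t, double dt, double y[NEQ])
{
    double k1[NEQ], k2[NEQ], k3[NEQ], k4[NEQ], tmp[NEQ];
    int i;
    f(t, y, k1);
    for (i = 0; i < NEQ; ++i) tmp[i] = y[i] + 0.5 * dt * k1[i];
    f(t + 0.5 * dt, tmp, k2);
    for (i = 0; i < NEQ; ++i) tmp[i] = y[i] + 0.5 * dt * k2[i];
    f(t + 0.5 * dt, tmp, k3);
    for (i = 0; i < NEQ; ++i) tmp[i] = y[i] + dt * k3[i];
    f(t + dt, tmp, k4);
    for (i = 0; i < NEQ; ++i)
        y[i] += dt * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;
}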
2.4 Input Protocol. The model is driven by fluctuations in [Ca2+] caused by presynaptic spikes. The time evolution of [Ca2+] is specified as a dual-exponential function, which is sometimes used to model synaptic conductance. The peak amplitude of a transient rise in [Ca2+] is assumed to be at most 1.0 µM, and time constants were chosen so that the time to peak is approximately 10.0 ms and the time to decay to baseline is 1.0 second. Repeated stimulation permits transients to sum temporally. The fluctuation in [Ca2+] generated by this simple model (given a structured and limited range of presynaptic input) is a reasonable approximation to more detailed kinetic models of [Ca2+] dynamics; for example, see the concentration object in GENESIS (Wilson, Bhalla, & Bower, 1990). The steady-state [Ca2+] is simply proportional to frequency.
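A sketch (not part of the original article) of one way to realize this input protocol. The two time constants are illustrative choices that give roughly the stated 10 ms time to peak and a decay to baseline within about 1 second; transients from successive spikes sum by simple addition of their contributions.

#include <math.h>

#define TAU_RISE  0.002   /* s; illustrative */
#define TAU_DECAY 0.150   /* s; illustrative */
#define CA_PEAK   1.0e-6  /* M; at most 1.0 uM per transient */

/* [Ca2+] contribution at time t of a single spike fired at t_spike;
   contributions from successive spikes are summed to model temporal
   summation. Normalized so that one transient peaks at CA_PEAK. */
double ca_transient(double t, double t_spike)
{
    double dt = t - t_spike;
    if (dt <= 0.0) return 0.0;
    double tp = (TAU_DECAY * TAU_RISE / (TAU_DECAY - TAU_RISE))
              * log(TAU_DECAY / TAU_RISE);              /* time of the peak */
    double peak = exp(-tp / TAU_DECAY) - exp(-tp / TAU_RISE);
    return CA_PEAK * (exp(-dt / TAU_DECAY) - exp(-dt / TAU_RISE)) / peak;
}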
3 Results

In the results reported below, we have assumed, unless stated otherwise, the following assignment of rate constants:

$k_1 = k_5 = 5.0 \times 10^6$ /M/s
$k_2 = 100.0$ /s
$k_3 = k_9 = 5.0 \times 10^3$ /M/s
$k_4 = 0.1$ /s
$k_6 = k_7 = k_8 = k_{10} = 1.0 \times 10^3$ /M/s
$k_{11} = k_{14} = 5.0 \times 10^5$ /M/s
$k_{17} = 1.0 \times 10^7$ /M/s
$k_{12} = k_{15} = k_{18} = 50.0$ /s
$k_{13} = k_{16} = k_{19} = 1.0$ /s
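For reference, a direct transcription (not part of the original article) of these default values into code form:

/* Default rate constants from the list above; association and
   autophosphorylation rates are in /M/s, dissociation and catalytic rates
   in /s. */
static const double K1  = 5.0e6, K5  = 5.0e6;               /* Ca2+/CaM binding      */
static const double K2  = 100.0;                            /* Ca2+/CaM unbinding    */
static const double K3  = 5.0e3, K9  = 5.0e3;               /* Thr286 autophos.      */
static const double K4  = 0.1;                              /* trapped-CaM release   */
static const double K6  = 1.0e3, K7  = 1.0e3,
                    K8  = 1.0e3, K10 = 1.0e3;               /* inhibitory-site autophos. */
static const double K11 = 5.0e5, K14 = 5.0e5, K17 = 1.0e7;  /* phosphatase binding   */
static const double K12 = 50.0,  K15 = 50.0,  K18 = 50.0;   /* phosphatase unbinding */
static const double K13 = 1.0,   K16 = 1.0,   K19 = 1.0;    /* dephosphorylation     */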
The default initial concentrations of CaMKII and phosphatase are taken to be 0.5 µM and 0.1 µM, respectively. As will be shown later, the initial concentration ratio of phosphatase to CaMKII determines the duration of enzymatic activity and influences switching behavior. However, to confirm the robustness of the results, the majority of simulations have been performed several times using other rate constants and concentrations selected from the ranges specified previously.

3.1 CaM Trapping and Autophosphorylation. Hanson et al. (1994) show that although each subunit is individually regulated by Ca2+/CaM, autophosphorylation at Thr286, and consequently CaM trapping, is a property of the oligomeric enzyme. Intersubunit autophosphorylation at Thr286 requires Ca2+/CaM to be bound to two neighboring subunits, where one subunit acts as the "kinase" and the other as the "substrate." Therefore, as the fraction of subunits with bound Ca2+/CaM increases, so does the chance of intersubunit autophosphorylation at Thr286. In other words, autophosphorylation at Thr286, and therefore CaM trapping, exhibits the characteristics of a cooperative process. Consequently, low-frequency stimuli (which evoke infrequent calcium transients) would lead to minimal kinase activation, but progressively increasing frequencies would lead to potentiation of kinase activity. To verify that such behavior is characteristic of our model, we conducted a simulation in which 100-spike trains of calcium transients were delivered to the model at frequencies ranging from 0 to 100 Hz. Furthermore, since the rate at which trapped CaM is released will play a critical part in the expression of cooperative behavior, we have repeated the simulation for a range of release rates ($k_4 = 1.0$, $10^{-1}$, $10^{-2}$, $10^{-3}$, and $10^{-4}$ /s). The results of the simulations are plotted in Figure 2, which shows the dependence of CaM trapping on the calcium spike frequency. The log scale is used for convenience. For all tested CaM release rates, low-frequency stimuli (LFS) (< 2 Hz) result in minimal Ca2+-dependent kinase activity (≲ 20% of subunits with trapped CaM, $S_1C$). Intermediate frequencies (5–20 Hz) promote a greater level of kinase activity (∼ 20–60% $S_1C$), and high-frequency stimuli (HFS) (50–100 Hz) result in near-maximal kinase activity (> 70% $S_1C$). These results are qualitatively identical to those reported in Hanson et al. (1994).
Figure 2: The dependence of CaM trapping on spike frequency. The model was subjected to 100-spike calcium transients at different frequencies for a range of CaM release rates ($k_4 = 1.0$, $0.1$, $10^{-2}$, $10^{-3}$, and $10^{-4}$ /s). CaM trapping and its ability to potentiate kinase activity with the availability of free Ca2+/CaM is evident for high-frequency stimulation.
In that study, it was shown that CaM trapping provides a mechanism to detect the frequency of calcium spikes, but only when the CaM concentration is limiting. If the concentration of CaM is in saturation, prolonged low-frequency calcium spikes will progressively increase the yield of trapped CaM, since trapped CaM does not dissociate significantly between consecutive calcium spikes. In other words, as pointed out by Dosemeci and Albers (1996), the effects of prolonged LFS and brief HFS are the same in that model. In this study, the form of the plots in Figure 2 does not depend on a limited concentration of free CaM. Whether CaM is in excess or limited, prolonged LFS will yield relatively little trapped CaM in comparison to brief HFS. This is solely a consequence of inhibitory autophosphorylation taking place preferentially during LFS, as will be discussed in the following section. Importantly, the model is robust against changes in the CaM release rate. That is, the frequency ranges seen in Figure 2 agree with those of LTP and LTD (long-term depression) induction if more realistic values of $k_4$ are chosen. Altering the background phosphatase activity may change the profile of the plot. Strong phosphatase inhibition would allow LFS to exert a greater effect on kinase activity. Alternatively, robust phosphatase activity would liberate phosphate from Thr286 and oppose CaM trapping.

3.2 Inhibitory Autophosphorylation. In the model, provision has also been made for autophosphorylation of inhibitory sites, namely, Thr305 and
Thr306. Both intra- and intersubunit autophosphorylation of the latter site is believed to occur (Mukherji & Soderling, 1994; Hanson et al., 1994), which blocks the CaM binding site. Miller and Kennedy (1986) and Lou and Schulman (1989) show that a substoichiometric level of Ca2+/CaM-dependent autophosphorylation can precipitate maximal Ca2+-independent autophosphorylation. In a recent study (Dosemeci et al., 1994), evidence of such autophosphorylation was observed on Thr306 in the PSD-associated CaMKII. To explore the regulatory consequences and the conditions under which inhibitory residues are autophosphorylated, we conducted a series of simulations. Using the same parameters as the previous simulation, the peak fraction of active ($S^*$) and inhibited ($S_x$) subunits after 100 spikes at a particular frequency was determined. Figure 3 shows that inhibitory autophosphorylation does not increase with stimulus frequency, but rises from zero to a maximum level (∼ 50%) at approximately 1 Hz and then falls away as the frequency is increased. The simulations are also robust in the face of several orders of magnitude difference in the CaM release rate. Since we have assumed that subunits engage in nearest-neighbor autophosphorylation in the four-subunit model, a maximum of approximately 50% inhibitory autophosphorylation is expected. That is, two activated subunits are required to inhibit the remaining two subunits. However, if we assume that an activated subunit may inhibit any inactive subunit (remote autophosphorylation), a greater percentage (∼ 70%) of inhibited subunits is achieved for LFS. This is confirmed by our simulations (data not shown). Figures 2 and 3 together indicate that two frequency-dependent modes of autophosphorylation may take place. In other words, as evidenced by Dosemeci et al. (1994) and illustrated by a related probabilistic model (Dosemeci & Albers, 1996), two classes of autophosphorylation sites may exist in CaMKII subunits. One class (Thr286) is preferentially autophosphorylated under high [Ca2+], and the other (possibly Thr306 and others) is autophosphorylated under low [Ca2+]. In the first case, HFS would generate a large influx of Ca2+, which would saturate the CaM-binding sites of CaMKII subunits. This in turn would yield significant Ca2+-dependent autophosphorylation at Thr286 and subunit autonomy. However, in the second case, LFS may cause only sparse binding of Ca2+/CaM, allowing the remaining nonautonomous subunits to become autophosphorylated at Thr306, thereby locking them in an inhibited state. This self-regulatory scheme may have interesting consequences for experimental memory models, such as LTP and LTD. There is strong evidence that CaMKII activation is necessary for the induction of LTP in the postsynaptic neuron (Pettit, Perlman, & Malinow, 1994), especially for early LTP. In the hippocampus, LTP is typically expressed following a short burst of HFS, which is linked to a rise in [Ca2+] (Bliss & Collingridge, 1993). However, it is still a matter of debate whether a rise in Ca2+ is sufficient in itself, since other factors, such as activation of G-protein-coupled metabotropic glutamate receptors, appear to be important (Bliss & Collingridge, 1993).
Figure 3: Plots showing the dependence of CaMKII inhibition on calcium spike frequency. The stimulation protocol was the same as that in Figure 2, along with the range of values for the release rate of trapped CaM. The plot indicates that Thr286 is selectively autophosphorylated during high-frequency stimulation, whereas the inhibitory site, Thr306 of a nonautonomous subunit, is selectively autophosphorylated during low-frequency stimulation.
In some studies, LTP is induced apparently without elevating [Ca2+] (Frey, Huang, & Kandel, 1993). Therefore, under certain conditions, the fraction of activated CaMKII would be expected to exceed that of inhibited CaMKII, as predicted by our model. LTD, on the other hand, is typically induced by sustained low-frequency stimulation, which, according to our simulation results, would serve to disable CaMKII by enabling selective autophosphorylation of inhibitory sites, such as Thr306, possibly permitting LTD-specific phosphatases to downgrade synaptic efficacy. Furthermore, inhibition of CaMKII activity by prior LFS may change the response of the kinase to subsequent HFS. The degree to which this occurs would depend on the rate at which Thr306 is dephosphorylated by endogenous protein phosphatases. Indeed, recent studies have shown that prior LFS may suppress the induction of LTP (Huang, Colino, Selig, & Malenka, 1992; Fujii et al., 1996).

3.3 Persistent Activity and Inhibition. It has been proposed that persistent autophosphorylation in the absence of Ca2+/CaM may allow CaMKII to act as a molecular switch (Lisman, 1985; Miller & Kennedy, 1986), even in the presence of phosphatase activity. That is, as long as Ca2+-independent autophosphorylation outpaces dephosphorylation, kinase activity would be prolonged. Furthermore, whether the switch is activated might depend on the stimulation frequency.
Figure 4: Simplified CaMKII regulatory model proposed by Matsushita et al. (1995). An initial Ca2+-dependent event triggers autophosphorylation throughout the enzyme. The model was shown to implement a molecular switch dependent on the intensity of Ca2+ spikes. $K_0$: inactive enzyme; $K_0C$: enzyme with bound Ca2+/CaM; $K_i$: autonomous enzyme with $i$ autophosphorylated sites.
That is, a minimal degree of Ca2+-dependent autophosphorylation might not be sufficient to cross the threshold needed to oppose phosphatase activity. In this section we investigate the ability of the model to express a molecular switch and the function of inhibitory autophosphorylation in this process. There are three ways CaMKII activity may be prolonged following cessation of the activating stimulus. The first involves CaM trapping, which can prolong kinase activity for many seconds (Meyer et al., 1992; Putkey & Waxham, 1996). The second involves persistent Ca2+-independent autophosphorylation, which can prolong kinase activity for hours (Fukunaga et al., 1993; Suzuki, 1994). The third involves limited proteolysis of the kinase by calpain (Rich, Schworer, Colbran, & Soderling, 1990). In our model, persistent kinase activity is described by Reactions 4 and 7, which model Ca2+-independent intrasubunit and intersubunit autophosphorylation, respectively. Proteolytic activation is not considered here. An investigation of persistent kinase activity was undertaken by Matsushita et al. (1995) using a simple chain of phosphorylation and dephosphorylation reactions. However, the stimulation protocol in this modeling study was based on the magnitude of the Ca2+ signal rather than its frequency. They show that under certain conditions, and in a manner dependent on the intensity of synaptic input, a chain of phosphorylation and dephosphorylation reactions can implement quasi-switching behavior. Their model is based on the observation that after an initial, limited Ca2+-dependent event, autophosphorylation may proceed throughout the holoenzyme in a Ca2+-independent manner (Miller & Kennedy, 1986; Lou & Schulman, 1989). A simplified representation of their model is shown in Figure 4, in which the maximum number of autophosphorylation sites is restricted to 10. The nonautonomous enzyme $K_0$ is activated by binding $C$ (Ca2+/CaM) to yield $K_0C$. Upon dissociation of this complex, the enzyme proceeds through
a series of autophosphorylation steps, which compete with a local phosphatase. Their model omits two important factors that may influence the expression of a molecular switch. First, autophosphorylation is associated only with the activation of subunits; it does not consider inhibitory autophosphorylation, which may have a significant bearing on switching behavior. Second, the model does not explicitly consider the arrangement of subunits in the oligomeric enzyme, which is believed to play an important role in regulating enzymatic activity by promoting cooperative subunit activation.

Our model was used to examine persistent kinase activity and its dependence on stimulation frequency. As in previous experiments, 100-spike trains of stimuli were delivered to the model at different frequencies; in this case, 1, 5, 10, 20, and 100 Hz spike trains were chosen. The results of this experiment are shown in Figure 5. The essential finding is that a threshold exists for expression of a molecular switch. Figure 5A indicates that attenuation of kinase activity after cessation of the spike train is dependent on the frequency of the stimuli. Furthermore, kinase activity is correlated with the fraction of subunits with trapped CaM. A 100-second train of 1 Hz stimuli activates less than 20% of the subunits, which deactivate rapidly (within 60 seconds) upon removal of the stimulus. The same number of stimuli at 20 Hz leads to activation of approximately 50% of the subunits. However, even after removal of the spike train, kinase activity persists via Ca2+-independent autophosphorylation (Reactions 4 and 7) for more than 10 minutes. At a higher frequency of 100 Hz, kinase activity outlasts the initial trigger by almost 20 minutes. Therefore, increasing the frequency of the stimuli prolongs kinase activity, which may last for hours if phosphatase activity is suppressed. This behavior may be viewed as an analog molecular switch in which LFS leads to a partial, brief activation of the switch, and HFS leads to robust, persistent activation. When we use the term switch, we refer to the ability of the kinase to prolong its activity following a threshold level of frequency-dependent stimulation. The switch may not be permanent, but it may remain activated for minutes to hours. The kinase is fully activated (switched on) only by HFS.

Figure 5B shows the fraction of inhibited subunits for each of the above stimulus frequencies. In this case, dephosphorylation of Sx operates at 1.0 × 10^4 /M/s. The greatest inhibitory autophosphorylation occurs with LFS and the least with HFS. Moreover, inhibitory autophosphorylation continues as long as autonomous subunits remain and subsequently influences the number of subunits available for Ca2+-dependent activation. The extent to which inhibited subunits affect subsequent kinase activation will depend on the rate at which they are dephosphorylated. To explore this situation further, the rate of dephosphorylation was varied. Figures 6A and 6B show the result of decreasing the rate of dephosphorylation to 1.0 × 10^3 /M/s. The time course of the plot in Figure 6A is largely the same as that in Figure 5A.
Figure 5: Plots showing the time course of CaMKII activation and inhibition for 1, 5, 10, 20, and 100 Hz calcium spikes. Fraction of (A) active and (B) inhibited subunits. Ca2+-independent autophosphorylation significantly prolongs kinase activity for high-frequency spikes. However, a greater magnitude of subunit inhibition occurs for low-frequency spikes. The dephosphorylation rate is 10^4 /M/s.
However, there is a significant difference between the plots of inhibitory autophosphorylation. For example, in Figure 5B, following the 1 Hz spike train, inhibited subunits are dephosphorylated in just over 100 seconds, whereas this period is extended to over 1500 seconds in Figure 6B. The decay of inhibitory activity is commensurately prolonged for the other frequencies. This behavior has the potential to influence later synaptic activity and may partly explain how prior LFS, which fails to induce LTP, can prevent the induction of LTP by subsequent HFS (Huang et al., 1992). Therefore, if dephosphorylation of inhibited subunits were reasonably slow, then LFS would disable CaMKII activity for some time.
HFS would still enable significant, persistent Ca2+-independent activity to occur, as there is little difference between the plots in Figures 5A and 6A. If the rate at which inhibited subunits are dephosphorylated is increased to 1.0 × 10^5 /M/s, as in Figure 7, then the inhibitory activity does not significantly outlast the Ca2+-independent activity. Although inhibitory sites are selectively autophosphorylated during LFS (see Figure 7B), the rapid rate of their dephosphorylation permits an increasing amount of Ca2+-dependent activity to occur over the length of the spike train. This is evidenced by the increasing fraction of activated subunits for the 1 Hz spike train in Figure 7A. This simulation therefore associates LFS with LTP, which apparently contradicts previous observations if we assume that CaMKII activation is essential for LTP and that CaMKII is significantly activated only by HFS. However, there is evidence that LTP can be elicited without HFS (Gustafsson, Wigstrom, Abraham, & Huang, 1987). Depending on which local phosphatases are triggered and on their specific kinetic properties, it may be possible for either LFS or HFS to activate CaMKII. The model predicts that if prior synaptic activity is to have a bearing on subsequent activity, then the rate at which inhibitory sites are dephosphorylated should be reasonably slow, or phosphatase activity should be suppressed.

In the above simulations, the fraction of subunits with bound phosphatase (PP) is always less than about 0.06; this is also typical of subsequent simulations. Therefore, S0 + Sx ≈ Stotal − S*, where Stotal is the total concentration of CaMKII in the system. This is a consequence of the chosen phosphatase binding kinetics. The initial concentration ratio of CaMKII-specific phosphatase to CaMKII (R = [PP]0/[K]0) was varied to explore the expression of activated and inhibited subunits. In this experiment, 100 calcium spikes at frequencies of 1, 5, 10, 25, 50, and 100 Hz were delivered to the model, R was varied from 0.1 to 1.0, and [K]0 was varied from 0.1 µM to 0.5 µM. This confirmed the robustness of the suggestion that inhibitory sites are selectively autophosphorylated at low frequencies. Figure 8 shows two extreme cases immediately after the spike train ceases, for a 1 Hz (Figure 8A) and a 50 Hz (Figure 8B) spike train. In each graph, the fraction of corresponding activated and inhibited subunits is shown for three initial concentrations of CaMKII over the specified range of R. The frequency dependence of autophosphorylation is clearly illustrated. In Figure 8A, inhibited subunits surpass activated subunits, whereas in Figure 8B, the opposite is true. For the intervening frequencies, the balance between inhibited and activated subunits shifts in favor of the latter as the stimulation frequency increases. The relative difference between these two subunit states is determined by the rate constants controlling phosphorylation and dephosphorylation. The ratio R is most important in controlling persistent activity and quasi-switching behavior (Matsushita et al., 1995), which we will now show.
Figure 6: Plots showing the time course of CaMKII activation and inhibition in the presence of a decreased dephosphorylation rate (1.0 × 10^3 /M/s) at Thr306 of a nonautonomous subunit. Other parameters are identical to those in Figure 5. Fraction of (A) active and (B) inhibited subunits. The time course of active subunits is largely the same as that in Figure 5A. However, more pronounced and longer-lasting inhibitory activity is present.
Figure 7: Plots showing the time course of CaMKII activation and inhibition in the presence of an increased dephosphorylation rate (1.0 × 10^5 /M/s) at Thr306 of a nonautonomous subunit. Other parameters are identical to those in Figure 5. Fraction of (A) active and (B) inhibited subunits. Active-subunit levels are sustained for all spike frequencies, and inhibitory activity is suppressed. This plot associates low-frequency spikes with CaMKII activation, in contrast to the previous simulations.
Figure 8: Plots showing the fraction of active and inhibited subunits for low- and high-frequency calcium spikes for several initial concentrations of CaMKII ([K]0) and ratios of phosphatase ([PP]0) to CaMKII. For all initial conditions, low-frequency stimulation promotes kinase inhibition, whereas high-frequency stimulation promotes kinase activation.
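To make this frequency dependence concrete, the following minimal sketch integrates just two lumped pools of subunits, an autonomous (active) pool and an inhibited pool, driven by a regular train of Ca2+ spikes. It is not the subunit-resolved model of this article: the two-pool reduction, the cooperative (quadratic) activation term, the function name `run`, and every rate constant are illustrative assumptions chosen only to reproduce the qualitative trend of Figure 8.

```python
# Toy two-pool reduction of CaMKII: a = fraction autonomous (active),
# x = fraction inhibited (Thr306-like), c = fraction with bound Ca2+/CaM.
# All rate constants below are illustrative assumptions (units: per second).
K_ON = 0.3      # per-spike CaM loading of available subunits
TAU_C = 0.2     # s, Ca2+/CaM dissociation time constant
K_COOP = 1.0    # cooperative (intersubunit) activation, scales with c**2
K_INH = 0.05    # non-cooperative inhibitory autophosphorylation, scales with c
K_PA = 0.1      # dephosphorylation of the autonomy site
K_PX = 0.005    # slow dephosphorylation of the inhibitory site

def run(freq_hz, n_spikes=100, dt=1e-3):
    a = x = c = 0.0
    steps = int(round(1.0 / freq_hz / dt))
    for _ in range(n_spikes):
        c += K_ON * (1.0 - c)               # a Ca2+ spike loads CaM
        for _ in range(steps):              # inter-spike kinetics
            free = max(1.0 - a - x, 0.0)
            a += (K_COOP * c * c * free - K_PA * a) * dt
            x += (K_INH * c * free - K_PX * x) * dt
            c -= (c / TAU_C) * dt
    return a, x

for f in (1, 5, 10, 20, 100):
    a, x = run(f)
    print(f"{f:3d} Hz: active {a:.2f}  inhibited {x:.2f}")
```

Because activation here requires two CaM-bound neighbors (the c**2 term) while inhibition requires only one, slow spike trains let CaM dissociate between spikes and accumulate inhibition, whereas fast trains keep c high and activation dominant, echoing the simulated behavior above.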
3.4 Switching Behavior. Huber, Mauk, Thompson, and Kelly (1995) showed that kinase activity is essential during approximately the first 20 minutes after HFS for the induction of stable LTP. However, subsequent data from the same laboratory indicate that the activity of postsynaptic CaMKII is essential for both induction and maintenance of LTP (Wang & Kelly, 1996). In the following simulation, we initially assume that 20 minutes is the cutoff point for critical kinase activity. The same 100-spike stimulation frequencies as used previously were applied to the model for different initial concentrations of kinase [K]0 and
phosphatase [PP]0, and the model was allowed to evolve for 20 minutes. If, after 20 minutes, at least 10% of subunits were autonomous, we deemed the molecular switch to be activated. The results of this simulation are shown in Figure 9. The choice of 10% as the cutoff point for autonomy is purely arbitrary: if a smaller percentage were chosen, more expansive activated (light-colored) regions would appear, whereas a greater percentage would mean retraction of these regions. Nevertheless, at identical points in time, plots originating from HFS will display more persistent activity for a given [K]0 and [PP]0 than those originating from LFS. Therefore, in the chosen case, the light area in each plot indicates that at least 10% of subunits are autonomous for a particular [K]0 and initial ratio R = [PP]0/[K]0. The interface between light (activated) and dark (not significantly activated) regions may be interpreted as the transition zone for turning the molecular switch on or off. The plots record activity 20 minutes into the simulation for frequencies of 1, 10, 25, and 100 Hz, respectively. The most prominent kinase activity is associated only with the 25 and 100 Hz stimuli. As time advances, the bands of kinase activity retreat toward small R and small [K]0 for all frequencies. As would be expected, kinase activity can be prolonged by restricting the concentration of available phosphatase. The quickest decay occurs for low-frequency stimuli, which fail to promote sufficient Ca2+-independent autophosphorylation to outpace dephosphorylation.

The remark above concerning the involvement of CaMKII in LTP maintenance (beyond 20 minutes) gives grounds for revising the interpretation of this simulation experiment. If we relax the 20-minute cutoff restriction and consider the conditions that yield extended kinase activity (hours to days) for LTP maintenance, then we require small R and HFS. Under these conditions, the bands of kinase activity depicted in the 100 Hz plot decay very slowly and can be made to persist for days. Consequently, in this situation, where persistent CaMKII activity is essential for LTP maintenance, it is more appropriate to view the kinase as a simple on-off switch instead of a dynamic switch. Further experimental work is needed to clarify this issue.

These results suggest that CaMKII may behave as a molecular switch that is driven by the frequency of synaptic input. Furthermore, by changing the ratio R, the time course of kinase activity (how long the switch remains on) can be significantly influenced. These results extend previous research (Matsushita et al., 1995), which investigated the intensity of stimuli rather than their frequency. As in that related work, the critical parameters for controlling the expression of a switch are the rate constants associated with dephosphorylation, in particular k11, k14, and k17, which specify the forward rate of phosphatase binding to a subunit. If these rate constants are large (at least 1.0 × 10^5 /M/s), switching behavior is clearly expressed, as in Figure 5A. However, smaller rate constants make the curves taper off more gently, rather than showing the abrupt decay observed at the tail of the curves in Figure 5A.
Figure 9: Expression of a molecular switch. The duration for which the switch remains on is subject to the Ca2+ spike frequency, the initial concentration of CaMKII, and the initial ratio of phosphatase to CaMKII. The plots show kinase activity 20 minutes after cessation of 100 calcium spikes at different frequencies. Light shade indicates that at least 10% of kinase is active; dark shade indicates less than this percentage is active. High-frequency stimulation gives rise to the most persistent kinase activity.
Nevertheless, higher-frequency stimuli extend the duration of kinase activity by promoting strong Ca2+-independent autophosphorylation that opposes dephosphorylation.

3.5 Subunit Inhibition, Prior Synaptic Activity, and LTP. Persistent kinase activity may also be accompanied by varying degrees of inhibitory autophosphorylation. This affects the number of subunits available for autonomous activity and may therefore abbreviate the expression of switching behavior. Furthermore, inhibitory autophosphorylation of CaMKII may affect LTP, since the enzyme is known to play a critical role in the induction of this form of synaptic plasticity. In particular, by increasing the fraction of inhibited subunits, it may be possible to raise the threshold for the induction of LTP. In a recent study (Mayford, Wang, Kandel, & O'Dell, 1995), a transgene modified at Thr286 of the α subunit of CaMKII was introduced with the aim of enhancing Ca2+-independent activity. Although HFS evoked LTP akin to that observed in wild-type mice, LFS that would normally produce a slight depression engendered strong depression. This phenomenon was attributed to autophosphorylated CaMKII sequestering a large amount of CaM, thereby reducing the pool of free CaM available to activate other CaM-dependent enzymes (Bear, 1995). Therefore, only enzymes with a high affinity for CaM, such as phosphatase 2B, would become activated and consequently lead to LTD. Alternatively, our model predicts that prior LFS may disable CaMKII so that subsequent HFS would not be capable of inducing LTP. To test this hypothesis, a weak tetanus (four 0.2-second bursts at 30 Hz, one every 60 seconds) was administered to the model, followed by an HFS (100 Hz for 1 second). The results of this simulation are shown in Figure 10.
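For readers who want to reproduce this protocol, the snippet below builds Ca2+-spike time stamps for the weak-tetanus-then-HFS experiment just described. The helper name `burst` is ours, and the 60-second gap between the last weak burst and the HFS onset is an assumption; the text specifies only the burst parameters and that the strong tetanus follows.

```python
import numpy as np

def burst(start_s, freq_hz, duration_s):
    """Spike times (seconds) of a regular burst: duration_s at freq_hz."""
    n = int(round(freq_hz * duration_s))
    return start_s + np.arange(n) / freq_hz

# Prior weak tetanus: four 0.2-second, 30 Hz bursts at 60-second intervals.
weak = np.concatenate([burst(t0, 30.0, 0.2) for t0 in (0.0, 60.0, 120.0, 180.0)])

# Strong tetanus delivered afterward: 100 Hz for 1 second.
hfs_onset = 240.0  # assumed gap after the last weak burst
strong = burst(hfs_onset, 100.0, 1.0)

ca_spike_times = np.concatenate([weak, strong])
print(f"{weak.size} weak-tetanus spikes, {strong.size} HFS spikes")
```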
Figure 10: Quenching of the response to a high-frequency tetanus by a prior weak tetanus. A weak tetanus (30 Hz, 0.2 second) was repeated four times at 60-second intervals. A subsequent strong tetanus (100 Hz, 1 second), which would normally evoke persistent kinase activity (and LTP), failed to produce a significant response. The prior weak tetanus increases the fraction of inhibited subunits, thereby limiting the availability of subunits for Ca2+-dependent activation and for intersubunit Ca2+-independent autophosphorylation.
Normally, the HFS alone would be sufficient to induce LTP and a strong activation of CaMKII (see Figure 5A). However, the prior weak stimulus causes approximately one-third of the subunits to undergo inhibitory autophosphorylation, which effectively annuls the brief HFS. This result is qualitatively the same as that reported by Huang et al. (1992), in which a prior weak tetanus (30 Hz for 0.1–0.2 seconds) repeated four to six times at 2-minute intervals prevents a subsequent HFS from inducing LTP. More recently, it was found that LFS (200–1000 pulses at 1 Hz) delivered 60 minutes prior to HFS evoked only short-term potentiation (Fujii et al., 1996). In both of these studies, the involvement of NMDA receptor–gated channels is implicated in the suppression of LTP. The inhibition may be overcome by a stronger than usual HFS that would enlist the remaining uninhibited subunits. In other words, the frequency-dependent threshold for the induction of LTD could be either raised or lowered by modifying the level of subunit inhibition by prior synaptic activity. Therefore, some of the enhanced Ca2+-independent activity associated with transgene insertion (Mayford et al., 1995) may be attributable to inhibitory autophosphorylation. This offers an alternative explanation for the change in the frequency-response function of the transgenic animals. However, other studies have reported bidirectional long-term modification of synaptic efficacy by alternating stimulation protocols (Dudek & Bear, 1993). That is, synaptic effectiveness could be upgraded or downgraded in
succession without prior LFS suppressing the induction of LTP. The point to be made is that in any experiment, the characteristics of the stimulation protocol, the experimental conditions, the brain region, and the animal species may all have a bearing on the type of synaptic modification observed. In this regard, our simulations support one particular set of experiments. Furthermore, we are considering only one enzyme among the many that influence the induction of LTP. It is certainly possible that under specific conditions, other enzymes may be activated that neutralize the effect of prior LFS. Our simulations suggest that if prior LFS does disable CaMKII, then this regulatory mechanism may provide a way to suppress the induction of LTP. This suggestion deserves further investigation by both experimentalists and simulationists.

4 Discussion

In this study, we investigated a model of CaMKII in which autophosphorylation may either prolong kinase activity or inhibit it. We have shown that both subunit inhibition and persistent kinase activity can be manifested within a single model, in a manner governed by Ca2+-spike frequency: LFS is associated with subunit inhibition, whereas HFS is associated with subunit activation. Our model unifies previous efforts (Matsushita et al., 1995; Dosemeci & Albers, 1996; Coomber, in press) by considering physiologically based mechanisms of autophosphorylation, kinetic constraints, and subunit geometry. An interesting result is that a limited number of autophosphorylation sites is sufficient to sustain persistent kinase activity in the presence of a cellular phosphatase. Specifically, the model has an autonomy site, which is autophosphorylated in an intersubunit reaction, and two other sites, which are autophosphorylated in both intra- and intersubunit reactions. If intersubunit autophosphorylation at the latter two sites is prevented, there is still considerable persistent kinase activity (Coomber, in press). Therefore, a chain of only three autophosphorylation sites can facilitate Ca2+-independent activity and is sufficient to express switching behavior. This contrasts with an earlier report (Matsushita et al., 1995) in which ten interacting sites were used to implement a molecular switch. That only a small number of autophosphorylation sites can prolong kinase activity may have implications for other protein kinases with significantly fewer subunits than CaMKII, for example, protein kinase C (PKC), which, like CaMKII, has a role in LTP (Abeliovich et al., 1993).

In addition to supporting the view that cooperative autophosphorylation may serve to detect the frequency of Ca2+ spikes (Hanson et al., 1994), this study suggests that prolonged LFS may disable CaMKII activity. This would permit modulation of the response to subsequent HFS. Dosemeci and Albers (1996) first tested this hypothesis by investigating a simple probabilistic model of CaMKII activity, but neglected the contribution of Ca2+-independent autophosphorylation. Nevertheless, their main conclusion
that site-selective autophosphorylation may enable CaMKII to discriminate between synaptic activation frequencies is confirmed by our results. If inhibitory autophosphorylation of a nonautonomous subunit is removed, then a brief HFS can have the same outcome as a prolonged LFS; that is, in both cases, strong kinase activation occurs. This is a consequence of CaM trapping, since even at low frequencies (0.5 Hz) the rate at which trapped Ca2+/CaM dissociates is slower than the rate at which Ca2+/CaM is recruited. This is also an inherent problem of the 10-site model (Matsushita et al., 1995) and is at odds with experimental observations, since to our knowledge LFS is not generally associated with CaMKII activation in the postsynaptic neuron (Fukunaga, Muller, & Miyamoto, 1995). If the number of subunits in the model is increased from four to six, the same situation prevails; the phenomenon is thus not an artifact of the restricted number of subunits. The only way to achieve frequency-dependent CaM trapping is to have a rate-limiting concentration of free CaM (Hanson et al., 1994). There is evidence that a sizable fraction of CaM may be bound to neurogranin and other CaM-binding proteins and that its release is facilitated by tetanic stimulation (Pasinelli et al., 1994). This constraint is not required for our model as long as inhibitory autophosphorylation is permitted. Furthermore, CaM trapping (without attendant inhibitory autophosphorylation) weakens the ability of CaMKII to serve solely as a Ca2+-frequency detector, because the large time constant of Ca2+/CaM dissociation precludes discrimination of fine-grained LFS (Dosemeci & Albers, 1996). The inclusion of inhibitory autophosphorylation solves this problem by allowing synaptic activity to be encoded as a balance between active and inhibited subunits, which may then also influence subsequent synaptic activity. In accord with experimental observations, only the HFS associated with LTP elicits strong and persistent kinase activity in our model, whereas LFS elicits only short-term activity, most often confined to the time course of the stimulus.

The inhibition of CaMKII activity during LFS may also be bolstered by the selective activation of certain phosphatases. For example, both calcineurin and phosphatase 1 are involved in the dephosphorylation of CaMKII subunits. The former is involved indirectly: it dephosphorylates inhibitor-1, thereby relieving inhibition of phosphatase 1, which may then dephosphorylate CaMKII subunits (Mulkey, Endo, Shenolikar, & Malenka, 1994). These phosphatases are also implicated in LTD and hence may become activated under low-frequency stimulation (Mulkey, Herron, & Malenka, 1993). Therefore, the stimulation frequency may target the activation (or inhibition) of particular kinase and phosphatase systems. The balance between phosphorylation and dephosphorylation would then determine whether synaptic efficacy is upgraded or downgraded. We propose that CaMKII may serve as a frequency-dependent switch that is triggered preferentially by HFS and records prior LFS by undergoing various degrees of inhibition. Once strongly activated, the enzyme would
continue to phosphorylate its substrate proteins, such as glutamate receptors, for some time while changes that are more permanent consolidate LTP (Desmond & Levy, 1990).

5 Conclusion

This study has yielded some interesting hypotheses about the sort of information processing that might be performed by CaMKII. Many of these ideas could be tested by designing markers that target particular phosphorylated subunit types or by experimenting on mutant mice that express an inability to phosphorylate certain residues. In the future, the model will be extended to include slow, inhibitory basal autophosphorylation (Colbran, 1993). A simplified model is also being designed for single-neuron and network compartmental simulations.

References

Abeliovich, A., Chen, C., Goda, Y., Silva, A. J., Stevens, C. F., & Tonegawa, S. (1993). Modified hippocampal long-term potentiation in PKCγ-mutant mice. Cell, 75, 1253–1262.
Antoine, M., Gaiddon, C., & Loeffler, J. P. (1996). Ca2+/calmodulin kinase types II and IV regulate c-fos transcription in the AtT20 corticotroph cell line. Molec. Cell. Endocrinology, 120, 1–8.
Bear, M. F. (1995). Mechanism for a sliding synaptic modification threshold. Neuron, 15, 1–4.
Benfenati, F., Valtorta, F., Rubenstein, J. L., Gorelick, F. S., Greengard, P., & Czernik, A. J. (1992). Synaptic vesicle-associated Ca2+/calmodulin-dependent protein kinase II is a binding protein for synapsin I. Nature, 359, 417–420.
Bliss, T. V. B., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Colbran, R. J. (1993). Inactivation of Ca2+/calmodulin-dependent protein kinase II by basal autophosphorylation. J. Biol. Chem., 268, 7163–7170.
Colbran, R. J., & Soderling, T. R. (1990). Calcium/calmodulin-dependent protein kinase II. Curr. Top. Cell. Reg., 31, 181–221.
Coomber, C. J. (in press). Simulation of Ca2+/calmodulin-dependent protein kinase II. Computers and Chemistry. Amsterdam: Elsevier, North-Holland.
Desmond, N. L., & Levy, W. B. (1990). Morphological correlates of long-term potentiation imply the modification of existing synapses, not synaptogenesis, in the hippocampal dentate gyrus. Synapse, 5, 139–143.
Dosemeci, A., & Albers, R. W. (1996). A mechanism for synaptic frequency detection through autophosphorylation of CaM kinase II. Biophys. J., 70, 2493–2501.
Dosemeci, A., Gollop, N., & Jaffe, H. (1994). Identification of a major autophosphorylation site on postsynaptic density-associated Ca2+/calmodulin-dependent protein kinase. J. Biol. Chem., 269, 31330–31333.
Dudek, S. M., & Bear, M. F. (1993). Bi-directional long-term modification of synaptic effectiveness in the adult and immature hippocampus. J. Neurosci., 13, 2910–2918.
Fersht, A. (1985). Enzyme structure and mechanism. New York: Freeman.
Frey, U., Huang, Y.-Y., & Kandel, E. R. (1993). Effects of cAMP simulate a late stage of LTP in hippocampal CA1 neurons. Science, 260, 1661–1664.
Fujii, S., Kuroda, Y., Miura, M., Furuse, H., Sasaki, H., Kaneko, K., Ito, K.-I., Chen, Z., & Kato, H. (1996). The long-term suppressive effect of prior activation of synaptic inputs by low-frequency stimulation on induction of long-term potentiation in CA1 neurons of guinea pig hippocampal slices. Exp. Brain Res., 111, 305–312.
Fukunaga, K., Muller, D., & Miyamoto, E. (1995). Increased phosphorylation of Ca2+/calmodulin-dependent protein kinase II and its endogenous substrates in the induction of long-term potentiation. J. Biol. Chem., 270(11), 6119–6124.
Fukunaga, K., Stoppini, L., Miyamoto, E., & Muller, D. (1993). Long-term potentiation is associated with an increased activity of Ca2+/calmodulin-dependent protein kinase. J. Biol. Chem., 268, 7863–7867.
Gustafsson, B., Wigstrom, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. J. Neurosci., 7, 774–780.
Hanson, P. I., Meyer, T., Stryer, L., & Schulman, H. (1994). Dual role of calmodulin in autophosphorylation of multifunctional CaM kinase may underlie decoding of calcium signals. Neuron, 12, 943–956.
Hanson, P. I., & Schulman, H. (1992). Neuronal Ca2+/calmodulin-dependent protein kinases. Ann. Rev. Biochem., 61, 559–601.
Huang, Y.-Y., Colino, A., Selig, D. K., & Malenka, R. C. (1992). The influence of prior synaptic activity on the induction of long-term potentiation. Science, 255, 730–733.
Huber, K. M., Mauk, M. D., Thompson, C., & Kelly, P. T. (1995). A critical period of protein kinase activity after tetanic stimulation is required for the induction of long-term potentiation. Learning and Memory, 2, 81–100.
Ikeda, A., Okuno, S., & Fujisawa, H. (1991). Studies on the generation of Ca2+/calmodulin-independent activity of calmodulin-dependent protein kinase II by autophosphorylation. J. Biol. Chem., 266, 11582–11588.
Kanaseki, T., Ikeuchi, Y., Sugiura, H., & Yamauchi, T. (1991). Structural features of Ca2+/calmodulin-dependent protein kinase II revealed by electron microscopy. J. Cell Biol., 115, 1049–1060.
Kirtley, M. E., & Koshland, D. E. (1967). Models for cooperative effects in proteins containing subunits. J. Biol. Chem., 242, 4192–4205.
Koch, C., & Zador, A. (1993). The function of dendritic spines: Devices subserving biochemical rather than electrical compartmentalization. J. Neurosci., 13, 413–422.
Lisman, J. E. (1985). A mechanism for memory storage insensitive to molecular turnover: A bistable autophosphorylating kinase. Proc. Natl. Acad. Sci. USA, 82, 3055–3057.
Lisman, J. (1994). The CaM kinase II hypothesis for the storage of synaptic memory. Trends Neurosci., 17, 406–412.
Lisman, J. E., & Goldring, M. A. (1988). Feasibility of long-term storage of graded information by the Ca2+/calmodulin-dependent protein kinase molecules of the postsynaptic density. Proc. Natl. Acad. Sci. USA, 85, 5320–5324.
Lou, L. L., & Schulman, H. (1989). Distinct autophosphorylation sites sequentially produce autonomy and inhibition of the multifunctional Ca2+/calmodulin-dependent protein kinase. J. Neurosci., 9, 2020–2032.
Malinow, R., Madison, D. V., & Tsien, R. W. (1988). Persistent protein kinase activity underlying long-term potentiation. Nature, 335, 820–824.
Matsushita, T., Moriyama, S., & Fukai, T. (1995). Switching dynamics and the transient memory storage in a model enzyme network involving Ca2+/calmodulin-dependent protein kinase II in synapses. Biol. Cybern., 72, 497–509.
Mayford, M., Wang, J., Kandel, E. R., & O'Dell, T. J. (1995). CaMKII regulates the frequency-response function of hippocampal synapses for the production of both LTD and LTP. Cell, 81, 891–904.
McGlade-McCulloch, E., Yamamoto, H., Tan, S.-E., Brickey, D. A., & Soderling, T. R. (1993). Phosphorylation and regulation of glutamate receptors by calcium/calmodulin-dependent protein kinase II. Nature, 362, 640–642.
Meyer, T., Hanson, P. I., Stryer, L., & Schulman, H. (1992). Calmodulin trapping by calcium-calmodulin-dependent protein kinase. Science, 256, 1199–1202.
Miller, S. G., & Kennedy, M. B. (1986). Regulation of brain type II Ca2+/calmodulin-dependent protein kinase by autophosphorylation: A Ca2+-triggered molecular switch. Cell, 44, 861–870.
Molloy, S. S., & Kennedy, M. B. (1991). Autophosphorylation of type II Ca2+/calmodulin-dependent protein kinase in cultures of postnatal rat hippocampal slices. Proc. Natl. Acad. Sci. USA, 88, 4756–4760.
Mukherji, S., & Soderling, T. R. (1994). Regulation of Ca2+/calmodulin-dependent protein kinase II by inter- and intrasubunit-catalyzed autophosphorylations. J. Biol. Chem., 269, 13744–13747.
Mulkey, R. M., Endo, S., Shenolikar, S., & Malenka, R. C. (1994). Involvement of a calcineurin/inhibitor-1 phosphatase cascade in hippocampal long-term depression. Nature, 369, 486–488.
Mulkey, R. M., Herron, C. E., & Malenka, R. C. (1993). An essential role for protein phosphatases in hippocampal long-term depression. Science, 261, 1051–1055.
Otmakhov, N., Griffith, L. C., & Lisman, J. E. (1997). Postsynaptic inhibitors of calcium/calmodulin-dependent protein kinase type II block induction but not maintenance of pairing-induced long-term potentiation. J. Neurosci., 17(14), 5357–5365.
Pasinelli, P., Ramakers, G. M. J., Urban, I. J. A., Hens, J. J. H., Oestreicher, A. B., de Graan, P. N. E., & Gispen, W. H. (1994). Long-term potentiation and synaptic protein phosphorylation. Behavioural Brain Research, 66, 53–59.
Pettit, D. L., Perlman, S., & Malinow, R. (1994). Potentiated transmission and prevention of further LTP by increased CaMKII activity in postsynaptic hippocampal neurons. Science, 266, 1881–1885.
Putkey, J. A., & Waxham, M. N. (1996). A peptide model for calmodulin trapping by calcium/calmodulin-dependent protein kinase II. J. Biol. Chem., 271, 29619–29623.
Rich, D. P., Schworer, C. M., Colbran, R. J., & Soderling, T. R. (1990). Proteolytic activation of calcium/calmodulin-dependent protein kinase II. Molec. Cell. Neurosci., 1, 107–116.
Silva, A. J., Stevens, C. F., Tonegawa, S., & Wang, Y. (1992). Deficient hippocampal long-term potentiation in alpha-calcium-calmodulin kinase II mutant mice. Science, 257, 201–206.
Suzuki, T. (1994). Protein kinases involved in the expression of long-term potentiation. Int. J. Biochem., 6, 735–744.
Wang, J.-H., & Kelly, P. T. (1996). The balance between postsynaptic Ca2+-dependent protein kinase and phosphatase activities controlling synaptic strength. Learning and Memory, 3, 170–181.
Wilson, M. A., Bhalla, U., & Bower, J. (1990). Genesis: A system for simulating neural networks (Report). Pasadena: California Institute of Technology.
Zador, A., Koch, C., & Brown, T. H. (1990). Biophysical model of a Hebbian synapse. Proc. Natl. Acad. Sci. USA, 87, 6718–6722.

Received April 8, 1997; accepted January 6, 1998.
LETTER
Communicated by Christof Koch
Ion Channel Stochasticity May Be Critical in Determining the Reliability and Precision of Spike Timing Elad Schneidman Department of Neurobiology, Institute of Life Sciences, Institute of Computer Science, and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
Barry Freedman Idan Segev Department of Neurobiology, Institute of Life Science and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
The firing reliability and precision of an isopotential membrane patch consisting of a realistically large number of ion channels is investigated using a stochastic Hodgkin-Huxley (HH) model. In sharp contrast to the deterministic HH model, the biophysically inspired stochastic model reproduces qualitatively the different reliability and precision characteristics of spike firing in response to DC and fluctuating current input in neocortical neurons, as reported by Mainen & Sejnowski (1995). For DC inputs, spike timing is highly unreliable; the reliability and precision are significantly increased for fluctuating current input. This behavior is critically determined by the relatively small number of excitable channels that are opened near threshold for spike firing rather than by the total number of channels that exist in the membrane patch. Channel fluctuations, together with the inherent bistability in the HH equations, give rise to three additional experimentally observed phenomena: subthreshold oscillations in the membrane voltage for DC input, “spontaneous” spikes for subthreshold inputs, and “missing” spikes for suprathreshold inputs. We suggest that the noise inherent in the operation of ion channels enables neurons to act as “smart” encoders. Slowly varying, uncorrelated inputs are coded with low reliability and accuracy and, hence, the information about such inputs is encoded almost exclusively by the spike rate. On the other hand, correlated presynaptic activity produces sharp fluctuations in the input to the postsynaptic cell, which are then encoded with high reliability and accuracy. In this case, information about the input exists in the exact timing of the spikes. We conclude that channel stochasticity should be considered in realistic models of neurons.
Neural Computation 10, 1679–1703 (1998) © 1998 Massachusetts Institute of Technology
1 Introduction

Following the formulation of the Hodgkin-Huxley (HH) equations for modeling spike initiation in the squid giant axon (Hodgkin & Huxley, 1952), research into the electrical activity of single neurons followed two main paths: the attempt to discover further macroscopic equations governing different membrane currents (e.g., Yamada, Koch, & Adams, 1989), and the attempt to investigate, and mathematically describe, the behavior of the ion channels underlying these currents (see Hille, 1992; Sakmann & Neher, 1995). Although part of the same general problem, mathematically these two areas of investigation are entirely different. In the HH formulation, the ion conductances are modeled by means of deterministic differential equations, and their values range continuously from zero to a given maximum. However, because individual ion channels are discrete elements whose properties can be given only probabilistically, the electrical activity of nerve cells is most accurately described as resulting from the interaction of stochastic, discrete units. It is commonly assumed that a large collection of such discrete units practically forms a continuous deterministic system, as is the case in numerous large physical systems. Because the number of excitable channels in the axon's spike initiation zone is estimated to be large (on the order of tens of thousands of ion channels; Hille, 1992), models for spike generation in neurons typically use deterministic rather than stochastic equations (Mainen, Joerges, Huguenard, & Sejnowski, 1995; Rinzel & Ermentrout, 1989; Rapp, Yarom, & Segev, 1996).

A few theoretical studies did consider the effect of channel stochasticity, focusing on the question, "When does the stochastic model converge to the corresponding deterministic model?" The pioneering work of Fitzhugh (1965) used a kinetic (stochastic) model for the conductance change associated with the HH equations; others used stochastic HH equations to investigate the effect of various parameters (e.g., number of channels, membrane area) on the dynamics of the membrane voltage. The main message of these studies is that the stochastic system differs considerably from the deterministic HH system when a small number (a few hundred) of channels and small membrane areas are involved (Skaugen & Walløe, 1979; Strassberg & DeFelice, 1993; DeFelice & Isaac, 1992). Other aspects of this problem, such as spontaneous spiking due to channel noise and the effect of channel stochasticity on spike propagation in axons, were recently explored (Rubinstein, 1995; Chow & White, 1996; Horikawa, 1991, 1993). From a more general perspective, the effect of different kinds of noise on the firing threshold of neurons was examined by Lecar and Nossal (1971a,b). Recently, Jensen and Gartner (1997) dealt with the effect of additive white noise on the firing reliability of different neuron models (see also Longtin & Hinzer, 1996, and Braun, Huber, Dewald, Schafer, & Voigt, 1998).

A recent experimental study by Mainen and Sejnowski (1995) on the reliability of spike firing times in neocortical pyramidal cells motivated us
to readdress the question, "How good an approximation is the deterministic model, given a large number of excitable channels?" focusing on the timing of the spikes during the train. Mainen and Sejnowski showed that spike timing is highly unreliable for repeated DC current inputs, whereas fluctuating current inputs significantly improve the firing reliability and precision, down to the millisecond range (see also the recent studies by Nowak, Sanchez-Vives, & McCormick, 1997, and Tang, Bartels, & Sejnowski, 1997, and studies on the reliability of neuronal spike firing times in behaving animals, e.g., de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997, and Reich, Victor, Knight, Ozaki, & Kaplan, 1997). We were interested in exploring whether, for a realistic membrane area and number of excitable channels, the biophysically inspired noise generated by channel stochasticity plays an important role in determining the reliability of spike firing times in the spike initiation zone. In this context, it is noteworthy that channel noise was shown to be significant for the macroscopic behavior of neurons (Volgushev, Chistiakova, & Singer, 1998) and of other excitable membranes (Bezrukov & Vodyanoy, 1995). We therefore modeled membrane patches with areas of a few hundred square micrometers, comprising a total of a few tens of thousands of ion channels, receiving both DC inputs and the more biologically realistic fluctuating current inputs. We show that for a broad range of inputs, the stochastic equations generate results that are strikingly different from those obtained from the corresponding deterministic HH equations. In addition to its significant effect on the timing of spike firing, channel noise also reproduces three additional experimentally observed phenomena: voltage-dependent subthreshold membrane voltage oscillations for DC input, occasional "missing" spikes for suprathreshold inputs, and "spontaneous" spikes for subthreshold inputs. In section 4, we speculate on the functional implications of channel stochasticity for neural coding. A preliminary account of this work has appeared as an abstract (Schneidman, Freedman, & Segev, 1997).

2 Basic Model and Simulation Scheme

The membrane dynamics of the HH equations is given by

$$C_m \frac{dV}{dt} = -g_L (V - V_L) - g_K (V - V_K) - g_{Na} (V - V_{Na}) + I, \tag{2.1}$$

where V is the membrane potential; VL, VK, VNa are the reversal potentials of the leakage, potassium, and sodium currents, respectively; gL, gK, gNa are the corresponding specific ion conductances; Cm is the specific membrane capacitance; and I is the specific current injected into this membrane patch. The voltage-dependent conductances for the potassium and sodium channels are given by

$$g_K(V, t) = \bar{g}_K \, n^4; \qquad g_{Na}(V, t) = \bar{g}_{Na} \, m^3 h, \tag{2.2}$$

where the dynamics of n (and similarly for m and h) is given by

$$\frac{dn}{dt} = \alpha_n (1 - n) - \beta_n n. \tag{2.3}$$

The rate functions (the α's and β's) and the maximal conductances (the ḡ's), as formulated by Hodgkin and Huxley, are given in Table 1.

Table 1: Hodgkin-Huxley Parameters and Rate Functions Used in the Simulations.

  Cm  = 1 µF/cm²           Specific membrane capacitance
  T   = 6.3°C              Temperature
  VL  = 10.6 mV            Leakage reversal potential
  gL  = 0.3 mS/cm²         Leakage conductance
  VK  = −12 mV             Potassium reversal potential
  gK  = 36 mS/cm²          Maximal potassium conductance
  γK  = 20 pS              Potassium channel conductance
  DK  = 18 channels/µm²    Potassium ion channel density
  VNa = 115 mV             Sodium reversal potential
  gNa = 120 mS/cm²         Maximal sodium conductance
  γNa = 20 pS              Sodium channel conductance
  DNa = 60 channels/µm²    Sodium ion channel density

  αn(V) = 0.01(10 − V) / (exp[(10 − V)/10] − 1)      βn(V) = 0.125 exp[−V/80]
  αm(V) = 0.1(25 − V) / (exp[(25 − V)/10] − 1)       βm(V) = 4.0 exp[−V/18]
  αh(V) = 0.07 exp[−V/20]                            βh(V) = 1 / (exp[(30 − V)/10] + 1)

Equations 2.1 through 2.3 give an extremely successful description of the mean behavior of the voltage and currents in the squid giant axon, without the need to treat the activity of the individual ion channels that underlie this behavior. Hodgkin and Huxley suggested a physical interpretation of their equations: the term n^4 indicates that there are four separate gates and that a K+ channel is opened only when all these gates are open, while the term m^3 h indicates that there are three m-gates and one h-gate that must be open to enable the Na+ current to flow via the sodium channel. A corresponding kinetic model (Fitzhugh, 1965; Clay & DeFelice, 1983) can be formalized as a Markov model that explicitly incorporates the internal workings of the ion channels. The model, which is equivalent to the HH model in the limit of a large number of channels (Skaugen & Walløe, 1979), suggests that the K+ channel can exist in five different states and that the kinetic scheme describing the behavior of this channel is given by

$$[n_0] \;\underset{\beta_n}{\overset{4\alpha_n}{\rightleftharpoons}}\; [n_1] \;\underset{2\beta_n}{\overset{3\alpha_n}{\rightleftharpoons}}\; [n_2] \;\underset{3\beta_n}{\overset{2\alpha_n}{\rightleftharpoons}}\; [n_3] \;\underset{4\beta_n}{\overset{\alpha_n}{\rightleftharpoons}}\; [n_4], \tag{2.4}$$
where [n_i] is the number of channels with i open gates and, hence, [n_4] labels the single open state of the K+ channel. The rates αn, βn are identical to the original HH rate functions. Similarly, in this description, each Na+ channel can exist in eight different states, as in the following scheme:

$$\begin{array}{ccccccc}
[m_0 h_1] & \underset{\beta_m}{\overset{3\alpha_m}{\rightleftharpoons}} & [m_1 h_1] & \underset{2\beta_m}{\overset{2\alpha_m}{\rightleftharpoons}} & [m_2 h_1] & \underset{3\beta_m}{\overset{\alpha_m}{\rightleftharpoons}} & [m_3 h_1] \\[4pt]
\alpha_h \uparrow\downarrow \beta_h & & \alpha_h \uparrow\downarrow \beta_h & & \alpha_h \uparrow\downarrow \beta_h & & \alpha_h \uparrow\downarrow \beta_h \\[4pt]
[m_0 h_0] & \underset{\beta_m}{\overset{3\alpha_m}{\rightleftharpoons}} & [m_1 h_0] & \underset{2\beta_m}{\overset{2\alpha_m}{\rightleftharpoons}} & [m_2 h_0] & \underset{3\beta_m}{\overset{\alpha_m}{\rightleftharpoons}} & [m_3 h_0],
\end{array} \tag{2.5}$$

where [m_i h_j] refers to the number of channels within the population that are currently in the state m_i h_j, [m_3 h_1] labels the single open state of the Na+ channel, and αh, βh, αm, βm are the rate functions of the HH formalism. The potassium and sodium membrane conductances are given by

$$g_K(V, t) = \gamma_K \, [n_4]; \qquad g_{Na}(V, t) = \gamma_{Na} \, [m_3 h_1], \tag{2.6}$$
where γK and γNa are the conductances of the single potassium and sodium ion channels in their open state, respectively. By switching from the standard HH model to Fitzhugh's (1965) model, channel stochasticity is incorporated into the voltage dynamics. Instead of keeping track of each of the channels separately, we have used a more efficient scheme to track only the total populations of channels in each of their possible states (see Skaugen & Walløe, 1979, and Chow & White, 1996, for a discussion of possible simulation methods for populations of channels). Specifically, if at time t there are n_A channels in state A and n_B channels in state B, and the transfer rate of channels from state A to state B is r, then each of the channels in state A might transfer to state B between time t and t + Δt with probability p = rΔt. Hence, for each time step, we determine Δn_AB, the number of channels that move from A to B, by choosing a random number from a binomial distribution (Press, Teukolsky, Vetterling, & Flannery, 1992),

$$\mathrm{Prob}(\Delta n_{AB}) = \binom{n_A}{\Delta n_{AB}} \, p^{\Delta n_{AB}} (1 - p)^{(n_A - \Delta n_{AB})}. \tag{2.7}$$
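As a concrete illustration of this update rule, here is a minimal sketch of one binomial update step for the K+ channel population over the five-state scheme 2.4 (the eight-state Na+ scheme 2.5 is handled analogously). The rate functions and constants come from Table 1; the function name `update_k_population` and the initialization are ours.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

# HH rate functions for the n-gate (Table 1); V in mV, rates in 1/msec.
def alpha_n(V): return 0.01 * (10 - V) / (np.exp((10 - V) / 10) - 1)
def beta_n(V):  return 0.125 * np.exp(-V / 80)

def update_k_population(n_states, V, dt):
    """One binomial update of the K+ population over scheme 2.4.

    n_states[i] holds the number of channels with i open gates, so
    n_states[4] counts the open channels.  For each adjacent pair of
    states, the number of transferring channels is drawn from the
    binomial distribution of equation 2.7.  With dt = 0.01 msec the
    transfer probabilities are tiny, so overdrawing a state (which this
    simultaneous update does not guard against) is negligible."""
    an, bn = alpha_n(V), beta_n(V)
    delta = np.zeros_like(n_states)
    for i in range(4):
        fwd = rng.binomial(n_states[i], min((4 - i) * an * dt, 1.0))
        bwd = rng.binomial(n_states[i + 1], min((i + 1) * bn * dt, 1.0))
        delta[i] += bwd - fwd
        delta[i + 1] += fwd - bwd
    return n_states + delta

# Example: 3600 K+ channels (a 200 um^2 patch at 18 channels/um^2),
# started at rest (V = 0 in the HH convention of Table 1).
V, dt = 0.0, 0.01                      # mV, msec
p_open = alpha_n(V) / (alpha_n(V) + beta_n(V))
occupancy = [comb(4, i) * p_open**i * (1 - p_open)**(4 - i) for i in range(5)]
n_states = rng.multinomial(3600, occupancy)
n_states = update_k_population(n_states, V, dt)
g_K = 20e-12 * n_states[4]             # siemens; gamma_K = 20 pS per open channel
print(n_states, g_K)
```

In a full simulation, both channel populations are updated this way at every time step and the resulting conductances of equation 2.6 are fed into the Euler step for equation 2.1.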
In this study, we used the forward Euler integration method with Δt = 0.01 msec, as in Chow and White (1996). It is important to note that the spatially independent (space-clamped) HH equations were utilized; we simulate isopotential membrane patches of varying areas. Clearly, this is a severe oversimplification of the realistic case, and its implications will be addressed in section 4. Finally, in order to make the transition from the deterministic to the stochastic model, we need to know how many channels there are in the modeled membrane patch. Once we choose the conductance of the
individual channel, the number of channels can be calculated directly from the channel densities and from the maximal conductances, the ḡ's, given in the HH model (see Table 1).

3 Results

Before we proceed to the actual simulation results, we first try to estimate the effect of introducing stochasticity into the HH model. Suppose that the area of the membrane patch is 200 µm². With the parameters of Table 1, this membrane patch bears 3600 K+ channels and 12,000 Na+ channels. Considering the large number of modeled channels, one would naively estimate the number of channels fluctuating about the mean to be on the order of √N channels. Hence, for N = 3600 K+ channels, the size of the fluctuation is only 1.7% of the mean, and we would expect rather small deviations from the deterministic model. An even smaller effect would be expected for the Na+ channels. Surprisingly, this is not the case, as shown in Figure 1.

3.1 Encoding Reliability and Precision: Input Current Versus Channel Fluctuations. The response of a stochastic isopotential HH compartment to repeated presentation of suprathreshold currents is shown in Figure 1. When the same suprathreshold DC current pulse (10 µA/cm², 250 msec; Figure 1A, top frame) is repeatedly presented to the modeled membrane patch, the resulting spike trains vary considerably from trial to trial; the spike firing time is neither reliable nor accurate (see Figure 1A, bottom frame). This should be compared with the response of the corresponding deterministic model shown in the middle frames. In contrast, when the stimulus is fluctuating (simulating the current that presumably reaches the site of spike generation following the activation of many synaptic inputs impinging on the dendritic tree; Figure 1B, top frame; see the caption of Figure 1 for details), the reliability and accuracy of the spike train in the stochastic HH model are improved compared to the DC case (see Figure 1B, bottom frame).

As in the study of Mainen and Sejnowski (1995), two measures of the spike timing, the reliability and the precision, were calculated from the peristimulus time histogram (PSTH; not shown) for a wide range of input patterns (see the caption of Figure 2 for details). The reliability and precision of the spike patterns were strongly correlated with the amplitude of the fluctuations in the input current, σinput (see Figures 2A and 2B); the reliability and precision dropped as the input was filtered with larger time constants (see Figures 2C and 2D). In the stochastic HH model, both the reliability and the precision, which for most of the responses was in the range of 1 to 2 msec, are in close agreement with the results of Mainen and Sejnowski (1995). It is noteworthy that there is no clear dependence of the reliability and precision on the mean value of the injected current, as was also found experimentally by Mainen and Sejnowski (personal communication).
Figure 1: Reliability of firing patterns in a model of an isopotential Hodgkin-Huxley membrane patch in response to both DC and fluctuating current input. (A) Ten superimposed responses to repeated suprathreshold DC current pulses (10 µA/cm², 250 msec; top frame) evoked a train of regular firing in the deterministic HH model (middle frame) and jitter in the firing of the stochastic HH model (bottom frame). (B) The same patch was again stimulated 10 times repeatedly, this time with a fluctuating stimulus (low-pass gaussian white noise with a mean I of 10 µA/cm² and a standard deviation σinput of 7 µA/cm², convolved with an alpha function with a time constant τinput = 1 msec; top frame; see Mainen & Sejnowski, 1995). As can be clearly seen, the jitter in spike timing in the stochastic model is significantly smaller in B than in A (i.e., increased accuracy for the fluctuating current input). The patch area used was 200 µm², with 3,600 K+ channels and 12,000 Na+ channels. (Compare to Figure 1 in Mainen & Sejnowski, 1995.)
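The fluctuating stimulus described in this caption can be approximated as follows. The caption specifies gaussian white noise convolved with an alpha function; the final rescaling of the filtered trace back to the requested mean and standard deviation is our assumption, since the exact normalization convention is not given, and the function name `fluctuating_current` is ours.

```python
import numpy as np

def fluctuating_current(duration_ms, dt_ms, mean, sigma, tau_ms, seed=0):
    """Gaussian white noise convolved with an alpha function, rescaled so
    the output has the requested mean and standard deviation (assumed
    normalization).  Returns a current trace in the caller's units."""
    rng = np.random.default_rng(seed)
    n = int(duration_ms / dt_ms)
    noise = rng.standard_normal(n)
    t = np.arange(0.0, 10 * tau_ms, dt_ms)
    kernel = (t / tau_ms) * np.exp(1.0 - t / tau_ms)   # alpha function, peak 1
    filtered = np.convolve(noise, kernel, mode="same")
    filtered = (filtered - filtered.mean()) / filtered.std()
    return mean + sigma * filtered

# The stimulus of panel B: 250 msec, dt = 0.01 msec, I = 10, sigma = 7 uA/cm^2.
current = fluctuating_current(250.0, 0.01, mean=10.0, sigma=7.0, tau_ms=1.0)
```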
Figure 2: Dependence of reliability and precision on stimulus parameters. The reliability and accuracy of the spike train were calculated in a manner similar to that of Mainen and Sejnowski (1995). The peristimulus time histogram (PSTH) of 20 successive presentations of a particular stimulus was smoothed using an adaptive filter, yielding an estimate of the instantaneous firing rate. Significant elevations in the instantaneous firing rate ("events") were selected from the PSTH using a threshold of two times the mean firing rate over a given block of responses. The reliability of the response to a particular stimulus is defined as the average fraction of spikes that occur within the events in that stimulus's PSTH. The temporal precision of the response is defined as the average standard deviation of spike timing within the events in that stimulus's PSTH. (A) Estimates of the reliability of the spike train in a 200 µm² stochastic HH membrane patch, for stimuli with various fluctuation amplitudes, σinput. Each curve is for a different mean value of the stimulus (I = 7–20 µA/cm², τinput = 1 msec). (B) The temporal precision of the same responses as in A. (C) The reliability for stimuli filtered with different time constants (τinput = 1–10 msec). Each curve is for a different mean value of the stimulus and a given σinput (I = 7–20 µA/cm², σinput = 3–12 µA/cm²). (D) The temporal precision of the same responses as in C.
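One plausible implementation of these two measures is sketched below. A fixed gaussian smoothing kernel stands in for the adaptive filter mentioned in the caption, and the event bookkeeping is our reading of the procedure, so details may differ from the original analysis; the function name `reliability_precision` is ours.

```python
import numpy as np

def reliability_precision(spike_trains, duration_ms, bin_ms=1.0, smooth_ms=2.0):
    """Event-based reliability and precision from a set of spike trains.

    spike_trains: list of 1-D arrays of spike times (msec), one per trial.
    Events are contiguous bins where the smoothed PSTH exceeds twice its
    mean, following the caption; the gaussian smoother replaces the
    adaptive filter of the original study (an assumption)."""
    edges = np.arange(0.0, duration_ms + bin_ms, bin_ms)
    psth = sum(np.histogram(tr, edges)[0] for tr in spike_trains).astype(float)
    taps = np.arange(-10, 11) * bin_ms
    kernel = np.exp(-0.5 * (taps / smooth_ms) ** 2)
    rate = np.convolve(psth, kernel / kernel.sum(), mode="same")
    above = np.flatnonzero(rate > 2.0 * rate.mean())
    if above.size == 0:
        return 0.0, float("nan")
    events = np.split(above, np.flatnonzero(np.diff(above) > 1) + 1)
    spikes = np.concatenate(spike_trains)
    n_in_events, sds = 0, []
    for ev in events:
        t0, t1 = edges[ev[0]], edges[ev[-1] + 1]
        sel = spikes[(spikes >= t0) & (spikes < t1)]
        n_in_events += sel.size
        if sel.size > 1:
            sds.append(sel.std())   # spike-time jitter within one event
    reliability = n_in_events / max(spikes.size, 1)
    precision = float(np.mean(sds)) if sds else float("nan")
    return reliability, precision
```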
Hence, with a realistically large number of channels, when incorporating their unavoidable stochasticity, one obtains an effect that is qualitatively similar to the behavior of real neurons and is significant from both biophysical and computational viewpoints. Clearly, the effect of stochasticity depends on the number of ion channels and the membrane area. It increases
when decreasing the number of channels and decreases when increasing the membrane area. Still, the effect of channel stochasticity was significant even when the membrane area was increased by a factor of five (to 1000 µm²), as well as when, for a given membrane patch, the channel density was increased by the same factor (not shown; see section 4).

But why is the result of the stochastic model so different from that obtained from the corresponding deterministic HH model? The apparent error in the previous estimation (see the beginning of section 3) of the size of the effect of channel stochasticity lies in failing to realize that the relevant number of channels is not the total number of channels in the membrane patch, but rather the number of channels that are open near the threshold for spike firing. If this number is relatively small, the size of the fluctuations in the number of open channels in this regime is not negligible. Mathematically, the correct estimation of the size of the fluctuation should rely on binomial statistics. For a total population of N channels and a probability p of a channel being open, the size of the fluctuations is $\sqrt{Np(1-p)}$, and the fluctuation relative to the mean, Np, is $\sqrt{(1-p)/(Np)}$. If p is small, as is the case near the threshold for spike firing, the relative size of the fluctuations is rather large. In this case, the inherent stochasticity of the channels is expected to have a significant effect on the voltage dynamics and, specifically, on the time of threshold crossing. When this is the case, the firing behavior of the stochastic model is expected to be considerably different from that of the corresponding deterministic model.

Figure 3 shows that this is indeed the situation. As in Figure 1A, 10 repeated 10 µA/cm² DC current inputs were applied, this time to a 600 µm² membrane patch consisting of a total of 10,800 K+ channels and 36,000 Na+ channels. The voltage response is shown in Figure 3B, whereas the numbers of open K+ and Na+ channels near the threshold for spike firing are shown in panels C and D, respectively. Surprisingly few ion channels (approximately 300 K+ channels and 50 Na+ channels) are open in this voltage regime. With such small numbers, channel fluctuations become significant and critically determine the exact time at which a sufficient number of additional Na+ channels is recruited to initiate a regenerative response. When injecting the same DC current repeatedly, the fluctuations vary significantly from one trial to the other, and, consequently, the time of spike firing for this input is unreliable.

In principle, this channel-noise-induced unreliability can be mostly overridden by injecting a current that fluctuates significantly. If the input fluctuations are sufficiently large, the voltage dynamics will be dominated by the transients in the current input rather than by the channel noise. This effect is demonstrated in Figure 4, where the response to a fluctuating input in a specific time window is shown. The current input is depicted in panel A, the voltage response is shown in panel B, and the number of open K+ and Na+ channels is shown in panels C and D, respectively.
Figure 3: Channel fluctuations ruin the reliability of spike timing in the case of DC current input. (A) A 10 µA/cm² DC current injected into a 600 µm² stochastic HH model (10,800 K+ channels and 36,000 Na+ channels) results in dispersed spike timings across repeated simulations (10 superimposed voltage traces) in B. (C, D) The number of open K+ and Na+ channels, respectively, corresponding to the voltage traces presented in B.
In contrast to the DC-input case (see Figure 3), here the transients in the input current partially overcome the channel fluctuations and force 9 out of the 10 spikes to occur within an approximately 1 msec time window (see Figure 4B). The reason for this relatively high reliability of spike timing becomes clear by observing panels C and D. The accuracy is determined by two parameters. The first is the variability in the time at which the number of open K+ channels reaches a sufficiently small value (note that a large outward K+ current impedes the initiation of the spike). This variability should be small in order to obtain high accuracy. Indeed, in 9 of 10 repetitions, this condition is satisfied (see Figure 4C). The second parameter is the rate of buildup of the Na+ channel population toward threshold. For accurate spike timing, this buildup, which is determined by the amplitude and rate of the depolarizing input current, should be sufficiently large to overcome the channel fluctuations (see Figure 4D). For a given voltage, the size of the channel fluctuation is as large in the fluctuating-input case as in the DC-input case, but in the former these channel fluctuations are "lost in the crowd."

To examine the relative contribution of the K+ and Na+ channels to the reliability and precision, we simulated a hybrid system in which one of the channel populations was stochastic and the other was deterministic. Both channel types contribute to the complex reliability behavior of the system. However, as expected from the larger number of K+ channels that are open near the threshold for spike firing (see Figures 3C and 3D and Figures 4C and 4D), as well as from their slower kinetics, the noise introduced by the K+ channels is more dominant in determining the reliability and accuracy of this system (not shown).

3.2 Subthreshold Oscillations, "Spontaneous" Spikes, and "Missing" Spikes. Along with the effect of channel fluctuations on spike timing, incorporating channel stochasticity in the HH model gives rise to three additional phenomena that were observed experimentally: (1) considerable subthreshold oscillations in the membrane voltage for DC inputs, (2) "spontaneous" spikes for subthreshold inputs, and (3) "missing" spikes for suprathreshold inputs. These phenomena cannot be reproduced in the deterministic HH model. In the stochastic model, oscillations in the membrane voltage are already observed for zero current input (see Figure 5A). Occasionally these oscillations are sufficiently large to generate "spontaneous" spikes, which would not have occurred in the corresponding deterministic model (the current threshold for spike firing in the deterministic model is I = 7 µA/cm²). An example of "spontaneous" spikes in the case of I = 4 µA/cm² is shown in Figure 5B; a detailed analysis of spontaneous spiking in the stochastic HH model for zero current input was recently performed by Chow and White (1996). In contrast to the deterministic model, where regular repetitive firing occurs for suprathreshold currents (see Figure 1), "missing" spikes are observed in the stochastic model (see Figures 5C and 5D).
It is important to note that both the amplitude and the frequency of the membrane oscillations observed in the stochastic model are voltage dependent (e.g., compare A to B in Figure 5). This is also the case for the membrane voltage oscillations in neocortical neurons reported by Gutfreund, Yarom, and Segev (1995) and Klink and Alonso (1993), as well as in other neuron types, for example, Hutcheon, Miura, Yarom, and Puil (1994) and Lampl and Yarom (1997). We suggest that in addition to the deterministic macroscopic mechanisms that were proposed to explain the generation of the subthreshold oscillations, the stochastic nature (and the limited number) of the ion channels may have a dominant effect on the nature of these oscillations (see also Longtin & Hinzer, 1996, and Braun et al., 1998). Channel stochasticity has such a dramatic effect on the voltage dynamics because it exploits a peculiar, and largely neglected, aspect of the deterministic HH equations: their two stable states for suprathreshold current input (see the discussion of the bistability in the HH equations in Cooley, Dodge, and Cohen, 1965, and Guttman, Lewis, and Rinzel, 1980). For a DC input, one state is the well-known repetitive firing behavior (the light trace in Figure 6A), whereas the other state is a nonfiring behavior of early damped voltage oscillations that converges to a steady voltage (see Figure 6A, dark trace). In both cases, a 7 µA/cm2 DC current was injected, and the marked difference between the two curves is the result of a minute perturbation of the initial conditions (see the caption of Figure 6 for details). These two different behaviors can be better appreciated in the phase-plane diagrams in panels B and C of Figure 6. Translating the ion conductances to the corresponding
Figure 4: Fluctuating input current partially overrides the channel stochasticity and increases the reliability of spike timing. A small time window of the system behavior for the fluctuating-input case is presented. (A) The input current, with a mean value of 10 µA/cm2 (horizontal dotted line), σinput = 5 µA/cm2, and τinput = 1 msec, injected into a 600 µm2 stochastic HH membrane patch. (B) Ten superimposed voltage-trace responses to repeated injection of the fluctuating current in A. In 9 of the 10 cases, a spike was fired within an approximately 1 msec time window. (C, D) The numbers of open Na+ and K+ channels, respectively, for the voltage traces presented in B, reflecting how the fluctuations of both the Na+ and K+ channels are overridden by the fluctuations in the input current. When a sufficient number of K+ channels close (C), the depolarizing transient in the input current, starting at t = 143 msec, results in the nearly synchronous buildup of open Na+ channels at t = 146 msec (D). The result is spike firing at t = 147.3–148.4 msec. In the one case where an insufficient number of K+ channels had closed in time, the spike is initiated somewhat later, by the next fluctuation in the input current.
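The fluctuating stimulus used here can be generated as an Ornstein-Uhlenbeck process with the parameters quoted in the caption (mean 10 µA/cm2, σinput = 5 µA/cm2, τinput = 1 msec). The following minimal Python sketch is ours, not the authors' code; the function name ou_current is hypothetical, and the exact discrete-time update preserves the stated mean, standard deviation, and correlation time for any step size.

import numpy as np

def ou_current(mean=10.0, sigma=5.0, tau=1.0, dt=0.01, T=500.0, seed=0):
    # Ornstein-Uhlenbeck input current (uA/cm^2), sampled every dt msec.
    # The exact update I <- mean + (I - mean)*exp(-dt/tau) + amp*N(0,1)
    # keeps the process stationary with the requested statistics.
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    I = np.empty(n)
    I[0] = mean
    decay = np.exp(-dt / tau)
    amp = sigma * np.sqrt(1.0 - decay ** 2)
    for t in range(1, n):
        I[t] = mean + (I[t - 1] - mean) * decay + amp * rng.standard_normal()
    return I

Injecting the same frozen trace into repeated stochastic simulations, as in this figure, then isolates the contribution of the channel noise from that of the stimulus.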
Figure 5: The response of the stochastic model to injected DC input currents. DC currents of different amplitudes were injected into a stochastic HH model of an isopotential membrane patch of area 600 µm2 (10,800 K+ channels and 36,000 Na+ channels). (A, B) Membrane voltage oscillations are the dominant effect of the stochastic nature of the ion channels, with occasional spontaneous spiking. (C, D) Suprathreshold DC input currents result in irregular spiking, occasional “missing” spikes, and membrane voltage oscillations. This is not expected in the corresponding deterministic HH model, where the threshold is 7 µA/cm2: below this value, a smooth subthreshold voltage response is observed in the deterministic model (not shown); above it, regular firing is obtained (not shown).
number of open ion channels, these panels show the very different paths in phase space taken by the firing (light curve) and the nonfiring (dark curve) trajectories. The bottom panel shows the convergence of the nonfiring behavior to a fixed point. It also shows that the distance, in terms of the number of open channels, between the continuous firing cycle and the nonfiring voltage behavior is very small. Although this distance is small, the deterministic nature of the HH equations implies that for a DC input, the system remains in one stable state or the other. However, the introduction of channel noise could, in principle, flip the system between these two states.
Figures 6D–F show that in the stochastic model, channel fluctuations occasionally bridge the small distance in phase space between the two stable states. The stochastic opening (or closing) of a few extra K+ and/or Na+ channels pushes the system spontaneously from the continuous-firing stable state (light line) to the nonfiring stable state (dark line), where it stays for a while, and vice versa (at t = 772 msec in Figure 6D). This spontaneous transition between the two states is the cause of the “missing” spikes and the subthreshold membrane voltage oscillations, as well as of the occasional “spontaneous” spikes (see Figure 5B). Panels E and F depict the corresponding phase-plane behavior of the system. They clearly show that fluctuations of only
a few channels are responsible for the transition between these two stable states. We conclude that the nonfiring stable state of the deterministic HH model becomes a key player in the stochastic HH model. Experimentally, the coexistence of the two stable solutions in the squid giant axon, as well as in the corresponding HH model, was demonstrated by Guttman et al. (1980) (see also Cooley et al., 1965). Concerning the subthreshold membrane oscillations, the role of channel fluctuations is twofold. First, they drive the system from the firing state into the basin of attraction of the nonfiring stable state. Second, the fluctuations prevent the system from converging to the fixed point of the nonfiring stable state of the corresponding deterministic model. As a result, the system cycles around this fixed point, and the subthreshold membrane voltage oscillations thus emerge. The frequency of the subthreshold oscillations is set by the period of these cycles. Based on this observation, we can predict analytically the power spectrum of the oscillations with a fair degree of accuracy, and we can also quantify the rate of transfer between the two states in the case of DC input (unpublished observations).
Figure 6: Channel fluctuations cause flipping between the firing and nonfiring stable states in the stochastic HH model. (A–C) The two stable states of the deterministic HH model. In A, two traces of the membrane potential are shown for a 600 µm2 membrane patch injected with a 7 µA/cm2 DC current. The difference between the light and the dark traces results from a minute difference in the initial conditions. In the continuous-firing case (light trace), the initial values are V = 4.21 mV, gK = 4121 pS (corresponding to 206.05 open K+ channels), and gNa = 195.8 pS (corresponding to 9.79 open Na+ channels), assuming a single-channel conductance of 20 pS (Table 1). In the nonfiring case (dark trace), the initial values are V = 4.23 mV, gK = 4399.4 pS (corresponding to 219.97 open K+ channels), and gNa = 197.4 pS (corresponding to 9.87 open Na+ channels). (B) The 3D phase plane of these two behaviors of the system. The light curve is for the spiking behavior, and the dark curve is for the nonfiring stable state. A magnification of B is presented in C, reflecting the small basin of attraction of the nonfiring state and the short distance, in terms of the number of open channels, between the two states. (D–F) The corresponding behavior of A–C in the stochastic model. Channel fluctuations in the stochastic model spontaneously flip the system between the firing and the nonfiring states. (D) A typical voltage trace of the stochastic HH patch. Light and dark lines are used to emphasize the different segments of the trace. The corresponding phase-plane traces are shown in E and F. As can be seen in F, the system flips from the firing stable state (light trace) to the nonfiring stable state (dark trace), where it stays for a few cycles. Hence, the subthreshold oscillations in the top trace translate to small loops in the phase plane. The system then flips back to the firing stable state (light trace).
Questions regarding the effect of various parameters on the subthreshold oscillations, such as the area of the membrane patch and the properties of the channels, will be addressed briefly in section 4; a more complete study is yet to be performed.
4 Discussion

4.1 Stochastic Versus Deterministic HH Model. We have shown that with a realistically large number of ion channels, the inherent noise in channel operation critically determines the timing and dynamics of spike firing in the stochastic HH model. The reason for this strong effect of channel stochasticity is that near the threshold for spike firing, only a very small percentage of the Na+ channels and a small percentage of the K+ channels are open (the activation variables, m and n, are small near threshold). Consequently, the variability in membrane voltage near the threshold for excitation is large, and this is reflected in the variability of spike firing time. We conclude that for a wide range of input parameters, the stochastic model captures important features of real neurons; these features are neglected in the deterministic model. In agreement with the experimental results of Mainen and Sejnowski (1995) and Nowak et al. (1997), the reliability and precision of spike timing in the stochastic HH model are very sensitive to the properties of the current input. The reliability and precision of spike timing are high for strongly fluctuating inputs and decrease for smoother (e.g., DC) inputs. This study shows that this effect can be explained in terms of the relation between the instantaneous shape and amplitude of the input signal and the amplitude of the channel fluctuations. Strongly fluctuating inputs “override” the inherent channel fluctuations, and the spike timing is dictated primarily by the input rather than by channel stochasticity. In contrast, channel fluctuations become relatively more significant for smooth inputs, and spike firing time becomes less reliable. In addition to its effect on spike timing, channel stochasticity produces three additional phenomena that do not occur in the deterministic HH model but were all observed experimentally (e.g., see Guttman et al., 1980). Membrane voltage oscillations are seen for subthreshold current inputs, and they also occur between spikes for suprathreshold inputs. A detailed study of the amplitude and frequency of these voltage-dependent oscillations will be performed elsewhere. “Spontaneous” spikes (for subthreshold inputs) and “missing” spikes (for suprathreshold inputs) were also observed in this model. These three phenomena result from the “unmasking” of the nonfiring stable state of the HH model by the channel fluctuations. This state, which was largely neglected in the framework of the deterministic HH model, becomes a key player in determining the voltage dynamics in the stochastic model.
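The stochastic HH patch used throughout this study can be sketched along the lines of Skaugen and Walløe (1979) and Strassberg and DeFelice (1993): track how many channels occupy each state of the gating Markov scheme and draw binomial transition counts at every time step. The Python sketch below is our own minimal rendition, not the authors' code; it assumes the standard squid-axon rate functions (resting potential shifted to 0 mV), γ = 20 pS (Table 1), and the channel counts quoted in the figure captions (10,800 K+ and 36,000 Na+ channels per 600 µm2). The per-step binomial draws, with a crude clipping guard, only approximate the exact Markov dynamics for small dt.

import numpy as np

# Standard squid-axon HH rate constants (1/ms); V in mV, resting V = 0 mV.
def an(V): x = 10.0 - V; return 0.1 if abs(x) < 1e-7 else 0.01 * x / (np.exp(x / 10.0) - 1.0)
def bn(V): return 0.125 * np.exp(-V / 80.0)
def am(V): x = 25.0 - V; return 1.0 if abs(x) < 1e-7 else 0.1 * x / (np.exp(x / 10.0) - 1.0)
def bm(V): return 4.0 * np.exp(-V / 18.0)
def ah(V): return 0.07 * np.exp(-V / 20.0)
def bh(V): return 1.0 / (np.exp((30.0 - V) / 10.0) + 1.0)

def stochastic_hh(I_inj, dt=0.01, area_um2=600.0, gamma_pS=20.0, seed=0):
    # I_inj: one current value (uA/cm^2) per time step of length dt (msec).
    # Occupancy of every gating state is tracked: 5 states for K+ (0..4 open
    # n-gates) and 4 x 2 states for Na+ (0..3 open m-gates, h closed/open).
    # Binomial draws approximate each transition; dt must be small, and
    # occupancies are clipped at zero as a crude guard.
    rng = np.random.default_rng(seed)
    NK, NNa = int(18 * area_um2), int(60 * area_um2)   # 10,800 and 36,000
    k = np.zeros(5, int); k[0] = NK
    na = np.zeros((4, 2), int); na[0, 1] = NNa
    g1 = gamma_pS * 1e-12 / (area_um2 * 1e-8) * 1e3    # one channel, mS/cm^2
    V, Vtrace = 0.0, []
    for I in I_inj:
        dk = np.zeros(5, int)
        for i in range(4):                             # n-gate transitions
            f = rng.binomial(k[i], min(1.0, (4 - i) * an(V) * dt))
            b = rng.binomial(k[i + 1], min(1.0, (i + 1) * bn(V) * dt))
            dk[i] += b - f; dk[i + 1] += f - b
        dna = np.zeros((4, 2), int)
        for j in range(2):                             # m-gate transitions
            for i in range(3):
                f = rng.binomial(na[i, j], min(1.0, (3 - i) * am(V) * dt))
                b = rng.binomial(na[i + 1, j], min(1.0, (i + 1) * bm(V) * dt))
                dna[i, j] += b - f; dna[i + 1, j] += f - b
        for i in range(4):                             # h-gate transitions
            op = rng.binomial(na[i, 0], min(1.0, ah(V) * dt))
            cl = rng.binomial(na[i, 1], min(1.0, bh(V) * dt))
            dna[i, 0] += cl - op; dna[i, 1] += op - cl
        k = np.maximum(k + dk, 0); na = np.maximum(na + dna, 0)
        gK, gNa = g1 * k[4], g1 * na[3, 1]             # open channels only
        # C = 1 uF/cm^2; E_Na = 115 mV, E_K = -12 mV, E_leak = 10.6 mV
        V += dt * (I - gNa * (V - 115.0) - gK * (V + 12.0) - 0.3 * (V - 10.6))
        Vtrace.append(V)
    return np.array(Vtrace)

# Ten repetitions of the Figure 3 protocol: identical 10 uA/cm^2 DC input,
# independent channel noise; spike times disperse across the trials.
traces = [stochastic_hh(np.full(30_000, 10.0), seed=s) for s in range(10)]

This pure-Python loop is slow but transparent; exact (Gillespie-style) channel algorithms or vectorized implementations would be the next step.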
4.2 Toward a More Realistic Stochastic Model of Neurons. This study gives only a qualitative explanation for the reliability behavior of neurons, in particular of neocortical pyramidal cells. First, the spatial domain of neurons was completely neglected. It is especially important to consider the filtering effect and the impedance load imposed by the soma and dendrites, as well as by the axon, on the excitable channels at the spike initiation zone. A multicompartmental model (possibly composed of an axon with several highly excitable nodes of Ranvier, separated by passive internodes, and a few dendritic compartments) should be used to understand better the effect of channel stochasticity on the reliability and accuracy of spiking in neurons. In such a model, the input should impinge onto the dendritic compartments and be simulated by a barrage of synaptic conductance changes (rather than by current inputs). In this context, it is important to emphasize that in many neuron types, the dendritic membrane is endowed with excitable channels at low density, and this may imply a large variability (fluctuations) already in the receptive region of the neuron. In contrast, we expect that in the axon, most of the variability will arise in the compartment where the spikes are initiated and that, downstream along the axon, spike timing will be encoded very reliably and with high precision (see footnote 1). Still, complete failure may occur in axons at regions with a low safety factor for propagation (e.g., see Grossman, Parnas, & Spira, 1979). The other severe simplification in this study is that it relies exclusively on the standard HH equations. Although important as a reference, one ought to inquire to what extent the results can be generalized to other excitable systems. After all, most neurons contain a large variety of ion channel types (e.g., the A current, persistent and slow-inactivating Na+ currents, the low-threshold Ca2+ current), each with different density and kinetics. Moreover, based on direct measurements of single ion channels, different kinetic models can be assigned to each of the channel types (see Patlak, 1991; Vandenberg & Bezanilla, 1991; Marom, Salman, Lyakhov, & Braun, 1996), although their main features are usually similar. Indeed, a more realistic model of the spike-generating mechanism that takes into account various channel types, kinetic schemes, and voltage dependence, as well as long-term (memory) effects, should be explored (see Vandenberg & Bezanilla, 1991; Sigworth, 1993;
Toib, Lyakhov, & Marom, 1998; Abbott, Turrigiano, LeMasson, & Marder, 1996; Fleidervish, Friedman, & Gutnick, 1996).

Footnote 1: One might wonder whether the uncertainty engendered by the bistability in the HH model would make spike propagation along the axon impossible. If, at each site in the axon, there is some probability that the system will go into the stable nonfiring state, the spike may fail somewhere along the axon. Also, significant noise in the axon (see Rubinstein, 1995) may destroy temporal correlations between the output synapses. However, except for the compartment where the spike may, or may not, be initiated as a result of the depolarizing synaptic current, all other axonal compartments downstream receive a relatively sharp and large current input from the spike in the previous node. For such inputs, channel stochasticity will be masked, and, consequently, the axon is expected to act as a highly reliable delay line, adding only relatively small jitter (see Horikawa, 1991, 1993; Abeles & Lass, 1975; Lass & Abeles, 1975).

Still, we can state with confidence that the surprisingly large effect of channel stochasticity is likely to persist in other models. The important parameter that determines the size of the fluctuations near the threshold for spike firing is the number of channels that are open in this voltage regime. To the best of our knowledge, in all existing models of excitability, only a small percentage of the total number of excitable channels is open near threshold. Consequently, a large variability in spike firing time is also expected in these models. Clearly, the exact nature of spike firing reliability will depend on the channel properties. What about the subthreshold membrane oscillations, spontaneous spikes, and missing spikes? The nature of the bistability of the HH model, which is set by its inverted Hopf bifurcation, is what “enables” the channel noise to switch the system spontaneously between its two stable states. This is the source of the subthreshold membrane oscillations and of the “missing” and “spontaneous” spikes. These phenomena may not occur in models with different bifurcation structures (e.g., those with a saddle-node bifurcation), and other phenomena may then arise (see Rinzel & Ermentrout, 1989; White, Budde, & Kay, 1995; Longtin & Hinzer, 1996). Indeed, our initial simulations of the spike initiation zone of a cortical pyramidal neuron, using parameters taken from Mainen et al. (1995) and Rapp et al. (1996), better replicate the responses of cortical pyramidal cells to the different stimuli given in Mainen and Sejnowski (1995) and Nowak et al. (1997).

4.3 Sources of Noise in Neurons. In this study we considered only the effect of one source of noise in neurons: the intrinsic stochastic nature of the ion channels. A variety of other sources of noise exists, such as spontaneous synaptic release and variability in the number of transmitter molecules and the number of available receptors. Other possible sources of neuronal noise are changes in intracellular and extracellular ion concentrations and in the concentration of neuromodulators, as well as in the activity of ion pumps. Ephaptic interaction (the electric field effect) of one neuron on other neurons is yet another possible source of noise. Our study shows that intrinsic channel stochasticity should be considered a key source of the variability of action potential timing. Clearly, the other possible sources of noise should also be considered in order to quantify the relative contribution of each of these sources or their possible synergistic effects. Experimental studies should be performed to clarify this issue by using different manipulations: for example, blocking synaptic receptors (as did Mainen & Sejnowski, 1995), using a dynamic clamp to “replace” the noisy channel conductance with a deterministic conductance (Sharp, O'Neil, Abbott, & Marder, 1993), blocking specific ion channels and observing the resultant changes in membrane noise under voltage-clamp conditions (see initial results in this direction by Volgushev et al., 1998),
and blocking ion pumps. From a theoretical viewpoint, as shown by Jensen and Gartner (1997), a simple additive noise could qualitatively reproduce the differences in the reliability and accuracy of spike timing in response to DC versus fluctuating inputs found experimentally (a toy comparison along these lines is sketched below). However, because neuronal noise is both voltage and activity dependent, it is clear that simple additive noise is only a first-order approximation to the real case. The biophysically inspired model of channel noise is inherently voltage and activity dependent. The difference between models with simple additive noise and models with more realistic noise, in terms of the fine temporal structure of spike firing, requires further exploration.

4.4 Implication for Neural Coding. The reliability and accuracy of neuronal firing, in both the stochastic HH model and in cortical neurons (Mainen & Sejnowski, 1995; Bair & Koch, 1996), as well as in other neurons (de Ruyter van Steveninck et al., 1997), range between an unreliable response to DC inputs and a very reliable response to large-amplitude, highly fluctuating inputs. The actual current that reaches the site of spike initiation in neurons varies between these two extreme input patterns; its exact nature is determined by the degree of correlation among the synaptic inputs that impinge onto the neuron. Highly correlated synaptic activity gives rise to sharp current transients, whereas uncorrelated synaptic activity gives rise to smoother current traces. Our modeling results suggest that the neuron's most basic machinery, the ion channels, enables it to act as a “smart” encoder. Slowly varying inputs are coded with low reliability and accuracy, and, hence, the information about such inputs is encoded almost exclusively by the spike rate. Trying to decode information about such an input using the exact temporal structure of the spike train would result in decoding the internal noise of the cell rather than the input. On the other hand, correlated inputs are encoded with higher reliability and accuracy, giving more of a “temporal” code; that is, information about the input exists in the exact timing of the spikes. It is clear that in such a system, correlated activity of a population of neurons is likely to propagate within the network with high temporal precision, as suggested by Abeles (1991) in his “synfire” model and by the recent work of Marsalek, Koch, and Maunsell (1997). In contrast, weakly correlated activity would propagate in a temporally imprecise manner and is more likely to decay within the network. The fact that the intrinsic noise of neurons may serve as a mechanism to destroy the propagation of random correlations and, at the same time, allow accurate chains of activity to persist within the network has no bearing on the question of whether such chains do exist. Still, it is tempting to hypothesize that neurons use their intrinsic channel stochasticity to exploit a temporal code in addition to a rate code (see Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; de Ruyter van Steveninck et al., 1997; Hopfield, 1995; Theunissen & Miller, 1995; Abbott, 1994; Softky, 1995; Heller, Hertz, Kjaer, & Richmond, 1995).
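Returning to the additive-noise remark of section 4.3 above: even a leaky integrate-and-fire neuron with simple additive gaussian noise reproduces, qualitatively, the DC-versus-fluctuating difference in spike-time reliability. The toy Python sketch below is ours, not the authors' simulation; every function name and parameter value in it is illustrative.

import numpy as np

def lif_trials(signal, n_trials=20, noise_std=0.5, tau=10.0, vth=1.0, dt=0.1, seed=0):
    # One leaky integrate-and-fire neuron, repeated trials: the same frozen
    # signal plus fresh additive gaussian noise on every trial.
    rng = np.random.default_rng(seed)
    all_spikes = []
    for _ in range(n_trials):
        v, times = 0.0, []
        for step, s in enumerate(signal):
            v += dt * (-v + s) / tau + noise_std * np.sqrt(dt) / tau * rng.standard_normal()
            if v >= vth:
                times.append(step * dt)
                v = 0.0
        all_spikes.append(times)
    return all_spikes

n = 5000                                                    # 500 msec at dt = 0.1
rng = np.random.default_rng(1)
dc = np.full(n, 1.5)                                        # smooth (DC-like) input
fluct = 1.5 + np.repeat(rng.normal(0.0, 3.0, n // 10), 10)  # frozen fluctuating input
for name, sig in (("DC", dc), ("fluctuating", fluct)):
    first = [t[0] for t in lif_trials(sig) if t]
    print(name, "first-spike jitter (msec):", float(np.std(first)))

The frozen fluctuating input locks the first spike to its large depolarizing transients, so the across-trial jitter is much smaller than in the DC case, in line with the qualitative argument above; the sketch makes no claim about HH-specific effects.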
In addition to its significance for information coding, the relatively small size of the channel pool in the spike initiation zone has further computational implications. One clear advantage of such a limited channel pool was demonstrated in the work of Toib et al. (1998) (see also Marom, 1998), which shows that channel inactivation and reactivation kinetics have a significant, long-lasting (minutes) effect on the “availability” of channels, providing the neuron with an effective memory. Thus, the output spike train depends on both the properties of the instantaneous synaptic input and the history of presynaptic and postsynaptic activity. This memory is embedded in the distribution of channel states at the spike initiation site. The nature and resolution of this memory depend on the size of the channel pool and on the kinetics and number of states of the channels. We hypothesize that the number of channels in the spike initiation zone may be optimized in some sense to give the reliability and accuracy discussed above, together with a short-term memory of the neuron's activity. In this context, it is interesting to mention the work of Marder, Abbott, Turrigiano, Liu, and Golowasch (1996) and Abbott et al. (1996), which demonstrates activity-dependent long-term changes in the properties of intrinsic membrane currents. Another important effect of stochasticity in a limited pool of channels is the subthreshold and suprathreshold membrane oscillations. Such oscillations were observed in neocortical neurons (see Gutfreund et al., 1995) as well as in other neuron types (Hutcheon et al., 1994; Lampl & Yarom, 1997) and were suggested to serve as the underlying clock for neuronal firing and even as a synchronizing and binding mechanism for neuronal activity (Hopfield, 1995; Volgushev et al., 1998). In the stochastic HH model, these voltage oscillations result from the channel noise; in other systems, other mechanisms may be responsible for them (e.g., Gutfreund et al., 1995; White et al., 1995). Finally, we suggest that channel stochasticity is likely to be a key player in setting neurons' firing patterns, and thus it should be incorporated in models that explore the firing variability and spike timing of cortical neurons. Channel stochasticity is expected to be especially dominant in models that assume balanced excitation and inhibition (see Shadlen & Newsome, 1994, 1995; Softky, 1995; Bell, Mainen, Tsodyks, & Sejnowski, 1995), in which the effective resting membrane voltage of the cell is near threshold. Bell et al. (1995) and Troyer and Miller (1997) suggest complex repolarization and refractoriness schemes as another source of the high firing variability in neocortical neurons, which in many ways coincide with the effect of channel fluctuations. Whatever the correct model might be, the main message of this study is that the noise inherent in the activity of ion channels must be considered if one wishes to understand what determines the firing patterns of neurons and, consequently, the nature of the neural code.
Acknowledgments

We are grateful to our friends and colleagues, Miki London, Yosi Yarom, John Rinzel, David Hansel, Shimon Marom, Zach Mainen, Henry Markram, Tali Tishby, and Moshe Abeles, for their insightful input during various stages of this work. This study was supported by grants from the ONR, the Human Frontiers, and the Israel Academy of Science.

References

Abbott, L. (1994). Decoding neuronal firing and modeling neural networks. Quarter. Rev. Biophys., 27, 291–331.
Abbott, L., Turrigiano, G., LeMasson, G., & Marder, E. (1996). Activity-dependent conductances in model and biological neurons. In D. Waltz (Ed.), Natural and artificial parallel computing (pp. 43–68). Philadelphia: SIAM.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., & Lass, Y. (1975). Transmission of information by the axon: II. The channel's capacity. Biol. Cybern., 19, 121–125.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comp., 8, 1185–1202.
Bell, A., Mainen, Z., Tsodyks, M., & Sejnowski, T. (1995). “Balancing” of conductances may explain irregular cortical spiking (Tech. Rep. No. INC-9502). San Diego: Institute for Neural Computation, University of California at San Diego.
Bezrukov, S., & Vodyanoy, I. (1995). Noise-induced enhancement of signal transduction across voltage-dependent ion channels. Nature, 378, 362–364.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Braun, H., Huber, M., Dewald, M., Schafer, K., & Voigt, K. (1998). Computer simulations of neuronal signal transduction: The role of nonlinear dynamics and noise. Int. J. Bifurc. Chaos, in press.
Chow, C., & White, J. (1996). Spontaneous action potentials due to channel fluctuations. Biophys. J., 71, 3013–3021.
Clay, J., & DeFelice, L. (1983). Relationship between membrane excitability and single channel open-close kinetics. Biophys. J., 42, 151–157.
Cooley, J., Dodge, F., & Cohen, H. (1965). Digital computer solutions for excitable membrane models. J. Cell. Comp. Physiol., 66, 99–100.
de Ruyter van Steveninck, R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeFelice, L., & Isaac, A. (1992). Chaotic states in a random world. J. Stat. Phys., 70, 339–352.
FitzHugh, R. (1965). A kinetic model of the conductance changes in nerve membrane. J. Cell. Comp. Physiol., 66, 111–118.
Fleidervish, I., Friedman, A., & Gutnick, M. (1996). Slow inactivation of Na+
current and slow cumulative spike adaptation in mouse and guinea-pig neocortical neurons in slices. J. Physiol., 493, 83–97.
Grossman, Y., Parnas, I., & Spira, M. (1979). Differential conduction block in branches of a bifurcating axon. J. Physiol., 295, 283–305.
Gutfreund, Y., Yarom, Y., & Segev, I. (1995). Subthreshold oscillations and resonant frequency in guinea-pig cortical neurons: Physiology and modeling. J. Physiol., 483, 621–640.
Guttman, R., Lewis, S., & Rinzel, J. (1980). Control of repetitive firing in squid axon membrane as a model for a neuroneoscillator. J. Physiol., 305, 377–395.
Heller, J., Hertz, J., Kjaer, T., & Richmond, B. (1995). Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer Associates.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Hopfield, J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Horikawa, H. (1991). Noise effects on spike propagation in the stochastic Hodgkin-Huxley models. Biol. Cybern., 66, 19–30.
Horikawa, H. (1993). Simulation study on effects of channel noise on differential conduction at an axon branch. Biophys. J., 65, 680–686.
Hutcheon, B., Miura, R., Yarom, Y., & Puil, E. (1994). Low threshold calcium current and resonance in thalamic neurons: A model of frequency preference. J. Neurophysiol., 71, 583–594.
Jensen, R., & Gartner, D. (1997). Synchronization of randomly driven nonlinear oscillators and the reliable firing of cortical neurons. Paper presented at the Computational Neuroscience Meeting, Big Sky, MT.
Klink, R., & Alonso, A. (1993). Ionic mechanisms for the subthreshold oscillations and differential electroresponsiveness of medial entorhinal cortex layer II neurons. J. Neurophysiol., 70, 144–157.
Lampl, I., & Yarom, Y. (1997). Subthreshold oscillations and resonant behavior: Two manifestations of the same mechanism. Neuroscience, 78(2), 325–341.
Lass, Y., & Abeles, M. (1975). Transmission of information by the axon: I. Noise and memory in the myelinated nerve fiber of the frog. Biol. Cybern., 19, 61–67.
Lecar, H., & Nossal, R. (1971a). Theory of threshold fluctuations in nerves. I. Relationships between electrical noise and fluctuations in axon firing. Biophys. J., 11, 1048–1067.
Lecar, H., & Nossal, R. (1971b). Theory of threshold fluctuations in nerves. II. Analysis of various sources of membrane noise. Biophys. J., 11, 1068–1084.
Longtin, A., & Hinzer, K. (1996). Encoding with bursting, subthreshold oscillations, and noise in mammalian cold receptors. Neural Comp., 8, 215–255.
Mainen, Z., Joerges, J., Huguenard, J., & Sejnowski, T. (1995). A model of spike initiation in neocortical pyramidal neurons. Neuron, 15, 1427–1439.
Mainen, Z. F., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1508.
Marder, E., Abbott, L., Turrigiano, G., Liu, Z., & Golowasch, J. (1996). Memory from the dynamics of intrinsic membrane currents. Proc. Natl. Acad. Sci. USA, 93, 13481–13486.
Marom, S. (1998). Slow changes in the availability of voltage-gated ion channels: Effects on the dynamics of excitable membranes. J. Mem. Biol., 161, 105–113.
Marom, S., Salman, H., Lyakhov, V., & Braun, E. (1996). Effects of density and gating of delayed-rectifier potassium channels on resting membrane potential and its fluctuations. J. Membrane Biol., 154, 267–274.
Marsalek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Nowak, L., Sanchez-Vives, M., & McCormick, D. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cerebral Cortex, 7, 487–501.
Patlak, J. (1991). Molecular kinetics of voltage-dependent Na+ channels. Physiological Rev., 71(4), 1047–1080.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rapp, M., Yarom, Y., & Segev, I. (1996). Modeling back propagating action potential in weakly excitable dendrites of neocortical pyramidal cells. Proc. Natl. Acad. Sci. USA, 93, 11985–11990.
Reich, D., Victor, J., Knight, B., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. J. Neurophysiol., 77, 2836–2841.
Rinzel, J., & Ermentrout, B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 135–169). Cambridge, MA: MIT Press.
Rubinstein, J. (1995). Threshold fluctuations in an N sodium channel model of the node of Ranvier. Biophys. J., 68, 779–785.
Sakmann, B., & Neher, E. (1995). Single-channel recording (2nd ed.). New York: Plenum.
Schneidman, E., Freedman, B., & Segev, I. (1997). Spike timing reliability in a stochastic Hodgkin-Huxley model. Paper presented at the Computational Neuroscience Meeting, Big Sky, MT.
Shadlen, M., & Newsome, W. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M., & Newsome, W. (1995). Is there a signal in the noise? Curr. Opin. Neurobiol., 5, 248–250.
Sharp, A., O'Neil, M., Abbott, L., & Marder, E. (1993). The dynamic clamp: Artificial conductances in biological neurons. Trends in Neurosci., 16, 389–394.
Sigworth, F. (1993). Voltage gating of ion channels. Quarter. Rev. Biophys., 27, 1–40.
Skaugen, E., & Walløe, L. (1979). Firing behavior in a stochastic nerve membrane model based upon the Hodgkin-Huxley equations. Acta Physiol. Scand., 107, 343–363.
Softky, W. (1995). Simple codes versus efficient codes. Curr. Opin. Neurobiol., 5, 239–247.
Strassberg, A., & DeFelice, L. (1993). Limitations of the Hodgkin-Huxley formalism: Effects of single channel kinetics on transmembrane voltage dynamics. Neural Computation, 5, 843–856.
Tang, A., Bartels, A., & Sejnowski, T. (1997). Cholinergic modulation preserves spike timing under physiologically realistic fluctuating input. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 866–872). Cambridge, MA: MIT Press.
Theunissen, F., & Miller, J. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comp. Neurosci., 2, 149–162.
Toib, A., Lyakhov, V., & Marom, S. (1998). Interaction between duration of activity and rate of recovery from slow inactivation in mammalian brain Na+ channels. J. Neurosci., 18, 1893–1903.
Troyer, T., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9, 733–745.
Vandenberg, C., & Bezanilla, F. (1991). A sodium channel gating model based on single channel, macroscopic ionic, and gating currents in the squid axon. Biophys. J., 60, 1511–1533.
Volgushev, M., Chistiakova, M., & Singer, W. (1998). Modification of discharge patterns of neocortical neurons by induced oscillations of the membrane potential. Neuroscience, 83, 15–25.
White, J., Budde, T., & Kay, A. (1995). A bifurcation analysis of neuronal subthreshold oscillations. Biophys. J., 69, 1203–1217.
Yamada, W., Koch, C., & Adams, P. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 97–134). Cambridge, MA: MIT Press.

Received September 4, 1997; accepted January 12, 1998.
LETTER
Communicated by Raphael Ritz
Fast Temporal Encoding and Decoding with Spiking Neurons David Horn Sharon Levanda School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
We propose a simple theoretical structure of interacting integrate-and-fire neurons that can handle fast information processing and may account for the fact that only a few neuronal spikes suffice to transmit information in the brain. Using integrate-and-fire neurons that are subjected to individual noise and to a common external input, we calculate their first passage time (FPT), or interspike interval. We suggest using a population average for evaluating the FPT that represents the desired information. Instantaneous lateral excitation among these neurons helps the analysis. By employing a second layer of neurons with variable connections to the first layer, we represent the strength of the input by the number of output neurons that fire, thus decoding the temporal information. Such a model can easily lead to a logarithmic relation as in Weber’s law. The latter follows naturally from information maximization if the input strength is statistically distributed according to an approximate inverse law.
Neural Computation 10, 1705–1720 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

A fundamental question in neuroscience is the understanding of the neural code (Ferster & Spruston, 1995). The standard doctrine is that information is transmitted by action potentials. Since the sizes of these signals are uniform, the question arises how they code the information about the strength of the input that triggered them. The common belief is that the firing rate reflects this strength. However, firing rate evaluation takes a long time, whereas evidence exists that information passes quite quickly. In the hierarchical structure of the visual cortex, one can estimate (Thorpe, Fize, & Marlot, 1996) that during the time that the information has passed through several synaptic connections, most of the neurons in the different layers fired only once. This can lead to the conclusion that the exact timing, or latency, of the firing is of importance. For a debate on whether latency or rate plays the important role, we refer to Shadlen and Newsome (1994) and Softky (1995). In the visual system (Thorpe et al., 1996) one can imagine that a single action potential carries important information because it can specify the
location and orientation of the stimulus. Moreover, action potentials from different neurons can specify relative strengths of the inputs according to their latency: the later the spike, the weaker the signal. In an abstract problem, we consider a neuron, or a group of neurons, subject to some input signal. To measure its strength through a single spike, one needs some reference system. Hopfield (1995) suggested employing a subthreshold periodic potential. An external constant input then causes the neuron to fire periodically, with some time lag with respect to the underlying oscillation. Hence, in this model, information is carried by the relative phase of the neuron's firing with respect to the underlying periodic potential. The fact that latency can have computational meaning is brought out clearly in the models by Maass (1997). Working with coupled integrate-and-fire (I & F) neurons, and allowing the synaptic input Wij Sj from neuron j to neuron i to be effectively constant from the time that neuron j fired until the time neuron i fires, leads to a temporal representation of analog amplitudes. The reference frame has to be defined in some manner, for example, by the firing pattern of one neuron. In our study we will not employ any external reference frame, so we will have to work with two spikes per neuron in an effort to obtain information about the input signal from the interspike interval. We investigate systems of I & F neurons with instantaneous interactions. To imitate the biological environment, we use random noise to generate a subthreshold potential. We ask how the timing of the neuron's firing can carry the information about the input in such a system. In the following section we calculate the first passage time (FPT) for this system: the average interval between two consecutive spikes. This is the inverse of the firing rate of the neuron. It therefore carries the same analog information as the firing rate does. The question still remains how computation can be carried out in view of the fluctuations that every neuron is subjected to. The answer we propose is averaging over a large number of neurons that receive the same input. This is elaborated in the third section. We also look at the effect of excitatory lateral interactions in this neuronal assembly, leading to better synchronization and a lower FPT. After understanding how to encode the strength of the stimulus as temporal information in the FPT, we construct a secondary neural layer that decodes this information, translating it into a number of firing neurons. Such a neural system can easily account for Weber's law, which implies a logarithmic relationship between the strength of the stimulus and the reaction of the central nervous system. This follows naturally from an information maximization approach if the distribution of inputs obeys a statistical inverse law.
2 First Passage Time

Consider an I & F neuron under the influence of noisy input. Its membrane potential obeys the equation

\tau \frac{dV}{dt} = -(V - V_0) + I, \qquad (2.1)
where V0 is a resting potential and I represents the incoming current (in voltage units). When V reaches its threshold value Vth, a spike is emitted, and V is reset to V0. Let us now assume that the input is composed of signal and noise, I = IS + IN. The noise is assumed to be gaussian, averaged around zero, so that in the absence of any signal, V does not usually reach the threshold. Hence, action potentials that are caused by noise are very rare. Once a signal is turned on, one may ask what the time period is between two consecutive action potentials. This is called the FPT since it is also the time it will take the neuron to fire if the membrane potential starts out at its resting value V0. Clearly this time period is statistically distributed due to the noise. Only in the absence of noise is it deterministic. In general we ask for the average value only. This value is the inverse of the firing rate of the neuron, and it depends on both the amplitude of the signal, IS, and the characteristics of the noise, IN. This problem can be solved in a closed form (Ricciardi & Sacerdote, 1979). Our derivation follows Gardiner (1983). Defining a fluctuating, normally distributed input for equation 2.1 with

E\{I(t)\} = \eta_I, \qquad E\{(I(t_1) - \eta_I)(I(t_2) - \eta_I)\} = A\,\delta(t_1 - t_2), \qquad (2.2)

one finds that V will also be normally distributed, with moments

\mu(t) = E\{V(t)\} = V_0\, e^{-(t-t_0)/\tau} + \eta_I \left(1 - e^{-(t-t_0)/\tau}\right), \qquad (2.3)

\sigma^2(t) = E\{(V(t) - \mu)^2\} = \frac{A}{2\tau} \left(1 - e^{-2(t-t_0)/\tau}\right). \qquad (2.4)
Hence, the probability density of V(t), given V(t0) = v0 = V0, is

p(v, t \mid v_0, t_0) \equiv P\left(V(t) = v \mid V(t_0) = v_0\right) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(v-\mu)^2}{2\sigma^2}}. \qquad (2.5)
The arguments of this function are limited in range. The upper bound is defined by the threshold for firing, Vth. We may also set a lower bound, Vmin, that corresponds to the lowest potential value that the membrane can
sustain. This probability density obeys a Fokker-Planck equation, which one can use to obtain the following expression for the average FPT, T:

T = \frac{2\tau^2}{A} \int_{V_0}^{V_{th}} dv\; e^{\frac{\tau}{A}(v-\eta_I)^2} \int_{V_{min}}^{v} e^{-\frac{\tau}{A}(z-\eta_I)^2}\, dz. \qquad (2.6)
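With Vmin = −∞, the inner integral of equation 2.6 is a gaussian tail, so T reduces to a single quadrature: T = τ√π ∫ erfcx(−x) dx, with x = √(τ/A)(v − ηI) running over the interval corresponding to v ∈ [V0, Vth]. The Python sketch below (our function names, illustrative parameters) evaluates this and checks it against a direct simulation of equation 2.1; scipy's erfcx keeps the integrand numerically stable for strong inputs.

import numpy as np
from scipy.special import erfcx

def fpt_quadrature(Is, A, tau=1.0, v0=0.0, vth=1.0, npts=4001):
    # Equation 2.6 with Vmin = -infinity. With x = sqrt(tau/A)(v - eta_I),
    # the double integral collapses to T = tau*sqrt(pi) * int erfcx(-x) dx,
    # where erfcx(-x) = exp(x^2) * (1 + erf(x)) evaluated stably.
    x = np.sqrt(tau / A) * (np.linspace(v0, vth, npts) - Is)
    y = erfcx(-x)
    return tau * np.sqrt(np.pi) * (y.sum() - 0.5 * (y[0] + y[-1])) * (x[1] - x[0])

def fpt_montecarlo(Is, A, tau=1.0, v0=0.0, vth=1.0, dt=1e-3, trials=2000, seed=0):
    # Direct Euler simulation of equation 2.1 with white-noise input
    # of intensity A (equation 2.2); returns the mean first passage time.
    rng = np.random.default_rng(seed)
    v = np.full(trials, v0)
    t = np.zeros(trials)
    alive = np.ones(trials, bool)
    while alive.any():
        v[alive] += dt * (-(v[alive] - v0) + Is) / tau \
                    + np.sqrt(A * dt) / tau * rng.standard_normal(alive.sum())
        t[alive] += dt
        alive &= v < vth
    return float(t.mean())

for Is in (1.5, 3.0, 6.0):
    print(Is, fpt_quadrature(Is, A=0.5), fpt_montecarlo(Is, A=0.5))

For large IS, both estimates approach the noiseless approximation of equation 2.7 below.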
The details of this calculation are described in the appendix. In Figure 1 we present a comparison between the theoretical formula and numerical simulations. The average noise term vanishes, so that ηI = IS. For large input currents, IS ≫ √A, the signal term dominates the FPT, in which case one can replace it with the noiseless solution

T \to -\tau \log\left(1 - \frac{V_{th} - V_0}{I_S}\right). \qquad (2.7)

This can be further approximated by

T \to \frac{\tau\,(V_{th} - V_0)}{I_S}, \qquad (2.8)
for IS ≫ Vth − V0. In this study we look at the case when the size of the stimulus is larger than, or at least comparable to, that of the noise. The other extreme, when the stimulus is small compared to the noise, was recently studied by Stemmler (1996), who pointed out that a stochastic resonance behavior follows. Other works that have discussed the FPT in the context of neuronal interspike intervals are Ricciardi and Sacerdote (1979), Tuckwell and Cope (1980), Tuckwell (1989), and Lansky, Sacerdote, and Tomasetti (1995).

3 Population Average in a Neural Assembly

Suppose a population of N neurons, subjected to random noise with given characteristics, starts out with random initial conditions. At a certain time, a stimulus IS is turned on (the same stimulus for all neurons). Let us denote by t1 the time that passes from the onset of the stimulus IS until the first firing. Can we use ⟨t1⟩, the average calculated over this population, to determine IS? Clearly not, since we still lack the information when the stimulus was turned on. If, however, we include t2 in our computational scheme, we may do it, because the average value of t2 − t1 over a population of neurons is equal to the FPT,

T = \langle t_2 - t_1 \rangle \equiv \frac{1}{N} \sum_{i=1}^{N} \left[ t_2(i) - t_1(i) \right]. \qquad (3.1)
Clearly this value depends on IS . Thus, if we are interested in fast information transfer, we can resort to a population average that conveys the same
Figure 1: Comparison of the theoretical evaluation of the FPT with numerical simulations, as a function of input strength IS. The upper curve corresponds to the case of equation 2.6, with A = 0.5. The lower curve is a solution of equation 3.2 with A = 0.5 and wN = 0.4. We have set V0 = 0, Vth = 1, Vmin = −∞, and τ = 1 in all figures.
information as a long temporal average needed to estimate the firing rate of any single neuron. A simple calculational scheme for performing such averages could be to let all neurons feed some output unit that can calculate ⟨t1⟩ and ⟨t2⟩ separately and to train a simple network to associate the difference between the two with the relevant size of the input. The problem is the separation between the probability distributions of t1 and t2. For weak stimuli or high noise, some of the neurons will fire for the second time before others fire for the first time. This problem can be alleviated by introducing synchronizing interactions between the neurons. The example that we use in our simulations is that of excitatory lateral interactions. The interactions are triggered by the occurrence of an action potential in the presynaptic neuron, and the synaptic transmission is assumed to be instantaneous (see footnote 1). It is well known (Mirollo & Strogatz, 1990) that under these conditions, neurons tend to synchronize.

Footnote 1: Synaptic delays are known to spoil synchronization. However, it was recently found that synaptic adaptation reinstates it (van Vreeswijk & Hansel, 1997).
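A population run of the kind shown in Figure 2 is easy to sketch. The toy Python version below is ours (names and parameter values are illustrative); within each time step, the instantaneous excitatory kicks are applied repeatedly until no further neuron crosses threshold, and the FPT is then estimated from equation 3.1.

import numpy as np

def population_run(N=1000, Is=6.0, wN=0.33, A=0.2, tau=1.0, vth=1.0,
                   dt=1e-3, T_total=5.0, seed=0):
    # N I&F neurons (equation 2.1 with V0 = 0), a common signal Is, individual
    # white noise of intensity A, instantaneous all-to-all excitation w = wN/N.
    rng = np.random.default_rng(seed)
    w = wN / N
    v = np.zeros(N)
    raster = []                                  # (time, neuron) spike events
    for step in range(int(T_total / dt)):
        v += dt * (-v + Is) / tau + np.sqrt(A * dt) / tau * rng.standard_normal(N)
        fired = np.zeros(N, bool)
        just = v >= vth
        while just.any():                        # kicks may trigger more spikes
            fired |= just
            v[fired] = 0.0                       # reset to V0
            v[~fired] += w * just.sum()          # instantaneous excitation
            just = (v >= vth) & ~fired
        raster.extend((step * dt, i) for i in np.nonzero(fired)[0])
    return raster

# Estimate the FPT of equation 3.1 from the first two spikes of each neuron.
first, second = {}, {}
for t, i in population_run():
    if i not in first:
        first[i] = t
    elif i not in second:
        second[i] = t
print("population FPT estimate:", float(np.mean([second[i] - first[i] for i in second])))

Setting wN = 0 reproduces the uncoupled case; a positive wN both tightens the volleys and shortens the estimated FPT, as described below.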
Figure 2: Firing activity of a set of N = 1000 I & F neurons as a function of time after signal onset. The input signal is IS = 6 from t = 0 to t = 1000 and changes to IS = 12 from t = 1000 onward. The system starts out at t = 0 with all V = V0; hence, the first action potential is more synchronous than the rest. Synchrony deteriorates with time due to the random noise, whose variance is A = 0.2. The three frames correspond as follows: (a) spiking activity without lateral interactions; (b) spiking activity with lateral interactions of wN = 0.33; (c) the strength of the input IS. The instantaneous excitatory interactions lead to improved synchrony and a shortened FPT.
This effect can be seen in Figure 2, where we compare neuronal firing activity for cases with and without interactions. The effect of the coupling is twofold: in addition to improving synchrony, it shortens the resulting FPT. The new FPT can be estimated theoretically by a mean-field approximation. This approximation is valid when the accumulated effect of the noise overcomes synchronization, as at the end of the time scale in Figure 2. For simplicity, let us assume that all neurons interact with all others with the same synaptic weight w. We look for a self-consistent solution of the FPT, T, in terms of an input that is changed to IS + wN/T, where N is the total number of neurons:

T = T\left(I_S + \frac{wN}{T}\right). \qquad (3.2)

The second term in the input represents the average effect of all other neurons. T remains a function of IN, the noise whose average vanishes. This
theoretical estimate is compared in Figure 1 with simulation results. For large inputs, it corresponds to a uniform reduction of the FPT, as can be seen from the asymptotic expression that is valid in this case:

T \to \frac{\tau\,(V_{th} - V_0) - wN}{I_S}. \qquad (3.3)
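Equation 3.2 can be solved self-consistently by straightforward fixed-point iteration. The sketch below (ours) substitutes the noiseless FPT of equation 2.7 for the full noisy T(·), which is adequate in the large-input regime where the asymptotic expression 3.3 holds.

import numpy as np

def T_noiseless(I, tau=1.0, v0=0.0, vth=1.0):
    # Equation 2.7 (valid for I > vth - v0).
    return -tau * np.log(1.0 - (vth - v0) / I)

def T_coupled(Is, wN=0.4, iters=100):
    # Fixed-point iteration of equation 3.2: T = T(Is + wN / T).
    T = T_noiseless(Is)
    for _ in range(iters):
        T = T_noiseless(Is + wN / T)
    return T

for Is in (3.0, 6.0, 12.0):
    print(Is, T_noiseless(Is), T_coupled(Is),
          (1.0 - 0.4) / Is)   # asymptotic equation 3.3 with tau = vth = 1

The iterated values converge quickly and approach the equation 3.3 estimate as Is grows.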
The simulations displayed in Figure 2 started with V = V0 at t = 0. As time goes on, we see how the accumulated effect of the noise desynchronizes the spiking activity of the neurons, even if synchronizing interactions exist among them. The mean-field approximation, used to derive equation 3.2, is valid in the limit when desynchronization takes place. However, its qualitative behavior correctly describes even the changes in the average time difference between the first few synchronized volleys of action potentials.

4 Decoding the Temporal Information

In the previous section we discussed a way by which the intensity of the incoming stimulus can be encoded in the temporal domain. Now we wish to construct a simple neuronal model that decodes this information and translates the temporal coding back into one of firing intensity of a (new) neuronal layer. This system will work as a quantizer, representing the analog input signal IS by a number n of firing neurons. Hopfield (1995) has suggested employing delays in the decoding stage. This idea, which is backed by biological evidence, can clearly translate temporal information into a number of firing neurons by employing the latter as coincidence detectors. In the structure proposed in this section, we attempt to solve the problem without employing delays, relying instead on appropriate distributions of synaptic weights. Clearly a general system may employ distributions of both delays and synaptic strengths in the decoding stage. Consider the following feedforward network of I & F neurons. The first layer is a neuronal assembly of the type described in the previous section, subjected to subthreshold noise and acting under the influence of some external input IS. Introduce now a second layer of N neurons, connected by excitatory weights Wij to neurons in the first layer. These weights vary within some given range. We will specify below how this variation affects the decoding task. For simplicity, we assume that no synaptic delays occur. We choose the parameters of our model so that the first action potentials on the first layer are insufficient to bring the neurons on the second layer to their firing threshold. However, the second action potential of neurons on the first layer will cause some of the neurons on the second layer to fire. The number of second-layer neurons that fire after t2, n(IS), is a function of the strength of the input. The membrane potential of neuron number i on the second layer is given by

V_i = W_i \left( e^{-T(I_S)/\tau} + 1 \right), \qquad (3.4)
where W_i = \sum_j W_{ij}. Clearly all neurons for which Vi ≥ 1 will fire. This relation is valid in the limit when the signal IS dominates the noise. We will return to the case of noisy input in section 6. Let us enumerate the neurons on the second layer, i = 1, 2, . . . , N, such that Wi is an increasing function of i. Each neuron can then be associated with a particular value of the input, IS = Ji, that is just sufficient to bring that neuron to its firing threshold. It therefore has to obey

1 = W_i \left( e^{-T(J_i)/\tau} + 1 \right). \qquad (3.5)
The values Jn can be regarded as quantizer demarcation points that discretize the input into a set of N + 1 values, n = 0, 1, . . . , N, with J_0 = I_S^{min} and J_{N+1} = I_S^{max}. The relation between Wi and Ji is depicted in Figures 3a and 3d. In order to determine the values of Wi, we propose to use maximization of the mutual information (Linsker, 1988; Barlow, 1989) between the input IS and the output n(IS):

I(n; I_S) = H(n) - H(n \mid I_S). \qquad (3.6)
This is equivalent to maximization of the output entropy H(n) if n(IS) is a deterministic function. This entropy is defined as

H(n) = -\sum_n P(n) \log P(n), \qquad (3.7)
where P(n) is the probability of n neurons firing,

P(n) = \int_{J_n}^{J_{n+1}} p(I_S)\, dI_S. \qquad (3.8)
p(IS) is the probability distribution of the input. The output entropy is maximized by a uniform distribution,

P(n) = \frac{1}{N+1}. \qquad (3.9)
This leads to a relation between n and Jn,

\frac{n+1}{N+1} = \int_{J_0}^{J_{n+1}} p(I_S)\, dI_S, \qquad (3.10)
thus defining our (ideal) quantizer (Cover & Thomas, 1991) for the given problem. In Figures 3a and 3b, we present the quantization choices suitable for an input that follows the gaussian distribution of Figure 3c. We also display the
resulting relation between n and Jn, which, for high N, is well approximated by a smooth curve:

n(I_S) \approx N \int_{I_S^{min}}^{I_S} p(I')\, dI'. \qquad (3.11)
Such a relation between response and stimulus is a general consequence of efficient information transfer (Atick, 1992). For the gaussian input, the result is a sigmoidal relation, as shown in Figure 3b. This analysis provides an algebraic solution of the maximization of mutual information. Alternatively, one can solve this problem iteratively by a gradient ascent method. The task is finding the quantization points Ji, or the connection strengths Wi. Starting with an initial random choice of Wi, one can obtain the desired solution through the iterative process

\Delta W_i \sim \frac{\partial H}{\partial W_i} = \frac{\partial H}{\partial J_i} \frac{\partial J_i}{\partial W_i}. \qquad (3.12)
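For concrete input distributions, the algebraic solution is immediate: equation 3.10 makes J_{n+1} the (n+1)/(N+1) quantile of p(IS), and equation 3.5 (with the noiseless FPT of equation 2.7) then fixes the weights. The Python sketch below is ours, with illustrative parameter values; it includes both the gaussian case of Figure 3c and the 1/IS case taken up in section 5, for which the demarcation points come out logarithmically spaced.

import numpy as np
from scipy.stats import norm

def demarcations_gaussian(N, mu=6.0, sd=1.5):
    # Equation 3.10 for gaussian p(IS): J_{n+1} is the (n+1)/(N+1) quantile.
    q = np.arange(1, N + 1) / (N + 1)
    return norm.ppf(q, loc=mu, scale=sd)

def demarcations_inverse(N, Imin=1.5, Imax=15.0):
    # Equation 3.10 for p(IS) ~ 1/IS on [Imin, Imax]: logarithmic spacing.
    q = np.arange(1, N + 1) / (N + 1)
    return Imin * (Imax / Imin) ** q

def weights(J, tau=1.0, v0=0.0, vth=1.0):
    # Equation 3.5, using the noiseless FPT of equation 2.7 for T(J).
    T = -tau * np.log(1.0 - (vth - v0) / J)
    return 1.0 / (np.exp(-T / tau) + 1.0)

def n_firing(Is, W, tau=1.0, v0=0.0, vth=1.0):
    # Equation 3.4: count the second-layer neurons whose V_i reaches 1.
    T = -tau * np.log(1.0 - (vth - v0) / Is)
    return int(np.sum(W * (np.exp(-T / tau) + 1.0) >= 1.0))

J = demarcations_inverse(50)
W = weights(J)
for Is in (2.0, 4.0, 8.0):       # each doubling of Is recruits a roughly
    print(Is, n_firing(Is, W))   # constant number of extra neurons

With the inverse-law demarcations, the count n grows with log(IS), which is exactly the Weber's-law behavior derived in the next section; swapping in demarcations_gaussian instead yields the sigmoidal n(IS) of Figure 3b.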
4 Weber’s Law Weber’s law is the psychophysical observation that neuronal response is proportional to log(IS ), or 1R = 1IS /IS ,
(5.1)
where R denotes the response of the central nervous system. This is observed in various modalities and is analyzed best in vision (Cornsweet, 1970). In trying to understand this general observation, one may think of it as a system’s property, or look for an interpretation on the level of the single neuron. Hopfield (1995) opted for the latter. Assuming the existence of an underlying periodic potential, he aimed at obtaining delay times that are proportional to log(IS ). A different approach, still restricted to the single neuron, has been recently proposed by Tal and Schwartz (1997). They work with I & F neurons and look for a logarithmic relationship between the firing rate of the neuron and its input. The logarithm of equation 2.7 does not help, since over most of the range the approximation of equation 2.8 dominates. Tal and Schwartz show, however, that if one allows for an absolute refractory period t0 in the range of values 0.13 < t0 /τ < 0.23, the neuron acts in a quasi-logarithmic fashion. In our model, it is easy to account for a logarithmic rule on the system’s level, shifting the burden of proof from the first-layer neuron to the analysis carried out by the second layer. To obtain n ∼ log(IS ) one has to make the choice of Wi values as indicated in Figure 3d, now covering the entire range of IS with a varying density. This can be motivated by the
Figure 3: Decoding characteristics of a second layer of N = 50 neurons. Upper frames: weights of the output neurons as a function of input strength IS. Middle frames: number of firing neurons on the second layer as a function of input strength. Lower frames: probability distribution of the input. (a) Wi vs. Ji and (b) n vs. IS are depicted for a gaussian input distribution (c). Similarly, (d) and (e) represent the weights and the number of firing neurons for a 1/IS-type distribution, shown in (f). (e) displays the logarithmic behavior expected from Weber's law.
information-maximization principle if the distribution of the incoming intensity is p(IS) ∼ 1/IS, as in Figure 3f. It seems quite reasonable to assume that in most modalities we are subjected to input distributions that peak at low strengths. If the latter approximately resemble an inverse-law distribution, the information-maximization principle will lead to an approximate logarithmic relation. It should be noted that our model is a simplified construction, described here over one order of magnitude of variation of IS. The central nervous system clearly has a much more complicated architecture involved in making response decisions over ranges of many decades of input strength. Nonetheless, our model can serve as an example, implementing the principles by which this system operates.

6 Temporal Decoding in the Presence of Noise

The quantizer systems of the previous two sections were described and analyzed as coding-decoding neuronal layers in the absence of noise. Moreover, the analysis has also made the assumption that all neurons on the first layer start out with the same initial values of their membrane potential; otherwise they would not fire simultaneously. Relaxing these assumptions makes it difficult to carry out the analysis. Nonetheless, as we will show here, small changes in initial conditions, as well as low noise values, do not change the qualitative behavior. To understand these effects, let us consider the hypothetical case that the first-layer neurons fire according to a uniform distribution over a range of time Δ that is small with respect to T. This problem is still noiseless but corresponds to a (very particular) modification of the initial conditions. It leads to the following modified relation,

1 = W_i \left( e^{-T(J_i)/\tau} + 1 \right) \frac{1 - e^{-\Delta/\tau}}{N \left( 1 - e^{-\Delta/(N\tau)} \right)}, \qquad (6.1)
which is an effective rescaling of the Wi , resulting in a shift of the curve of W vs. J as shown in Figure 4a. We consider this shift to be the important effect of any small variation due to either noise or initial conditions. Thus, in the presence of low noise we search for the best effective 1 that will give a W vs. J curve that leads to a match between P(n) and p(IS ). This suffices to provide an adequate solution to the problem, as shown in Figure 4b. 7 Discussion The dynamics of encoding a signal in a population of neurons has been a subject of interest in the neuroscience community for a long time. In his seminal paper, Knight (1972) introduced the I & F neuron as a tool to study such questions, and he has shown its response to oscillatory input with and without noise. The I & F neuron has become a standard simplified version of
1716
David Horn and Sharon Levanda
Figure 4: (Left) Weights of output neurons as a function of input strength IS for 1 =0 (upper curve) and .12τ (lower curve). (Right) Number of firing neurons on the second layer for the gaussian distribution of Figure 3c with (circles) and without (line) noise (A = 0.1). The noisy case was processed with 1 = 1.1τ .
a biological neuron and can be used to confront the problem that interests us in this article: How can a neural system encode and decode the strength of an input signal in a very short time? We suggest a model system composed of two layers of neurons. The first one encodes the amplitude of the stimulus by a time interval, and the second one translates it into a number of firing neurons. The operation of the second layer should terminate after its first volley of action potentials. This can be brought about by a self-inhibition mechanism. It is preferable if this inhibition also acts on the first layer in order to bring all its neurons to the baseline value before a new stimulus arrives. One may wonder why one needs the two layers at all. Can’t we connect the second layer directly to the stimulus, arranging some clever choice of weights that will lead to the same result? The problem encountered in such an architecture is lack of uniformity. It is difficult to construct a solution that leads to a coherent peak of activity. Allowing for neurons that react with different sensitivity, such as by factors of 2 and more, we will end up with outputs at different times, not to mention the problem of different numbers of spikes in the same time span. The two-layer structure allows us to introduce coherence in the first layer, leading to a manageable computation.
Encoding and Decoding with Spiking Neurons
1717
In the absence of noise, one can encode the input IS by the interspike interval of a single neuron. Assuming noise has to be taken into account, we have considered the case of many neurons on the first layer, each subjected to individual noise drawn from a common distribution. Clearly the average of this population serves the purpose of defining, with some accuracy, the desired time interval, or FPT. This approach replaces the insufficient time span, required for neural measurement of firing rate, by a population average of interspike intervals, which is the inverse of the instantaneous firing rate. Thus, although temporal acuity is important in our model, it still carries information that is equivalent to rate. The decoding stage needs many neurons for fine quantization even if one uses a single neuron for encoding (which is possible in the noiseless case). The number of active neurons, n, is a function of T and, as such, of the input signal IS . Such systems can be employed also for comparing the latency of different neurons on the first layer. As an example, imagine taking the limit of a very narrow gaussian distribution in Figure 3c. This would result in a very sharp profile of n(IS ) in Figure 3b. Such a system serves as a threshold device for IS > I0 or T < T0 , where I0 is the center of the gaussian, and T0 its corresponding FPT. For this purpose, even a single neuron on the second layer would suffice. Now let us use the same architecture to measure latency. We assume that the two neurons whose latency we compare constitute the first layer, and they are connected to the second layer with weights corresponding to the threshold function described above. If one neuron emits a spike and the second follows some time later, the system will react only if the time difference between these spikes is smaller than T0 . In order to design a system that will measure a latency of just T0 , we have to split the second layer into two parts—one connected so that it reacts to T < T0 , while the other reacts to T < T0 + 1T. Now one may employ an additional neuron, on a third layer, connected to the two parts of the second layer with opposite synaptic weights, so that it will be activated by just the correct latency within an accuracy of 1T. The minimal number of neurons needed for this purpose is two on the second layer and one on the third. Thus, in the general debate as to whether the neural code depends on rate or latency, we can take an intermediate position. We have the freedom of making use of the FPT, which is equivalent to rate, and was sufficient for the problem discussed in this article. Or we can measure, using the same apparatus, temporal correlations, which are often assumed to play an important role in problems like binding and segmentation. In our problem we have used an information-maximization approach to guide us in the design of the network. This is often regarded as a natural principle for biological development of synaptic connections. The same principle presumably can work for general latency-based structures whenever it can be appropriately defined.
1718
David Horn and Sharon Levanda
Appendix: Derivation of the First Passage Time First passage time can be evaluated by employing conventional means for treating random walk problems. The probability function for the neuron’s potential, equation 2.5, obeys the backward Fokker Planck equation (van Kampen, 1981), ∂ 1 ∂2 ∂p(v, t | v0 , t0 ) = −a1 (v0 ) 0 p(v, t | v0 , t0 ) − a2 2 p(v, t | v0 , t0 ) (A.1) 0 ∂t ∂v 2 ∂v0 with a1 (v) =
1 A (ηI − v) a2 = 2 . τ τ
(A.2)
This defines an Ornstein-Uhlenbeck process (van Kampen, 1981), with the condition p(v, t | v0 , t) = δ(v − v0 ).
(A.3)
We are interested in the time T that passes from the point that the membrane potential is v0 until it reaches the threshold Vth . Following Gardiner (1983) we calculate T by erecting an absorbing barrier at Vth and a reflecting barrier at Vmin , the lowest potential value that the membrane can sustain. The probability that at time t the potential has not passed Vth , that is, the probability that T ≥ t, is given by Z G(v0 , t) ≡
Vth
p(v, t | v0 , 0)dv.
Vmin
Employing the symmetry p(v, t | v0 , 0) = p(v, 0 | v0 , −t), the backward Fokker Planck equation can be written in the form, ∂2 ∂ 1 ∂p(v, t | v0 , 0) = a1 (v0 ) p(v, t | v0 , 0) + a2 p(v, t | v0 , 0). (A.4) ∂t ∂v0 2 ∂v0 2 Hence, G(v0 , t) obeys ∂G(v0 , t) 1 ∂ 2 G(v0 , t) ∂G(v0 , t) = a1 (v0 ) + a2 . ∂t ∂v0 2 ∂v20
(A.5)
Encoding and Decoding with Spiking Neurons
1719
The initial conditions are G(v0 , 0) = 1 Vmin ≤ v0 ≤ Vth = 0 elsewhere
(A.6)
and the boundary conditions are G(Vth , t) = 0, ∂G(V∂vmin ,t) = 0. Since G(v0 , t) is the probability that T≥ t, the mean of the first passage time is Z ∞ Z ∞ ∂G(v0 , t) dt = t· G(v0 , t)dt. (A.7) T = hTi = − ∂t 0 0 Integrating equation A.5 and noting that Z
∞ 0
∂G(v0 , t) dt = G(v0 , ∞) − G(v0 , 0) = −1, ∂t
one obtains a1 (v0 )
∂T 1 ∂ 2T + a2 2 = −1, ∂v0 2 ∂v0
(A.8)
min ) = 0, T(v0 = Vth ) = 0. The solution of with boundary condition ∂T(v0∂v=V 0 this equation is the desired expression for the FPT:
T=2
τ2 A
Z
Vth
V0
τ
dv · e A (v−ηI )
2
Z
v
τ
e− A (z−ηI ) dz. 2
(A.9)
Vmin
References Atick, J. J. (1992). Ecological theory of sensory processing. Network, Computation in Neural Systems, 3, 213–251. Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311. Cornsweet, T. N. (1970). Visual perception. New York: Academic Press. Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley. Ferster, D., & Spruston, N. (1995). Cracking the neuronal code. Science, 270, 756– 757. Gardiner, C. W. (1983). Handbook of stochastic methods for physics, chemistry and natural sciences. Berlin: Springer-Verlag. Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36. Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766. Lansky, P., Sacerdote, L., & Tomasetti, F. (1995). On the comparison of Feller and Ornstein-Uhlenbeck models for neural activity. Biol. Cybern., 73, 457–465.
1720
David Horn and Sharon Levanda
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21, 105–117. Maass, W. (1997). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304. Mirollo, R., & Strogatz, S. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662. Ricciardi, L. M., & Sacerdote, L. (1979). The Ornstein-Uhlenbeck process as a model for neuronal activity. Biol. Cybern., 35, 1–9. Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurob., 4, 569–579. Softky, W. R. (1995). Simple codes versus efficient codes. Current Opinion in Neurob., 5, 239–247. Stemmler, M. (1996). A single spike suffices: The simplest form of stochastic resonance in model neurons. Network: Comp. Neural Systems, 7, 687–716. Tal, D., & Schwartz, E. L. (1997). Computing with the leaky integrate-and-fire neuron: Logarithmic computation and multiplication. Neural Computation, 9, 305–318. Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522. Tuckwell, H. C. (1989). Stochastic processes in the neuroscience. Philadelphia: Society for Industrial and Applied Mathematics. Tuckwell, H. C., & Cope, D. K. (1980). Accuracy of neuronal interspike times calculated from a diffusion approximation. J. Theo. Biol., 83, 377–387. van Kampen, N. G. (1981). Stochastic processes in physics and chemistry. Amsterdam: North Holland. van Vreeswijk, C., & Hansel, D. (1997). Rhythmic bursting in networks of adaptive spiking neurons. Paper presented at Sixth Annual Computational Neuroscience Meeting, CNS97, Big Sky, MT. Received August 21, 1997; accepted March 3, 1998.
LETTER
Communicated by Laurence Abbott
Linearization of F-I Curves by Adaptation Bard Ermentrout Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260, U.S.A.
We show that negative feedback to highly nonlinear frequency-current (F-I) curves results in an effective linearization. (By highly nonlinear we mean that the slope at threshold is infinite or very steep.) We then apply this to a specific model for spiking neurons and show that the details of the adaptation mechanism do not affect the results. The crucial points are that the adaptation is slow compared to other processes and the unadapted F-I curve is highly nonlinear. 1 Introduction Many computational models of neural networks use a simple thresholdlinear output function for the firing rate, primarily for ease of analysis (Salinas & Abbott, 1996; Ben-Yishai, Hansel, & Sompolinsky, 1997, for example). This type of firing rate function can be a good approximation to regular spiking cells in sensory cortex (Mason & Larkman, 1990; Avoli & Olivier, 1989; McCormick, Connors, Lighthall, & Prince, 1985; Stafstrom, Schwindt, & Crill, 1984) Unlike fast spiking cells, regular spiking cells show adaptation to inputs. (Compare Figures 1 and 6 in McCormick et al., 1985.) The initial firing rate is fast and then slows to a steady-state rate. The curves of steady-state firing as a function of the input current (F-I curves) are very close to linear, whereas the initial interspike interval (an approximation of the firing rate before adaptation) is quite nonlinear. Wang (1998) has recently described a model for adaptation and, exploiting the fact that the adaptation process is slow, derives the firing rate and many other properties of adaptation. He uses a model for adaptation based on a calcium-dependent potassium channel (see below). He first considers the model without adaptation and fits the steady-state calcium concentration to a straight line. He then uses this to derive a firing rate for the adaptation, which is linear. This last fact is not surprising since his approximation for calcium concentration is also linear. Our goal here is to show that the F-I curve is linearized in any model with slow adaptation, provided that the unadapted firing rate curve is sufficiently nonlinear. The latter condition occurs in simple models such as the integrate-and-fire neuron, as well as in more “realistic” models such as Traub’s Hodgkin-Huxley-type model for spiking dynamics. That is, we make no a priori approximations; the linearization is a consequence of the analysis, not an assumption. We derive a similar model to Wang’s c 1998 Massachusetts Institute of Technology Neural Computation 10, 1721–1729 (1998) °
1722
Bard Ermentrout
“calcium” model but exploit the full nonlinear aspects of the F-I curve in our subsequent analysis. We close with some specific examples. 2 Negative Feedback and Linearization We start with an abstract model. Suppose that in the absence of negative feedback, the firing rate curve is given by a function, f (I), and we can suppose that the threshold current for firing is I = 0 by shifting the threshold. Then f (I) = 0 if I < 0 and f (I) is continuous for I ≥ 0. This last condition says that the firing rate must tend to zero as the current tends to zero. The Hodgkin-Huxley model does not have this property since the firing rate is some nonzero value as soon as the threshold is crossed (see Rinzel & Ermentrout, 1989). The integrate-and-fire model has a firing rate given by fiaf (I) =
K1 , log(1 + K2 /I)
which does vanish as I → 0. The firing rate for many cortical models, including the Traub model, is well approximated by √ (2.1) fsn (I) = A I. Both of these firing rates are highly nonlinear in that they cannot be approximated by a linear function near threshold. In both cases, the derivative of the function tends to infinity as the current decreases to threshold. Now suppose that there is negative feedback, due to adaptation, which slows the firing rate. Let z be the amount of this negative feedback so that the true firing rate is f (I − gz). The parameter g is the degree of negative feedback. This feedback term is in turn proportional to the firing rate so that z = β f (I − gz).
(2.2)
(This is not an unreasonable assumption, as we will see below in the analysis of biophysical models.) We must solve this for z and then use the result to obtain the true firing rate. Note that if I = 0, then z = 0 and there is no adaptive feedback. To see that firing rate will be linearized, we implicitly differentiate this: ¶ µ dz dz 0 = β f (I − gz) 1 − g , dI dI hence, βf0 dz = . dI 1 + βgf 0 Our nonlinearity assumption implies that the derivative of f is very large when I is small, so that we must have 1 dz = , dI g
Linearization of F-I Curves by Adaptation
1723
and from this we conclude that z≈
I g
and that the true firing rate, F=
I z ≈ . β βg
Thus, the negative feedback makes the behavior of the steady-state firing rate linear with the injected current. Furthermore, the details of the firing rate function itself do not even contribute to the slope of this function at low inputs. Only the degree of adaptation and the proportionality of the adaptation to firing rate matter. As a concrete example, suppose that the firing rate is like equation 2.1. Then we can solve for the true firing rate: fadapt (I) =
−A2 gβ +
p (A2 βg)2 + 4A2 I . 2
(2.3)
Note that the slope of this function at I = 0 is 1/(gβ), as noted in general in the above calculation. 3 Application to Biophysical Models The typical biophysical model consists of several compartments and has the form, C
X dV =− gk (t)(Ek − V) + I, dt k
(3.1)
for each compartment, and the time-dependent conductances have the form, p
q
gk (t) = g¯ k mk (t)hk (t), where the gates, mk , hk , obey first-order differential equations that are dependent on voltage or other quantities such as calcium. We want to separate out one of these conductances as having slow dynamics compared to all the others. This will represent the adaptation. Thus, we will write the sum in equation 3.1 as ¯ − Eadapt ), −Iion − gz(V where dz = ²Z(V, z, . . .), dt
1724
Bard Ermentrout
and ² is a small, positive parameter. Now we make our main assumption about the dynamics of the fast subsystem. If g = 0 as the current increases, the dynamics makes a transition from rest to repetitive firing via a saddlenode bifurcation on a circle. This type of bifurcation occurs in most models of cortical neurons when there is no adaptation. Then, it is known (see Rinzel & Ermentrout, 1989; Ermentrout, 1994; Hoppensteadt & Izhikevich, 1997) that the firing rate near the critical current at which the transition is made is p f = A I − Ic . Thus, the square-root firing relation discussed in the previous section arises naturally from the dynamics. In many neurophysiology articles, the F-I curve is depicted as bilinear (cf. Stafstrom et al. 1984), with a steep slope at low currents and a shallower slope at high currents. This is exactly what one would expect with a squareroot relationship. Figure 1A shows the data from their article together with the square-root F-I curve. (The points on this graph were obtained by digitizing their figure of the first interspike interval for cat layer V neurons during current injection. The first interspike interval presumably reflects the instantaneous firing rate before adaptation turns on.) Now we turn to the slow dynamics of the adaptation. We write the dynamics as dz = ²(H(V . . .) − z), dt where H depends only on the “fast” equations. When there is no firing, we assume that H is very close to zero so the adaptation does not contribute much hyperpolarization at rest. When firing repetitively, the potential and the other variables are periodic so that we can formally average the slow equation, obtaining Z 1 T dz = ²( H(V(t), . . .) dt − z). dt T 0 Last, we suppose that the spike width does not vary much over a range of current; thus, the integral of H over one period of the oscillation is roughly a constant, say, β. Since 1/T is just the firing rate, f , we see that dz ≈ ²(β f − z). dt Finally, the firing rate is proportional to the input current and the total outward current due to the adaptation. Near the transition from rest to repetitive firing, the potential spends most of the time near rest, so we can approximate the adaptation current by ¯ ¯ Iadapt ≈ gz(V rest − Eadapt ) ≡ gz1.
(3.2)
Linearization of F-I Curves by Adaptation
1725
We thus obtain the simple model for adaptation: p dz ¯ = ²(βA I − Ic − g1z − z). dt
(3.3)
This allows us to model adaptation quantitatively because each of these parameters is readily computed. Only A, β require us to compute numerically the actual solutions as a function of the input; all other parameters are readily available from the model. Wang (1998) has noted that the sequence of interspike intervals can be fitted to an exponential function and has used his simple linear equation to derive the time constant. Equation 3.3 is not linear, so the approach to steady state is not so readily computed. However, empirically, we find that β is small, so that z will be quite small as long as βA is small compared to g1, and we get the linear approximation µ ¶ p βAg1 dz = ²βA I − Ic − ² 1 + √ z. dt 2 I − Ic The effective time constant is thus µ
1/τadapt
βAg1 =² 1+ √ 2 I − Ic
¶ .
(3.4)
As Wang notes, the actual time constant for adaptation is faster than the time constant of the slow adaptation process. In fact, near threshold, the time constant can be quite small. 4 Examples 4.1 Traub Model with an M Current. We first consider Traub’s model for spiking dynamics (Traub & Miles, 1991) with a slow outward potassium current. The equations are 3.1 where the fast current is: Iion = gNa hm3 (V − ENa ) + gL (V − EL ) + gK n4 (V − EK ). The gating variables, m, h, n, obey equations of the form: dy = ay (V)(1 − y) − by (V)y. dt We used am (V) = .32(54 + V)/(1 − exp(−(V + 54)/4)), bm (V) = .28(V + 27)/(exp((V + 27)/5) − 1), ah (V) = .128 exp(−(50 + V)/18), bh (V) = 4/(1 + exp(−(V + 27)/5)), an (V) = .032(V + 52)/(1 − exp(−(V + 52)/5)), and bn (V) = .5 exp(−(57 + V)/40). The other parameters are C = 1 µF/cm2 , gNa = 100, gK = 80, gL = .1 mS/cm2 , and ENa = 50, EL = −67, EK = −100 mV.
1726
Bard Ermentrout
With these parameters, the critical current for the onset of rhythmicity is 0.45 µA/cm2 . We find (see Figure 1B) that a good fit for the unadapted firing rate is p f (I) ≈ 60 I − Ic , which is accurate except at high firing rates, where it is a little low. (The reason is that at high currents, the unadapted firing rate actually becomes more linear.) We add adaptation via the current: Iadapt = gz(V − EK ), where, dz = 0.01(1/(1 + exp(−(V + 20)/5)) − z). dt In Figure 2A, we show the result of the adaptation when a current pulse of 5 µA/cm2 is injected and g = 5 mS/cm2 . The initial firing rate is about 125 Hz and the steady-state firing rate is about 50 Hz. A numerical calculation shows that the average magnitude of the adaptation is proportional to the firing rate with factor, β = 0.00045. Using equation 2.3 along with equation 3.2 we get an approximation for the steady-state firing rate when there is adaptation. In Figure 1C, we plot the steady-state firing rate along with the computed firing rate from these two formulas. The fit is very good even though the nonadaptive firing rate approximation was somewhat low at high rates of firing. Figure 2B shows the slow variable z for the full model, as well as the solution to equation 3.3 with the parameters as chosen above. The dynamics is captured very closely. As with Wang’s simulations, random inputs into the model are very nicely reproduced with the averaged model even though the inputs occur at a fast time scale. Equation 3.4 shows that for current near criticality, the time constant is quite small, so the system can respond quickly. 4.2 Calcium-Dependent Potassium Current. A more likely type of adaptation is one that arises from a calcium-dependent potassium current. We have computed this model as well. The fast ionic current is as above, with the addition of a high-threshold calcium current: ICa = gCa m∞ (V)(V − ECa ), where m∞ (V) = 1/(1 + exp(−(v + 25)/5)), ECa = 120 mV, and gCa = 1 mS/cm2 . Calcium obeys the dynamics d[Ca2+ ] = −.002ICa − .0125[Ca2+ ], dt
Linearization of F-I Curves by Adaptation
1727
Figure 1: F-I curves for real cells and for models of cells computed numerically along with analytical functions obtained by averaging. (A) F-I curve from Stafstrom et al. (1984) fitted by the square-root function. (B) Traub model with no adaptation is well fitted by a square-root function. (C) Traub model with slow voltage-dependent potassium current with conductance, g = 5 mS/cm2 , and the analytic approximation based on equation 2.3. (D) Calcium-based model with a calcium-dependent potassium current, gahp = 5 mS/cm2 and the analytic approximation. In curves B, C, and D, current is in µA/cm2 .
and the adaptive current is Iahp = gahp ([Ca2+ ]/(30 + [Ca2+ ]))(V − EK ). Here, the calcium concentration is the slow parameter. The same type of calculation above shows that the average calcium concentration in the absence of adaptation is a factor of β = .008 times the firing rate. The unadapted firing rate is about the same as the firing rate for our previous model. Figure 1D shows a plot of the steady-state firing rate for this model for gahp = 5 mS/cm2 using A = 60, β = 0.008 and approximating the nonlinear gate, ([Ca2+ ]/(30 + [Ca2+ ])) by [Ca2+ ]/30 since the calcium concentration stays well below 30. As in the simpler adaptation model, the approximation of the steady-state firing rate fits the computed one very well.
1728
Bard Ermentrout
Figure 2: Response of the Traub model to a current pulse showing adaptation of the firing rate. (A) Voltage. (B) The adaptation gating variable for the full model using a voltage-dependent potassium current along with a solution to the averaged equation (3.3).
5 Discussion We have derived a model for adaptation similar to that derived by Wang (1998). However, we have used no approximations that do not arise directly from formal analysis of the dynamics. In particular, we have have not approximated the amount of adaptation by any ad hoc functions but rather obtained the approximations from the behavior of systems near a saddle-node limit cycle bifurcation. We have used a simple negative feedback argument to show that adaptation always linearizes nonlinear firing rates that are very steep near the transition to repetitive firing. There are some intrinsic heuristics in our simplification. The square-root form of the firing rate is formally exact only in the neighborhood of the saddle-node bifurcation. However, in that neighborhood, the frequency of the oscillation is very small, so that in order to average the equations formally to get equation 3.3, the adaptation would have to be unrealistically slow. Thus, we have used the square root formula for the firing rate beyond the regime where it is formally correct. The result is that we have a model for linearization that is not valid right at the bifurcation or valid far from the bifurcation, but rather at some intermediate range. Nevertheless, the numerical calculations show that the formula is good over a far greater range than would be expected from a formal mathematical point of view. Most neurons are better modeled by multiple compartments. In this case, there will be additional slow variables, and the simple scalar model (see equation 3.3) will not be valid. Instead, we will obtain equations for the
Linearization of F-I Curves by Adaptation
form:
1729
s X dzi = ²(βi I − Ic − gij zj ) − zi ), dt j
where βi and gij are related to how the adaptation in different compartments affects the somatic potential. Here we assume that only the soma is able to spike. The analysis of these equations is not as straightforward and would certainly depend on the number and strength of coupling between compartments. Acknowledgments This work was supported in part by the National Science Foundation and the National Institute of Mental Health. References Avoli, M., & Olivier, A. (1989). Electrophysiological properties and synaptic responses in the deep layers of the human epileptogenic neocortex in vitro. J. Neurophysiol, 61, 589–606. Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in cortical models. J. Comput. Neuro., 4, 57–77. Ermentrout, G. B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Comp., 6, 679–695. Hoppensteadt, F., & Izhikevich, E. (1997). Weakly connected neural nets. Berlin: Springer-Verlag. Mason, A., & Larkman, A. (1990). Correlation between morphology and electrophysiology of pyramidal neurons in slices of rat visual cortex, II. Electrophysiology. J. Neurosci, 10, 1415–1428. McCormick, D. A., Connors, B. W., Lighthall, J. W., & Prince, D. A. (1985). Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons in the neocortex. J. Neurophysiol., 54, 782–806. Rinzel, J. M., & Ermentrout, G. B. (1989). Analysis of neuronal excitability. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 135–170). Cambridge, MA: MIT Press. Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. PNAS(USA), 93, 11956–11961. Stafstrom, C. E., Schwindt P. C., & Crill, W. E. (1984). Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol., 52, 264–277. Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press. Wang, X. J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophys, 79, 1549–1566. Received November 14, 1997; accepted February 23, 1998.
LETTER
Communicated by Alexandre Pouget
Mutual Information, Fisher Information, and Population Coding Nicolas Brunel Jean-Pierre Nadal Laboratoire de Physique Statistique de I’E.N.S.,† Ecole Normale Sup´erieure, 75231 Paris Cedex 05, France
In the context of parameter estimation and model selection, it is only quite recently that a direct link between the Fisher information and informationtheoretic quantities has been exhibited. We give an interpretation of this link within the standard framework of information theory. We show that in the context of population coding, the mutual information between the activity of a large array of neurons and a stimulus to which the neurons are tuned is naturally related to the Fisher information. In the light of this result, we consider the optimization of the tuning curves parameters in the case of neurons responding to a stimulus represented by an angular variable. 1 Introduction A natural framework to study how neurons communicate, or transmit information, in the nervous system is information theory (see, e.g., Blahut, 1988; Cover & Thomas, 1991). In recent years the use of information theory in neuroscience has motivated a large amount of work (e.g., Laughlin, 1981; Linsker, 1988; Barlow, Kaushal, & Mitchison, 1989; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Van Hateren, 1992; Atick, 1992; Nadal & Parga, 1994). A neurophysiologist often asks in an informal sense how much information the spike train of a single neuron, or of a population of neurons, provides about an external stimulus. For example, a high activity of a CA3 hippocampal neuron may tell with good precision where a rat is in an environment. Information theory provides mathematical tools for measuring this “information” or “selectivity”: signals are characterized by a probability distribution, and the spike train of a neuron, or of a population, is characterized by a probability distribution conditioned by the signal. The mutual information between the signal and the neural representation is then a measure of the statistical dependency between the signal and the spike train(s).
†
Laboratory associated with CNRS, ENS, and universities Paris 6 and Paris 7.
c 1998 Massachusetts Institute of Technology Neural Computation 10, 1731–1757 (1998) °
1732
Nicolas Brunel and Jean-Pierre Nadal
A related domain, which also belongs to information theory, is the field of statistical parameter estimation. Here one typically has a sample of observations drawn from a distribution that depends on a parameter, or a set of parameters, that one wishes to estimate. The Cramer-Rao inequality then tells us that the mean squared error of any unbiased estimator of the underlying parameter(s) is lower bounded by the inverse of a quantity, which is defined as the Fisher information (Blahut, 1988). This means that the Fisher information is a measure of how well one can estimate a parameter from an observation with a given probability law. Thus in this sense it is also an “information” quantity. In spite of the similar intuitive meanings of these two quantities, an explicit relationship between the Fisher information and information-theoretic quantities has been derived only recently (Clarke & Barron, 1990; Rissanen, 1996), in the limit of a large number of observations. This link was exhibited first in the context of parameter estimation (Clarke & Barron, 1990) for the case of statistically independent and identically distributed observations. Then it was generalized to a broader context within the framework of stochastic complexity, with, as a result, a refined “minimum description length” criterion for model selection (Rissanen, 1996). The first goal of this article is to show that within the framework of information theory, this link manifests itself very naturally in the context of neural coding: • In the limit of a large number of neurons coding for a low-dimensional stimulus (population coding), the mutual information between the activities of the neuronal population and the stimulus becomes equal to the mutual information between the stimulus and an efficient gaussian estimator, under appropriate conditions, detailed in section 3. Here “efficient” means that the variance of this estimator reaches the Cramer-Rao bound. Since this variance is related to the Fisher information, the equality provides a quantitative link between mutual and Fisher informations. • This equality is also shown to hold for a single cell in the case of a gaussian noise with vanishing variance. • The mutual information between the stimulus and an efficient gaussian estimator reaches the mutual information between stimulus and the neuronal activities asymptotically from below. In the light of this relationship between Fisher and mutual information, we examine in section 5 several issues related to population codes, using neurons coding for an angular variable with a triangular or bell-shaped tuning curve. Such neurons are common in many neural structures. Cells of the postsubiculum (Taube, Muller, & Ranck, 1990) and anterior thalamic nuclei (Taube, 1995) of the rat are tuned to its head direction. Cells in MT cortex (Maunsell & Van Essen, 1983) of the monkey are tuned to the direction
Mutual Information, Fisher Information, and Population Coding
1733
of perceived motion. Cells in motor cortex of the monkey (Georgopoulos, Kalaska, Caminiti, & Massey, 1982) are tuned to the direction of the arm. We study the case of an array of N neurons, firing as a Poisson process in response to an angular stimulus with a frequency defined by the tuning curve of the neuron, in an interval of duration t. In many cases, Poisson processes are considered to be reasonable approximations of the firing process of cortical neurons (see, e.g., Softky & Koch, 1993). We calculate the Fisher information with an arbitrary density of preferred angles. Next we address the question of the optimization of the tuning curves, making use of the link between mutual information and Fisher information. The optimal density of preferred angles (i.e., the one that maximizes the mutual information) is calculated as a function of the distribution of angles, in section 5.2. As shown by Seung and Sompolinsky (1993), the Fisher information, in the large N limit, diverges when the tuning width of the neurons goes to zero. We show in section 5.3 that a finite tuning width stems from optimization criteria, which consider a finite system in which only a small number of spikes has been emitted by the whole population. We illustrate our results using triangular tuning curves in section 5.4. 2 General Framework 2.1 Parameter Estimation and Population Coding. In the general context of parameter estimation, one wishes to estimate a parameter θ from a set of N observations {xi , i = 1, . . . , N} ≡ xE (where the xi ’s might be discrete or continuous). θ may characterize a model P(E x|θ ), which is expected to be a good description of the stochastic process generating the observations {xi }. In the simplest case, the xi ’s are independent realizations of the same random variable, and P(E x|θ) =
N Y
p(xi |θ).
(2.1)
i=1
It may be the case—but this is not necessary—that the true process p∗ (x) belongs to the family under consideration, so that p∗ (x) = p(x|θt ) where θt is the true value of the parameter. In the context of sensory coding, and more specifically population coding (see, e.g., Seung & Sompolinsky, 1993, Snippe, 1996), θ is a stimulus (e.g,. an angle), and the information about this stimulus is contained in the activities {xi , i = 1, . . . , N} of a population of a large number N of neurons. In the simplest case xi represents the activity of the ith neuron of the output layer of a feedforward network with no lateral connection, so that the probability density function (p.d.f.) P(E x|θ) is factorized: P(E x|θ) =
N Y i=1
pi (xi |θ).
(2.2)
1734
Nicolas Brunel and Jean-Pierre Nadal
Here pi (xi |θ) is the (neuron-dependent) p.d.f. of the activity xi at neuron i when the input stimulus takes the value θ . If the task of the neural system is to obtain a good estimate of the stimulus value, the problem is a particular case of parameter estimation where there exists a true value—the one that generated the observed activity xE. 2.2 The Cramer-Rao Bound. In general one can find different algorithms for computing an estimate θˆ (E x) of θ from the observation of xE. If the chosen estimator θˆ (algorithm) is unbiased, that is, if Z ˆ x) = θ, x|θ)θ(E dN xP(E the variance of the estimator, D E σθ2 = (θˆ − θ)2 , θ
in which h . iθ denotes the integration over xE given θ (a sum in the case of a discrete state vector) with the p.d.f. P(E x|θ ), is bounded below according to (Cramer-Rao bound; see, e.g., Blahut, 1988): σθ2 ≥
1 J (θ)
(2.3)
where J (θ) is the Fisher information: ¿ À ∂ 2 ln P(E x|θ) J (θ) = − . ∂ θ2 θ
(2.4)
For a multidimensional parameter, equation (2.3) is replaced by an inequality for the covariance matrix, with J (θ ), the Fisher information matrix, being then expressed in terms of the second derivatives of ln P(E x|θ ) (Blahut, 1988). For simplicity we will restrict the discussion to the case of a scalar parameter, and consider the straightforward extension to the multidimensional case in section 3.2. An efficient estimator is one that saturates the bound. The maximum likelihood (ML) estimator is known to be efficient in the large N limit. 3 Mutual Information and Fisher Information 3.1 Main Result. We now give the interpretation of the Cramer-Rao bound in terms of information content. First, note that the Fisher information (see equation (2.4)) is not itself an information quantity. The terminology comes from an intuitive interpretation of the bound: our knowledge (“information”) about a stimulus θ is limited according to this bound. This qualitative statement has been turned into a quantitative statement in Clarke
Mutual Information, Fisher Information, and Population Coding
1735
and Barron (1990) and Rissanen (1996). Here we give a different presentation based on a standard information-theoretic point of view, which is relevant for sensory coding, rather than from the point of view of parameter estimation and model selection. We consider the mutual information between the observable xE and the stimulus θ . It can be defined very naturally in the context of sensory coding because θ is itself a random quantity, generated with some p.d.f. ρ(θ), which characterizes the environment. The mutual information is defined by (Blahut, 1988): Z P(E x|θ ) , (3.1) x|θ ) log I[θ, xE] = dθ dN xρ(θ) P(E Q(E x) where Q(E x) is the p.d.f. of xE: Z Q(E x) = dθρ(θ)P(E x|θ).
(3.2)
Other measures of the statistical dependency between input and output could be considered, but the mutual information is the only one (up to a multiplicative constant) satisfying a set of fundamental requirements (Shannon & Weaver, 1949). Suppose there exists an unbiased efficient estimator θˆ = T(E x). It has mean θ and variance 1/J (θ). The amount of information gained about θ in the computation of that estimator is Z ˆ − dθρ(θ )H[θˆ |θ], (3.3) I[θ, θˆ ] = H[θ] ˆ is the entropy of the estimator, where H[θ] Z ˆ = − dθˆ Pr(θ) ˆ ln Pr(θˆ ), H[θ] ˆ ] its entropy given θ . The latter, for each θ, is smaller than the and H[θ|θ entropy of a gaussian distribution with the same variance 1/J (θ ). This implies µ ¶ Z 2π e 1 ˆ − dθρ(θ ) ln ˆ ≥ H[θ] . (3.4) I[θ, θ] 2 J (θ) Since processing cannot increase information (see, e.g., Blahut, 1988, pp. 158–159), the information I[θ, xE] conveyed by xE about θ is at least equal to the one conveyed by the estimator: I[θ, xE] ≥ I[θ, θˆ ]. For the efficient estimator, this means µ ¶ Z 2π e 1 ˆ . (3.5) I[θ, xE] ≥ H[θ] − dθρ(θ) ln 2 J (θ )
1736
Nicolas Brunel and Jean-Pierre Nadal
In the limit in which the distribution of the estimator is sharply peaked around its mean value (in particular, this implies J (θ ) À 1), the entropy of the estimator becomes identical to the entropy of the stimulus. The righthand side (r.h.s.) in the above inequality then becomes equal to IFisher plus terms of order 1/J (θ), with IFisher defined as Z IFisher = H(2) −
µ ¶ 1 2π e dθρ(θ ) ln . 2 J (θ)
(3.6)
In the above expression, the first term is the entropy of the stimulus, Z
H(θ) = −
dθρ(θ) ln ρ(θ).
(3.7)
For a discrete distribution, this would be the information gain resulting from a perfect knowledge of θ . The second term is the equivocation due to the gaussian fluctuations of the estimator around its mean value. We thus have, in this limit of a good estimator, I[θ, xE] ≥ IFisher .
(3.8)
The inequality (see equation 3.8), with IFisher given by equation 3.6, gives the essence of the link between mutual information and Fisher information. It results from an elementary application of the simple but fundamental theorem on information processing, and of the Cramer-Rao bound. If the Cramer-Rao bound was to be understood as a statement on information content, I[θ, xE] could not be strictly larger than IFisher . If not, there would be a way to extract from xE more information than IFisher . Hence the above inequality would be in fact an equality, that is: Z I[θ, xE] = −
Z dθρ(θ) ln ρ(θ) −
µ ¶ 2π e 1 . dθρ(θ) ln 2 J (θ)
(3.9)
However, the fact that the equality should hold is not obvious. The CramerRao bound does not tell us whether knowledge on cumulants other than the variance could be obtained. Indeed, if the estimator has a nongaussian distribution, the inequality will be strict; we will give an example in section 4 where we discuss the case of a single output cell (N = 1). In the large N limit, however, there exists an efficient estimator (the maximum likelihood), and relevant probability distributions become close to gaussian distributions, so that one can expect equation 3.9 to be true in that limit. This is indeed the case, and what is proved in Rissanen (1996) within the framework of stochastic complexity, under suitable but not very restrictive hypotheses.
Mutual Information, Fisher Information, and Population Coding
1737
In the appendix, we show, using completely different techniques, that equation 3.9 holds provided the following conditions are satisfied: 1. All derivatives of G(E x|θ) ≡ ln P(E x|θ )/N with respect to the stimulus θ are of order one. 2. The cumulants (with respect to the distribution P(E x|θ )) of order n of aG0θ + bG00θ are of order 1/Nn−1 for all a, b, n. The meaning of the second condition is that at a given value of N, the cumulants should decrease sufficiently rapidly with n. This is in particular true when xi given θ are independent, as for model 2.2, but holds also in the more general case when the xi are correlated, provided the above conditions hold, as we show explicitly in the appendix using an example of correlated xi . 3.2 Extensions and Remarks. Multiparameter Case and Model Selection. It is straightforward to extend equation 3.8 to the case of a K-dimensional stimulus θE with p.d.f. ρ(θE ), and to derive the equality equation 3.9 for K ¿ N. The Fisher information matrix is defined as (Blahut, 1988) ³ ´ + * ³ ´ ∂ 2 ln P xE|θE Jij θE = − . ∂θi ∂θj E θ
The quantity IFisher for the multidimensional case is then Z IFisher = −
E ln ρ(θE) − dK θ ρ(θ)
Z
¶ µ (2π e)K 1 . dK θ ρ(θE) ln 2 det J (θE)
(3.10)
The second term is now equal to the entropy of a gaussian with covariance matrix J −1 (θE ), averaged over θE with p.d.f. ρ(θE). In the large N limit (K << N), one gets as for K = 1 the equality I = IFisher . One can note that formulas 3.9 and 3.10 are also meaningful in the more general context of parameter estimation, even when θ is not a priori a random variable. Within the Bayesian framework (Clarke & Barron, 1990), it is natural to introduce a prior distribution on the parameter space, ρ(θ ). Typically, this distribution is chosen as the flattest possible one that takes into account any prior knowledge or constraint on the parameter space. Then I tells us how well θ can be localized within the parameter space from the observation of the data xE. Within the framework of MDL (minimum description length) (Rissanen, 1996) the natural prior is the one that maximizes the mutual information— that is, the one realizing the Shannon capacity. Maximizing I = IFisher with
1738
Nicolas Brunel and Jean-Pierre Nadal
respect to ρ, one finds that this optimal input distribution is given by the square root of the Fisher information: √ J (θ) ρ(θ) = R 0 √ dθ J (θ 0 ) (for the multidimensional case, J in the above expression has to be replaced by det J ). This corresponds to the stimulus distribution for which the neural system is best adapted. Biased Estimators. The preceding discussion can be easily extended to the case of biased estimators, that is, for estimators θˆ with < θˆ >θ = m(θ ) 6= θ. The Cramer-Rao bound in such a case reads ³
σθ2 ´2 ≥
dm dθ
1 . J (θ)
(3.11)
This is a form of the bias-variance compromise. One can thus write an inequality similar to equation 3.4, replacing J by J /(dm/dθ )2 . In the limit where the estimator is sharply peaked around its mean value m(θ), one has ˆ θ, ˆ and θˆ ∼ m(θ ), so that ρ(θ)dθ ∼ P(θ)d Z dm ˆ = H[θ] + dθρ(θ) log | |. H[θ] dθ ˆ in the r.h.s. of the inequality 3.4, the terms dm cancel. Upon inserting H[θ] dθ The bound, equation 3.8, is thus also valid even when the known efficient estimator is biased. The Cramer-Rao bound can also be understood as a bound for the discriminability d0 used in psychophysics for characterizing performance in a discrimination task between θ and θ + δθ (see, e.g., Green & Swets, 1966). As discussed in Seung and Sompolinsky (1993), p d0 ≤ δθ J (θ),
(3.12)
with equality for an efficient estimator, and with d0 properly normalized with respect to the bias:
d02 =
³ ´2 δθ dm dθ σθ2
.
(3.13)
4 The Case of a Single Neuron 4.1 A Continuous Neuron with Vanishing Output Noise. We consider the case of a single neuron characterized by a scalar output V with a deter-
Mutual Information, Fisher Information, and Population Coding
1739
ministic function of the input (stimulus) θ plus some noise, with a possibly stimulus-dependent variance, p V = f (θ) + z σ g(θ),
(4.1)
where f and g are deterministic functions, and σ is a parameter giving the scale of the variance of the noise, and z is a random variable with an arbitrary (that is, not necessarily gaussian) distribution Q(z) with zero mean and unit variance. We are interested in the low noise limit, σ → 0. It is not difficult to write the Fisher information J (θ) and the mutual information I[θ, V] in the limit of vanishing σ . One gets, for sufficiently regular Q(.), Z I[θ, V] = H(2) +
0
1 f 2 (θ ) dθρ(θ ) log 2 − H(Z), 2 σ g(θ)
(4.2)
where H(Z) is the entropy of the z-distribution Q: Z
H(Z) = −
dzQ(z) log Q(z).
(4.3)
For the Fisher information one finds 0
f 2 (θ) J (θ) = 2 σ g(θ)
Z
0
Q 2 (z) , dz Q(z)
(4.4)
so that Z IFisher [θ, V] = H(2) +
0
f 2 (θ ) 1 1 + log dθρ(θ ) log 2 2 σ g(θ) 2
Z
0
dz
Q 2 (z) .(4.5) Q(z)
If the noise distribution Q is the normal distribution, one has H(Z) = log 2π e, and the integral in equation 4.4 is equal to 1, so that one has I = IFisher . Otherwise one can easily check that I > IFisher , in agreement with the general result (see equation 3.8). 1 2
4.2 Optimization of the Transfer Function. The maximization of the mutual information with respect to the choice of the transfer function f has been studied in the case of a stimulus-independent additive noise, that is, g ≡ 1, by Laughlin (1981) and Nadal and Parga (1994). The expression for the mutual information, equation 4.2, with g = 1, has been computed by Nadal and Parga (1994). What is new here is the link with the Fisher information. The mutual information is maximized when f is chosen according to the “equalization rule,” that is, when the (absolute value of) the derivative of f is equal to the p.d.f. ρ: the activity V is then uniformly distributed between
1740
Nicolas Brunel and Jean-Pierre Nadal
its min and max values. In the more general case in which g depends on the stimulus, the maximum of I is reached when fˆ, defined by √ fˆ0 ≡ f 0 / g, satisfies the equalization rule, fˆ = A
Z
θ
dxρ(x) + B,
(4.6)
where A and B are arbitrary given parameters (for g = 1, they define the min and max values of f ). An interesting case is g = f , which is relevant for the analysis of a Poisson neuron in the large time limit (see the next p 0 √ subsection). In this case f 0 / g = 2 f , and the maximum of I is reached when the square root of f satisfies the equalization rule. The fact that the mutual information is related to the Fisher information in the case of a single neuron with vanishing noise means that maximizing information transfer is identical to minimizing the variance of reconstruction error. In fact, two different qualitative lines of reasoning were known to lead to the equalization rule: one related to information transfer (the output V should have a uniform distribution; see, e.g., Laughlin, 1981) and one related to reconstruction error. (The slope of the transfer function should be as large as possible in order to minimize this error, and this, with the constraint that f is bounded, leads to the compromise | f 0 | = ρ. A large error can be tolerated for rare events.) We have shown here the formal link between these two approaches, using the link between mutual and Fisher information. 4.3 A Poisson Neuron. A related case is the one of a single neuron emitting spikes according to a Poisson process (in the next section we will consider a population of such neurons). The probability for observing k spikes in the interval [0, t] while the stimulus θ is perceived, is p(k|θ) =
(ν(θ)t)k exp(−ν(θ)t), k!
(4.7)
where the frequency ν is assumed to be a deterministic function ν(θ ) (the tuning curve) of the stimulus θ : θ → ν = ν(θ).
(4.8)
If the stimulus is drawn randomly from a distribution ρ(θ ), the frequency distribution P (ν) is given by Z
P (ν) =
dθρ(θ) δ( ν − ν(θ)).
(4.9)
Mutual Information, Fisher Information, and Population Coding
1741
The information-processing ability of such model neuron has been studied in great detail by Stein (1967). The results of interest here are as follows. At short times, the mutual information between the stimulus and the cell activity is, at first order in t (Stein 1967), Z dν P (ν)ν log
I(t) ∼ t
ν ≡ I1 (t), µ
(4.10)
where µ is the mean frequency. One can easily check that I1 (t) ≥ I(t) for any duration t. In fact at long times, information increases only as log t: in the large time limit, one gets (Stein 1967) Z I(t) =
Ã
r
2π eν dν P (ν) log P (ν) t
! .
(4.11)
From this expression, one gets that the optimal tuning curve is such that √ ν is uniformly distributed between its extreme values νmin and νmax . We can now analyze this result in view of the relationship between Fisher and mutual information. Making the change of variable ν → θ, with ρ(θ)dθ = P (ν)dν, together with equation 4.8, one can rewrite the mutual information at large times precisely as I(t) = IFisher ,
(4.12)
where IFisher is defined as in equation 3.6 with J (θ) the Fisher information associated with this single neuron: 0
J (θ) = t
ν 2 (θ) . ν(θ)
(4.13)
This result can be understood in the following way. In the limit of large t, the distribution of the number of emitted spikes divided by t, V ≡ k/t tends to be a gaussian, with mean ν(θ ) and variance ν(θ)/t, so that the properties of the spiking neuron become similar to those of a neuron having a continuous activity V, given by θ → V = ν(θ) + z
p ν(θ)/t,
where z is a gaussian random variable with zero mean √ and unit variance. This is a particular case of equation 4.1, with σ = 1/ t, f (.) = g(.) = ν(.).
1742
Nicolas Brunel and Jean-Pierre Nadal
5 Population of Direction-Selective Spiking Neurons 5.1 Fisher Information. We now illustrate the main statement of section 3 in the context of population coding. We consider a large number N of neurons coding for a scalar stimulus, (e.g., an angle). Equation 3.9 tells us that to compute the mutual information, we first have to calculate the Fisher information. x|θ ) = When the activities {xi } of the neurons given θ are independent, P(E 5pi (xi |θ), the Fisher information can be written
J (θ) =
N X
*
i=1
1 p2i (xi |θ)
µ
∂pi (xi |θ ) ∂θ
¶2 + ,
(5.1)
i,θ
where h.ii,θ is the integration over xi with the p.d.f. pi (xi |θ ). We restrict ourselves to the case of neurons firing as a Poisson process with rate νi (θ) in response to a stimulus θ ∈ [−π, π ]. νi (θ) therefore represent the tuning curve of neuron i. We make the following assumptions: νi (θ ) has a single maximum at the preferred stimulus θi ; the tuning curve depends on only the distance between the current stimulus and the preferred one and is a periodic function of this distance, νi (θ) = φ(θ − θi ),
(5.2)
through the same function φ. The locations of the preferred stimuli of the neurons are independently and identically distributed (i.i.d.) variables in the interval θ ∈ [−π, π ] with density r(θ ). Since our model neurons fire as a Poisson process, the information contained in their spike trains in an interval of duration t is fully contained in the number of spikes xi emitted by each neuron in this interval. For a Poisson process we have the law pi (xi |θ) =
(νi (θ)t)xi exp(−νi (θ )t). xi !
(5.3)
From equations 5.1 and 5.3 we can easily calculate the Fisher information:
J (θ) = t
N X ν 0 (θ)2 i
νi (θ)
i=1
.
For N large we can replace the sum by the average over the distribution of preferred stimuli, that is, Z
J (θ) = tN
π
−π
dθ 0 r(θ 0 )
φ 0 (θ − θ 0 )2 . φ(θ − θ 0 )
Mutual Information, Fisher Information, and Population Coding
1743
Figure 1: Triangular tuning curve corresponding to a minimal frequency νmin = 0.5 Hz, νmax = 40 Hz, a receptive field half-width a = 40 degrees, a preferred angle θi = 60 degrees.
For an isotropic distribution r(θ) = 1/(2π ) we recover the result of Seung and Sompolinsky (1993). To understand how the Fisher information depends on other parameters of the tuning curve φ, we redefine ¶ µ |θ − θi | , φ(θ − θi ) = {νmin + (νmax − νmin )8 a where νmin and νmax are the minimal and maximal frequency, a is the width of the tuning curve, and 8 is a decreasing function of |θ − θi |/a such that 8 = 1 for the preferred stimulus θ = θi , and 8 = 0 for stimuli far from the preferred stimulus, |θ − θi | À a. In terms of these parameters we have Z 80 (z)2 (νmax − νmin ) dzr(θ + az) νmin . J (θ) = tN a νmax −νmin + 8(z) The particular case of a triangular tuning curve, ½ 8(x) =
(1 − |x|) 0.
x ∈ [−1, 1] |x| > 1,
(5.4)
is shown in Figure 1. It will be considered in more detail below. For this tuning curve, and for a uniform distribution of preferred stimuli, the Fisher information has the simple form,
J (θ) = tN
(νmax − νmin ) νmax ln . πa νmin
(5.5)
1744
Nicolas Brunel and Jean-Pierre Nadal
Thus, as already noted by Seung and Sompolinsky (1993), the Fisher information diverges in different extreme cases: when the maximal frequency νmax goes to infinity and when the tuning width a goes to zero. Moreover, functions √ 8 can be found such that the Fisher information diverges (e.g., 8(x) = 1 − x2 ) for any value of νmin , νmax , and a. Thus, the optimization of the Fisher information with respect to these parameters is an ill-defined problem without additional constraints. Note that in these cases, the equation relating the Fisher information to the mutual information is no longer valid. There is, however, a well-defined optimization problem, which is the optimization with respect to the distribution of preferred orientations. It is considered in section 5.2. Then we show how finite size effects transform the problem of the optimization of both Fisher and mutual information with respect to the tuning width a into a well-defined problem. Last, we present some numerical estimates of these quantities, inserting some real data (Taube et al., 1990) in equation 3.9. 5.2 Optimization over the Distribution of Preferred Orientations. We ask which distribution of preferred orientations r(θ) optimizes the mutual information I. Obviously the optimal r will depend on the distribution of orientations ρ(θ). OptimizingR equation 3.9 with respect to r(θ 0 ) subject to the normalization constraint r(θ 0 )dθ 0 = 1 gives Z ρ(θ) ψ(θ − θ 0 ) = ct for all θ 0 , dθ R 00 00 dθ r(θ )ψ(θ − θ 00 ) in which we have defined ψ(x) =
φ 0 (x)2 . φ(x)
This condition is satisfied when R 0 0 dθ r(θ )ψ(θ − θ 0 ) R . ρ(θ) = dθ 0 ψ(θ 0 )
(5.6)
(5.7)
Thus, the optimal distribution of preferred stimuli is the one that, convolved with ψ (i.e., a quantity proportional to the Fisher information), matches the distribution of stimuli. Of course in the particular case of ρ(θ ) = 1/(2π ), we obtain ropt (θ) = 1/(2π). Note that equation 5.7 is also valid for unbounded stimulus values. This result (equation 5.7) is specific to the optimization of the mutual information. Different results would be obtained for, say, the maximization of the average of the Fisher information or the minimization of the average of its inverse. In fact, there is no optimum for the mean Fisher information, since it is linear in r(.).
Mutual Information, Fisher Information, and Population Coding
1745
Figure 2: (Left) SD of the reconstruction error after a single spike, as a function of a. (Right) Mutual information between the spike and the stimulus as a function of a. Note that minimizing the SD of the reconstruction error is in this case different from maximizing the mutual information.
5.3 Finite Size Effects: The Case of a Single Spike. We have seen that the Fisher information, in the large N limit, diverges when the tuning width a goes to zero. To investigate whether this property is specific to the large N limit, we study the case of a finite number of neurons in a very short time interval in which a single spike has been emitted by the whole population in response to the stimulus θ . In this situation, it is clear that the optimal estimator of the stimulus (the ML estimate in that case) is given by the preferred stimulus of the neuron that emitted the spike. For finite N, the Cramer-Rao bound in general is not saturated, and we have to calculate directly the performance of the estimator. It is a simple exercise to calculate the standard deviation (SD) of the error made by such an estimate for a triangular tuning curve given in equation 5.4, s SD(error) =
4π 3 νmin + a3 (νmax − νmin ) 6(2πνmin + a(νmax − νmin ))
which always has a minimum for 0 < a < π . We show in Figure 2 the SD of the reconstruction error after a single spike as a function of a, for νmax /νmin = 80. It has a minimum for a about 50 degrees, for which the SD of the error is about 35 degrees.
1746
Nicolas Brunel and Jean-Pierre Nadal
The mutual information, on the other hand, is · µ 2 ³ν ´ a 1 νmax max log I= π ν¯ νmax − νmin 2 ν¯ ! 2 ³ν ´ 1 ³ ´ νmin min 2 2 log − ν − νmin + − 2 ν¯ 4 max + (π − a)νmin log
³ν
min
´i
ν¯
where ν¯ = νmin +
a (νmax − νmin ) 2π
It also has a maximum for positive a. The width that maximizes I is different from the width that minimizes the SD of the reconstruction error, as shown in Figure 2. This is the case in general for nongaussian tuning curves. In this case, the half-width maximizing the mutual information is around 20 degrees. Note that in a wide range of a, the first spike brings about 2 bits of information about the stimulus. Thus, a finite optimal a stems from the constraint of already minimizing the error when only a small number of spikes have been emitted by the whole neuronal array. It implies that the largest receptive fields are most useful at very short times when only a rough estimate is possible, while smaller receptive fields will be most useful at larger times, when a more accurate estimate can be obtained. 5.4 Application to the Analysis of Empirical Data. In this section we use the experimental data of Taube et al. (1990) to show how equation 3.9 can be used to estimate both Fisher and mutual information conveyed by large populations of neurons on an angular stimulus (in this case the head direction of a rat). Taube et al. (1990) have shown that in the postsubiculum of rats, tuning curves can be well fitted by triangular tuning curves and that the distribution of preferred orientations is consistent with a uniform distribution. They also determined the distribution of the parameters of the tuning curve, νmax , a and the signal-to-noise ratio (SNR) α = νmax /νmin over the recorded neurons. These data indicate that these parameters have an important variability from neuron to neuron. Equation 5.5, in the case of such inhomogeneities, has to be replaced by tN J (θ) = π
Z
νmax dνmax dadα Pr(νmax , a, α) a
µ
1 1− α
¶ ln α.
(5.8)
in which Pr(νmax , a, α) is the joint probability of parameters νmax , a and α.
Mutual Information, Fisher Information, and Population Coding
1747
Figure 3: (Left) Minimal reconstruction error as given by the Cramer-Rao bound for N = 1000 (full curve), N = 5000 (dashed curve) postsubiculum neurons, using data from Taube et al. (1990) as a function of time. (Right) Mutual information for N = 1000 (full curve), N = 5000 (dashed curve), using the same data and equation 3.10.
Under global constraints, one may expect each neuron to contribute in the same way to the information, that is, (νmax /a)(1 − 1/α) ln α is constant. This would imply that the width a increases with νmax . Figure 9 of Taube et al. (1990) shows that there is indeed a trend for higher firing rate cells to have wider directional firing ranges. We can now insert the distributions of parameters measured in Taube et al. (1990) in equation 5.8 to estimate the minimal reconstruction error that can be done on the head direction using the output of N postsubiculum neurons during an interval of duration t. It is shown in the left part of Figure 3. Since we assume that the number of neurons is large, the mutual information conveyed by this population can be estimated using equation 3.9. It is shown in the right part of the same figure. In the case of N = 5000 neurons, the error is as small as one degree even at t = 10 ms, an interval during which only a small proportion of selective neurons has emitted a spike. Note that one degree is the order of magnitude of the error made typically in perceptual discrimination tasks (see, e.g., Pouget & Thorpe 1991). During the same interval, the activity of the population of neurons carries about 6.5 bits about the stimulus. Doubling the number of neurons or √the duration of the interval divides the minimal reconstruction error by 2 and increases the mutual information by 0.5 bit.
1748
Nicolas Brunel and Jean-Pierre Nadal
6 Conclusion In this article we have exhibited the link between Fisher information and mutual information in the context of neural coding. This link was first derived in the context of Bayesian parameter estimation by Clarke and Barron (1990) and then in the context of stochastic complexity by Rissanen (1996). We have shown that the result of Rissanen applies to population coding— that is, when the number of neurons is very large compared to the dimension of the stimulus. Our derivation of the link uses completely different techniques. The result is that the mutual information between the neural activities and the stimulus is equal to the one between the stimulus and an ideal gaussian unbiased estimator whose variance is equal to the inverse of the Fisher information. The result is true not only for independent observations, but also for correlated activities (see Rissanen, 1996, and the appendix). This is important in the context of neural coding since noise in different cells might in some cases be correlated due to common inputs or to lateral connections. This result implies that in the limit of a large number of neurons, maximization of the mutual information leads to optimal performance in the estimation of the stimulus. We have thus considered the problem of optimizing the tuning curves by maximizing the mutual information over the parameters defining the tuning curves: optimization of the choice of preferred orientations, widths of the tuning curves. In the simple model we have considered, the optimal value for the width is zero, as in Seung and Sompolinsky (1993). However, we have shown that finite size effects necessarily lead to a nonzero optimal value, independent of the decoding scheme. We have discussed in detail the case of a one-dimensional stimulus (an angle). A similar relationship between mutual information and the Fisher information matrix holds for any dimensionality of the stimulus, as long as it remains small compared to the number of neurons. It would be straightforward to consider in the more general case the optimization of the tuning curves. Zhang, Ginzburg, McNaughton, and Sejnowski (1998) have computed the Fisher information matrix for two- and three- dimensional stimuli. Their results imply that optimal tuning curve parameters will depend strongly on the dimensionality of the stimulus. We have briefly discussed the cases of a finite number of neurons and the short time limit. In this case maximization of the mutual information leads in general to different results than does minimization of the variance of reconstruction error, as found also in networks with the same number of input and output continuous neurons (Ruderman, 1994). We are currently working on these limits for which many aspects remain to be clarified. We have not addressed the problem of decoding. In the asymptotic limit, the maximum likelihood (ML) decoding is optimal. Recently Pouget and Zhang (1997) showed that a simple recurrent network can perform the computation of the ML estimate. This suggests that the optimal performance,
Mutual Information, Fisher Information, and Population Coding
1749
from the point of view of both information content and decoding, can be reached by a simple cortical architecture. Appendix Our goal is to derive equation 3.10, that is, to compute the mutual information I = I[P, ρ] between the random variables xE and θ, working in the large N limit. We recall that xE can be seen as either a set of N observations related to the measurement of an unknown parameter θ or the set of responses of N neurons to a stimulus θ. The mutual information I is defined by Z I=
¿
P(E x|θ) dθρ(θ) ln Q(E x)
À θ
,
where Q(E x) is the p.d.f. of xE: Z Q(E x) = dθρ(θ)P(E x|θ).
(A.1)
(A.2)
In equation A.1, h . iθ denotes the integration over xE given θ with the p.d.f. P(E x|θ). We define G(E x|θ) ≡
1 ln P(E x|θ). N
(A.3)
We will make the following hypothesis: 1. All derivatives of G with respect to the stimulus θ are of order 1 in the large N limit. 2. The cumulants of order n of xG0θ + yG00θ are of order 1/Nn−1 in the large N limit. Both properties are verified for the factorized models (see equations 2.1 and 2.2), but also in some cases in which xi given θ are correlated variables, as we show at the end of the appendix. The large N limit allows us to use the saddle-point method (Bhattacharya & Rao, 1976; Parisi, 1988) for the computation of integrals over θ, in particular for the computation of the p.d.f. Q(E x), using the fact that P(E x|θ ) will appear to be sharply peaked around its most probable value, the maximum likelihood (ML) estimator of θ . We will use standard cumulant expansions for the integration over xE in the equivocation part of I, and this will eventually lead to the announced result, equation 3.10. Distribution of xE. The p.d.f. Q(E x) can be written Z Q(E x) = dθρ(θ) exp NG(E x|θ ).
(A.4)
1750
Nicolas Brunel and Jean-Pierre Nadal
For large N, the integral is dominated by the maxima of the integrand. These are defined by the solutions of x|θ) = 0, G0θ (E
(A.5)
which satisfy G00θ (E x|θ) < 0. Above we have denoted by G0θ (resp. G00θ ) the first (resp. second) partial derivative of G with respect to θ . Let us assume that x). The Taylor expansion around G(E x|θ) has a single global maximum at θm (E x) is θm (E 1 x)) + G00θ (E x|θm (E x))(θ − θm (E x))2 + . . . G(E x|θ) = G(E x|θm (E 2 Using standard saddle-point techniques we find, µ µ ¶¶ 1 , x) 1 + O Q(E x) = Qm (E N
(A.6)
with s x) ≡ ρm (E x) Qm (E
2π , exp [NGm (E x)] , N|0(E x)|
(A.7)
where x) ≡ ρ(θm (E x)), ρm (E
(A.8)
x) ≡ G(E x|θm (E x)), Gm (E
(A.9)
x|θm (E x)). 0(E x) ≡ G00θ (E
(A.10)
and
x) is the ML estimator of θ. Note that θm (E The Mutual Information: Integration over θ . Let us start with the following expression of the mutual information: Z Z Z N x) dθ Q(θ|E x) ln Q(θ|E x), I = − dθρ(θ) ln ρ(θ) + d x Q(E |θ )ρ(θ ) . The first term is the entropy of the input distribuwith Q(θ|E x) = P(ExQ(E x) tion. The second term can be written Z Z Z x) ln Q(E x) + dN x dθ P(E x|θ )ρ(θ ) ln P(E x|θ )ρ(θ ). (A.11) − dN x Q(E
Mutual Information, Fisher Information, and Population Coding
1751
In the above expression, the first part is the entropy of xE in which we can x) as given in equation A.7, leading to replace Q(E x) by Qm (E ¸ · Z x)| 1 N|0(E . x) NGm + ln ρm − ln − dN x Qm (E 2 2π The last term in equation A.11 can be written as Z Z dθ A(E x|θ) exp A(E x|θ ), dN x with A(E x|θ) ≡ ln P(E x|θ)ρ(θ). Now
Z
Z
dθ exp λA|λ=1 ,
dθ A(E x|θ) exp A(E x|θ) = ∂λ
which is again computed with the saddle-point method, Z
s
¯ ¯ 2π ¯ exp λ [NGm + ln ρm ] ¯ dθ A(E x|θ) exp A(E x|θ) = ∂λ ¯ λN|0(E x)| λ=1 ¸ · 1 . = Qm NGm + ln ρm − 2
Finally, putting everything together, the mutual information can be written as Z I = − dθρ(θ) ln ρ(θ) s ¶ µ Z x)| 1 N|0(E 2π N exp [NGm (E ln . (A.12) x)) x)] + d x ρ(θm (E N|0(E x)| 2 2π e It is interesting to compare equations 3.10 and A.12. As in equation 3.10, R the first term above is the entropy H[θ ] = − dθρ(θ) ln ρ(θ) of the stimulus distribution; the second term, the equivocation, is given in equation A.12 by the average over the p.d.f. of uE of the logarithm of the variance of the estimator. The Mutual Information: Integration over xE. The last difficulty is to perform in equation A.12 the trace on xE. One cannot apply the saddle-point method directly because the number of integration variables is precisely equal to the number N that makes the exponential large. However, the difficulty is circumvented by the introduction of a small (compared to N) auxiliary integration variables, in such a way that the integration over the
1752
Nicolas Brunel and Jean-Pierre Nadal
xi ’s can be done exactly. Then we again use the fact that N is large to perform the integration over the auxilary variables to leading order in N. First we use the relation Z ¡ ¢ x)) = dθF(θ)|G00θ (E x|θ )|δ G0θ (E x|θ ) F(θm (E x), which is valid for an arbitrary function F. We in order to deal with θm (E then use an integral representation of the delta function: Z ¡ ¢ ¢ ¡ dy exp iyG0θ (E x|θ) = x|θ ) . δ G0θ (E 2π x|θ ), we introduce conjugate variables τ , Similarly, in order to deal with G00θ (E τˆ . For any function F we can write Z ¡ ¢ 1 F(τ ) exp iτˆ (τ − G00θ (E x|θ)) = dτ dτˆ x|θ )) . F(G00θ (E 2π Putting everything together, we get √ |τ |
Z I = H[θ] +
dθdydτ dτˆ √ ρ(θ) 3 N(2π ) 2 µ ¶¶ µ ¡ ¢ N|τ | 1 ln exp iτˆ τ + K(θ, y, τˆ ) , × 2 2π e
in which
¿
K(θ, y, τˆ ) = ln (recall that h. . .iθ = expansion
R
(A.13)
¶À µ ∂G(E x|θ ) x|θ ) ∂ 2 G(E + iy exp −iτˆ ∂θ 2 ∂θ θ dN x exp[NG(E x|θ )] . . .). We now make the cumulant
µ ¶ ® 1 D E exp A θ = exp hAiθ + ( A2 − hAi2θ ) + · · · θ 2
for
A ≡ −iτˆ G00θ + iyG0θ .
(A.14)
The cumulant expansion will be valid if the cumulants of order n of A with the law exp[NG(E x|θ)] decrease sufficiently rapidly with n. A sufficient condition is assumption: the cumulants of order n of A (n = 1, 2, . . .) are of order 1/Nn−1 .
(A.15)
Mutual Information, Fisher Information, and Population Coding
1753
Using the following identities obtained by deriving twice 1 = h1iθ with respect to θ, ® 0 = G0θ θ D¡ ¢ E ® 2 0 = G00θ θ + N G0θ , θ
one gets K = iτˆ J −
µ ¶ y2 J yτˆ Z 1 τˆ 2 12 − + +O 2N 2N N N2
(A.16)
where J, 1, Z are given by D¡ ¢ E ® 2 J ≡ − G00θ θ = N G0θ θ ³D¡ ¢ E 00 ®2 ´ 2 00 2 1 ≡ N Gθ − Gθ θ 0 00 ® θ Z ≡ N G θ Gθ θ . Note that the Fisher information J (θ) is equal to N J, and that 12 and Z are of order 1 because of the assumption A.15. In these terms we have √ µ ¶¶ µ Z N|τ | 1 |τ | ln ρ(θ) I = H[θ] + dθdydτ dτˆ √ 3 2 2π e N(2π ) 2 µ ¶¶ µ 2 2 2 y J yτˆ Z 1 τˆ 1 − + +O . exp iτˆ (τ + J) − 2N 2N N N2 Our last task is to integrate over the remaining auxiliary variables τ , τˆ , 2 y. Using the fact that 12 − ZJ > 0, deduced from the Schwartz inequality, 00 00 2 < G0θ (G00θ − < G00θ >) >2 ≤ < G02 θ >< (Gθ − < Gθ >) >,
the integrations over y and τˆ are simple gaussian integrations, leading to: Z I = H[θ] + dθρ(θ) v s à ! µ ¶ Z u N|τ | N (τ + J)2 |τ | 1 dτ u N ln exp − . × √ t 2 J 2 2π e 2 12 − Z2 2π 12 − ZJ J The integration over τ is with a gaussian weight centered at τ = −J and with a width going to zero as N goes to infinity: v u N (τ + J)2 1 u N exp − = δ(τ + J). lim √ t 2 N→∞ 2 12 − Z2 2π 12 − Z J
J
1754
Nicolas Brunel and Jean-Pierre Nadal
Using the fact that the Fisher information is J (θ ) = NJ, we obtain Z
Z dθρ(θ) ln ρ(θ) −
I=−
µ ¶ 1 2π e (1 + O(1/N)), (A.17) dθρ(θ) ln 2 J (θ )
which is the announced result (equation 3.10). The conditions (see A.15) of validity of the calculation are satisfied when xi given θ are independent, as in equations 2.1 and 2.2, but can also be satisfied when they are correlated. We discuss below these two cases. Conditional Independence of Activities. In the case of independent neurons, the model in equation 2.2, one can easily check that the cumulant expansion at order n gives terms of order 1/Nn−1 . Indeed, in that case, one has G(E x|θ) =
1 X gi (xi |θ), N i
(A.18)
so that A=
∂ 2 gi (xi |θ ) ∂gi (xi |θ ) 1 X . Ai , with Ai = −iτˆ + iy N i ∂θ 2 ∂θ
(A.19)
The cumulant expansion then reads
À ¿ Ai log exp exp A = exp N i ¶ Xµ 1 1 D 2E 2 3 hAi i + 2 ( Ai − hAi i ) + O(1/N ) . = exp N N i ®
X
(A.20)
Thus equation A.16 holds, with J, 1, Z given by *µ ¶ + ¿ À 1 X ∂gi 2 1 X ∂ 2 gi = J=− N i ∂θ 2 θ N i ∂θ θ Ã*µ + ! ¿ ¶ À 2 2 2 2 ∂ gi 1 X ∂ gi 12 = − N i ∂θ 2 ∂θ 2 θ θ ¿ À 1 X ∂gi ∂ 2 gi . Z= N i ∂θ ∂θ 2 θ
(A.21)
Correlated Neurons. The conditions on the cumulants of A, A.15, do not imply that the xi are independent, but they do have the qualitative meaning that they convey of order N independent observations. To see this, we
Mutual Information, Fisher Information, and Population Coding
1755
give an example of correlated activities for which the conditions are satisfied. We consider the following simple model. Each xi can be expressed in terms of the same N independent random variables, ξa , a = 1, . . . , N, as xi =
X
Mi,a ξa .
(A.22)
a
where M is a θ-independent invertible matrix, and the ξ ’s are, given θ, statistically independent variables of arbitrary p.d.f. ρa (ξ |θ ), a = 1, . . . , N. The factorized case is recovered for M diagonal. In the case where the ρ’s are gaussian and M is orthogonal, equation A.22 is the principal component decomposition of the x’s. We show now that the case M invertible with arbitrary ρ’s satisfies the conditions A.15. First, it is obvious that the result (see equation 3.10) holds: with the change of variables xE → M−1 xE = ξE , one recovers the case of independent (given θ ) activities. One can then apply equation 3.10 to I(θ, ξE ). Since P(ξE |θ ) = P(E x|θ)| det M|, with M independent of θ , I(θ, xE) = I(θ, ξE ) and the Fisher information associated to P(ξE |θ ) is equal to the one associated to P(E x|θ ), so that equation 3.10 holds for I(θ, xE). Second, one can check directly that the conditions A.15 hold. For our model, G is ! Ã X ¯ 1 X 1 −1 ¯ ln ρa Ma,i xi θ , (A.23) G(E x|θ) = − ln | det M| + N N a i x|θ) and G00θ (E x|θ ) with respect to the pdf P(E x|θ ) are so that the cumulants of G0θ (E equal to the cumulants of G0θ (ξE |θ ) and G00θ (ξE |θ ) with respect to the factorized Q pdf P(ξE |θ) = a ρa (ξ |θ) for which A.15 holds. Acknowledgments We thank Alexandre Pouget and Sophie Deneve for an interesting discussion, and Sid Wiener for drawing the data of Taube et al. (1990) to our attention. We are grateful to Alexandre Pouget and Peter Latham for pointing out a mistake in an earlier version of the article, and to the referees for comments that helped us to improve the article significantly. References Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251. Barlow, H. B., Kaushal, T. P., & Mitchison G. J. (1989). Finding minimum entropy codes. Neural Comp., 1, 412–423. Bhattacharya, R. N., & Rao, R. R. (1976). Normal approximation and asymptotic expansions. New York: Wiley.
1756
Nicolas Brunel and Jean-Pierre Nadal
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857. Blahut, R.E. (1988). Principles and practice of information theory. Reading, MA: Addison-Wesley. Clarke, B. S., & Barron, A. R. (1990) Information theoretic asymptotics of Bayes methods. IEEE Trans. on Information Theory, 36, 453–471. Cover, T. M., & Thomas, J. A. (1991). Information theory. New York: Wiley. Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2, 1527–1537. Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York: Wiley. Laughlin, S. B. (1981). A simple coding procedure enhances a neuron’s information capacity. Z. Naturf., C36, 910–912. Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105– 117. Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiol., 49, 1127–1147. Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581. Parisi, G. (1988). Statistical field theory, Reading, MA: Addison-Wesley. Pouget, A. & Thorpe, S. J. (1991). Connexionist models of orientation identification. Connection Science, 3, 127–142. Pouget, A. & Zhang, K. (1997). Statistically efficient estimations using cortical lateral connections. In M. C. Moza, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 97–103). Cambridge, MA: MIT press. Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Trans. on Information Theory, 42, 40–47. Ruderman, D. (1994). Designing receptive fields for highest fidelity. Network, 5, 147–155. Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neural population codes. P.N.A.S. USA, 90, 10749–10753. Shannon, S. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press. Snippe, H. P. (1996). Parameter extraction from population codes: A critical assesment. Neural Comp., 8, 511–529. Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334. Stein, R. (1967). The information capacity of nerve cells using a frequency code. Biophys. J., 7, 797–826. Taube, J. S. (1995). Head direction cells recorded in the anterior thalamic nuclei of freely moving rats. J. Neurosci., 15, 70–86. Taube, J. S., Muller, R. U., & Ranck, J. B. (1990). Head direction cells recorded from the postsubiculum in freely moving rats. I. Description and quantitative analysis. J. Neurosci., 10, 420–435.
Mutual Information, Fisher Information, and Population Coding
1757
Van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMCS, and experimental validation. J. Comp. Physiology A, 171, 157–170. Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: A unified framework with application to hippocampal place cells J. Neurophysiol., 79, 1017–1044. Received November 3, 1997; accepted February 20, 1998.
LETTER
Communicated by P. Reed Montague
Synaptic Pruning in Development: A Computational Account Gal Chechik Isaac Meilijson School of Mathematical Sciences, Tel-Aviv University, Tel Aviv 69978, Israel
Eytan Ruppin Schools of Medicine and Mathematical Sciences, Tel-Aviv University, Tel Aviv 69978, Israel
Research with humans and primates shows that the developmental course of the brain involves synaptic overgrowth followed by marked selective pruning. Previous explanations have suggested that this intriguing, seemingly wasteful phenomenon is utilized to remove “erroneous” synapses. We prove that this interpretation is wrong if synapses are Hebbian. Under limited metabolic energy resources restricting the amount and strength of synapses, we show that memory performance is maximized if synapses are first overgrown and then pruned following optimal “minimal-value” deletion. This optimal strategy leads to interesting insights concerning childhood amnesia. 1 Introduction One of the fundamental phenomena in brain development is the reduction in the amount of synapses that occurs between early childhood and puberty. In recent years, many studies have investigated the temporal course of changes in synaptic density in primates, revealing the following picture. Beginning at early stages of fetal development, synaptic density rises at a constant rate, until a peak level is attained (at 2–3 years of age in humans). Then, after a relatively short period of stable synaptic density (until the age of 5 in humans), an elimination process begins: synapses are being constantly removed, yielding a marked decrease in synaptic density. This process proceeds until puberty, when synaptic density stabilizes at adult levels, which are maintained until old age. The peak level of synaptic density in childhood is 50% to 100% higher than adult levels, depending on the brain region. The phenomenon of synaptic overgrowth and pruning was found in humans (Huttenlocher, 1979; Huttenlocher, De Courten, Garey, & Van der Loos, 1982; Huttenlocher & De Courten, 1987), as well as in other mammals such as monkeys (Eckenhoff & Rakic, 1991; Bourgeois & Rakic, 1993; Bourgeois, 1993; Rakic, Bourgeois, & Goldman-Rakic, 1994), cats (Innocenti, 1995) and rats (Takacs & Hamori, 1994). It was observed throughout c 1998 Massachusetts Institute of Technology Neural Computation 10, 1759–1777 (1998) °
1760
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
widespread brain regions including cortical areas: visual (Bourgeois & Rakic, 1993; Huttenlocher et al., 1982), motor and associative (Huttenlocher, 1979), the cerebellum(Takacs & Hamori, 1994), projection fibers between hemispheres (Innocenti, 1995), and the dentate gyrus (Eckenhoff & Rakic, 1991). The time scale of synaptic elimination was found to vary by cortical area, coarsely following a dorsal to frontal order (Rakic et al., 1994). The changes in synaptic density are not a result of changes in total brain volume, but reflect true synaptic elimination (Rakic et al., 1994). In some cases, synaptic elimination was shown to be correlated with experience-dependent activity (Stryker, 1986; Roe, Pallas, Hahm, & Sur, 1990). What advantage could such a seemingly wasteful developmental strategy offer? Some researchers have treated the phenomenon as an inevitable result of synaptic maturation lacking any computational significance. Others have hypothesized that synapses that are strengthened at an early stage might be later revealed as harmful to overall memory function, when additional memories are stored. Thus, they claim, synaptic elimination may reduce the interference between memories, and yield better overall performance (Wolff, Laskawi, Spatz, & Missler, 1995). This article shows that in associative memory networks models, these previous explanations do not hold, and it puts forward a different explanation. Our proposal is based on the assumption that synapses are a costly resource whose efficient utilization is a major optimization goal guiding brain development. This assumption is motivated by the observation that the changes in synaptic density along brain development are highly correlated to the temporal course of changes in energy consumption (Roland, 1993) and by the fact that the brain consumes a large fraction of total energy consumption of the resting adult (Roland, 1993). By analyzing the network’s performance under various synaptic constraints such as limited number of synapses or limited total synaptic strength, we show that if synapses are properly pruned, the performance decrease due to synaptic deletion is small compared to the energy saving. Deriving optimal synaptic pruning strategies, we show that efficient memory storage in the brain requires a specific learning process characterized by initial synaptic overgrowth, followed by judicious synaptic pruning. The next section describes the models studied and our analytical results, which are verified numerically in section 3. Section 4 discusses the possible benefits of efficient synaptic elimination and its implications for the phenomenon of childhood amnesia. 2 Analytical Results In order to investigate synaptic elimination, we address the more general question of optimal modification of a Hebbian memory matrix. Given previously learned Hebbian synapses, we apply a function that changes the synaptic values and investigate the effect of such a modification function.
Synaptic Pruning in Development
1761
First, we analyze the way memory performance depends on a general synaptic modification function. Then we derive optimal modification functions under different constraints. Finally, we calculate the dependency of performance on the deletion levels. 2.1 The Models. We investigate synaptic modification in two Hebbian models. The first model is a variant of the canonical Hopfield model. M memories are stored in an N-neuron network, forming approximate fixed points of the network dynamics. The initial synaptic efficacy Wij between the jth (presynaptic) neuron and the ith (postsynaptic) neuron is M 1 X µ µ ξi ξj , Wij = √ M µ=1
1 ≤ i 6= j ≤ N;
Wii = 0,
(2.1)
M are ±1 binary patterns representing the stored memories. where {ξ µ }µ=1 The actual synaptic efficacy Jij is
Jij = g(Wij ) 1 ≤ i 6= j ≤ N;
Jii = 0,
(2.2)
where g is a general modification function over the Hebbian weights, such that g(z) has finite moment if z is normally distributed. The updating rule for the state Xit of the ith neuron at time t is Xit+1 = θ( fi ),
fi =
N X
Jij Xjt ,
(2.3)
j=1
where fi is the neuron’s input field, and θ is the function θ ( f ) = sign( f ). The overlap mµ (or similarity) between the network’s activity pattern X and the PN µ ξj Xj . memory ξ µ is mµ = N1 j=1 The second model is a variant of the low-activity, biologically motivated model described by Tsodyks and Feigel’man (1988), in which synaptic efficacies are described by à Jij = g(Wij ) = g
1
√ p(1 − p) M
M X
µ (ξi µ=1
! −
µ p)(ξj
− p) ,
(2.4)
where ξ µ are {0, 1} memory patterns with coding level p (fraction of firing neurons), and g is a synaptic modification function. The updating rule is similar to equation 2.3, Xit+1 = θ( fi ),
fi =
N X j=1
Jij Xjt − T ,
(2.5)
1762
Gal Chechik, Isaac Meilijson, and Eytan Ruppin 1+sign( f )
where θ now denotes the step function θ ( f ) = , and T is the neuronal 2 threshold, set to its optimal value (see equation A.9). The overlap mµ in this model is defined by mµ =
N X 1 µ (ξ − p)Xj . Np(1 − p) j=1 j
2.2 Pruning Does Not Improve Performance. To evaluate the impact of synaptic pruning on the network’s retrieval performance, we study its effect on the signal-to-noise ratio (S/N) of the neuron’s input field in the modified Hopfield model (see equations 2.2 and 2.3). The S/N is known to be the primary determinant of retrieval capacity (ignoring higher-order correlations in the neurons input fields) (Meilijson & Ruppin, 1996) and is calculated by analyzing the moments of the neuron’s field. The network is started at a state X with overlap mµ with a specific memory ξ µ ; the overlap µ µ
ξi ξ
with other memories is assumed to be negligible. Therefore Wij − √ j is the M sum of M − 1 independent variables with zero expectation and standard √ −x2 /2 variation M and is distributed N(0, 1). Denoting φ(x) = e√ , we use the 2π fact that φ 0 (x) = −xφ(x) and write h i £ ¤ £ µ¤ µ E fi |ξi = NE g(Wij )Xj = Nmµ E g(Wij )ξj =
(2.6)
i i 1 h 1 h µ µ = Nmµ E g(Wij )|ξj = +1 − Nmµ E g(Wij )|ξj = −1 = 2 2 µ Z ∞ µ ¶ ξi 1 dWij − g(Wij )φ Wij − √ = Nmµ 2 −∞ M µ Z µ ¶ ξi 1 ∞ dWij ≈ g(Wij )φ Wij + √ −Nmµ 2 −∞ M · ¸ Z µ ξi 1 ∞ g(Wij ) φ(Wij ) − √ φ 0 (Wij ) dWij − ≈ Nmµ 2 −∞ M · ¸ Z ∞ µ ξi 1 g(Wij ) φ(Wij ) + √ φ 0 (Wij ) dWij = −Nmµ 2 −∞ M Z ∞ µ ξi g(Wij ) √ Wij φ(Wij )dWij = = Nmµ M −∞ µ £ ¤ ξi = Nmµ √ E zg(z) M
where z is a random variable with standard normal distribution. The vari-
Synaptic Pruning in Development
1763
ance of the field is similarly calculated to be h i µ V( fi |ξi ) = NE g2 (z) .
(2.7)
Hence the S/N is ¤ £ ¤ r £ ¤ £ E fi |ξ µ = +1 − E fi |ξ µ = −1 N µ E zg(z) q £ m q £ = S/N = ¤. µ¤ M V fi |ξi E g2 (z)
(2.8)
g(z) As z has standard normal distribution, E(z2 ) = V(z) = 1. Assuming £ ¤ is antisymmetric (or at least has zero expectation) we can use V g(z) = £ 2 ¤ E g (z) and write 1 S/N = √ mµ ρ(g(z), z) , α
(2.9)
where α = M/N is the memory load and ρ denotes the correlation coefficient. The S/N is thus a product of independent terms of the load, the initial overlap, and a correlation term that depends on the modification function only. The S/N calculation for the low-activity model is similar but more cumbersome and is described in appendix A. The resulting S/N can again be separated into a similar product of independent terms, 1 S/N = √ mµ ρ(g(z), z). pα
(2.10)
In both models, the only effect of the modification function g on the S/N is through the correlation coefficient; hence, the behavior of the two different models under synaptic modification can be investigated by analyzing ρ(g(z), z) only, regardless of the other parameters.1 The immediate consequence of equations 2.9 and 2.10 is that there is no local synaptic modification function that can improve the performance of the Hebbian network, since ρ has values in the range [−1, 1], and the identity function g(z) = z already gives the maximal possible value of ρ = 1. In particular, no deletion strategy can yield better memory performance than the intact network. A similar result was previously shown by Sompolinsky (1988) in the Hopfield model. However, the use here of signal-to-noise analysis enables us to proceed and derive optimal functions under different constraints on modification functions and evaluate the performance of various modification functions. 1 These results and their following consequences remain valid even when the initial synaptic weights matrix is nonzero but has some gaussian distributed noise, because such a noise can be viewed as additional prestored memories.
1764
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
When no constraints are involved, pruning has no beneficial effect. However, since synaptic activity is strongly correlated with energy consumption in the brain, its resources may be inherently limited in the adult, and synaptic modification functions should satisfy various synaptic constraints. The following two subsections study deletion under two different synaptic constraints: limited number of synapses and limited total synaptic efficacy.2 2.3 Optimal Modification with Limited Number of Synapses. In this section we find the optimal synaptic modification strategy when the number of synapses is restricted. The analysis consists of the following stages. First, we show that under any deletion function, the remaining weights’ efficacies should not be changed. Second, we show that the optimal modification function satisfying this rule is minimal-value deletion. Finally, we calculate the S/N and memory capacity of networks deleted with this strategy as a function of the deletion level. Let gA be a piece-wise equicontinuous deletion function, which zeroes all weights whose values are not in some set A and possibly modifies the remaining weights. To find the best modification function over the remaining £ ¤ q £ 2 ¤ weights, we should maximize ρ(gA (z), z) = E zgA (z) / E gA (z) , which £ 2 ¤ is invariant to scaling. Therefore, R we keep E gA (z) fixed and look for a gA £ ¤ that maximizes E zgA (z) = A zg(z)φ(z). Using the Lagrange method we write (as in Meilijson & Ruppin, 1996), µZ
Z
¶ g2 (z)φ(z)dz − c1
zg(z)φ(z)dz − γ A
(2.11)
A
for some constant c1 . Denoting gi = g(zi ) we approximate equation 2.11 by X
à zi gi φ(zi ) − γ
{i|zi ∈A}
X
{i|zi ∈A}
! g2i φ(zi )
−
c01
.
(2.12)
zi , ∀zi ∈ A; hence, g Differentiating with respect to gi yields that gi = 2γ is linear homogeneous in z. We conclude that the optimal function should leave the undeleted weights unchanged (except for arbitrary linear scaling). To find the weights that should be deleted, we write the deletion function as gA (z) = zRA (z), where
½ RA (z) =
R2A (z)
=
1 0
when z ∈ A otherwise.
2 We do not derive general optimal synaptic matrices, but optimal modifications of previously learned Hebbian synapses. A study of the former can be found in Bouten, Engel, Komoda, and Serneel (1990).
Synaptic Pruning in Development
1765
Figure 1: Different synaptic modification strategies. (a) Minimal value deletion: g(z) = z for all |z| > t, and zero otherwise (see equation 2.13). (b) Clipping: g(z) = sign(z) for all |z| > t and zero otherwise. (c) Compressed deletion: g(z) = z − sign(z)t for all |z| > t and zero otherwise (see equation 2.16).
£ ¤ £ ¤ Since zgA (z) = z2 RA (z) = g2A (z), then E zgA (z) = E g2A (z) and ρ(gA (z), z) qR R z2 RA (z)φ(z)dz. Given a constraint A φ(z)dz = const which holds the = R number of synapse fixed, the term A z2 φ(z)dz is maximized when A supports the larger values of |z|. To summarize, if some fraction of the synapses are to be deleted, the optimal (“minimal value”) pruning strategy is to delete all synapses whose magnitude is smaller than some threshold and leave all others intact, as illustrated in Figure 1a. To calculate ρ(g(z), z) as a function of the deletion level, let gt (z) = zRt (z),
where Rt (z) = R{s|t<|s|} (z),
(2.13)
and t is the threshold below which weights are removed. Using the fact that φ 0 (z) = −zφ(z) and integrating by parts, we obtain Z ∞ h i Z ∞ ¤ £ z2 Rt (z) φ(z)dz = 2 z z φ(z)dz E zgt (z) = E g2t (z) = −∞ t ¤ £ (2.14) = 2 8∗ (t) + tφ(t) and ρ(gt (z), z) =
p 2tφ(t) + 28∗ (t) ,
(2.15)
where 8∗ (t) = P(z > t) is the standard normal tail distribution function. 2.4 Optimal Modification with Restricted Overall Synaptic Strength. As synapses differ by their strength, a possible different goal may be implied
1766
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
by the energy consumption constraints, minimizing the overall synaptic strength in the network. WeRthus wish to maximize the S/N while keeping the total synaptic strength |g(z)| fixed. Using the Lagrange method (the full derivation is brought in the appendix in section A.4), we find that the optimal modification function is z>t z − t when 0 when |z| < t (2.16) gt (z) = z + t when z < −t, that is, the absolute value of all synapses with magnitude above some threshold t is reduced by t, and the rest are eliminated. We denote this modification function “compressed deletion” (see Figure 1c). The S/N under this strategy is calculated using the function Rt (z) described above (see equation 2.13) and then writing g(z) = (z − t)Rt (z) for positive z values. The calculation is done similarly to equation 2.14, yielding 28∗ (t) . ρ(g(z), z) = p 2(1 + t2 )(8∗ (t) − 2tφ(t))
(2.17)
3 Numerical Results To evaluate quantitatively the performance gain achieved by the strategies described in the previous section, the network’s performance is measured by calculating the memory capacity of the network as a function of synaptic deletion levels. The capacity is measured as the maximal number of memories that can be stored in the network and retrieved almost correctly µ (mµ ≥ 0.95), starting from patterns with an initial overlap of m0 = 0.8. Simulations shown below are for the more biologically plausible low-activity network, with N = 800 neurons and coding level p = 0.1; similar results were obtained with Hopfield model simulations. The analytic curves were calculated using the overlap equation (see equations A.6), which enables calculating the overlap after a single step for any memory load by finding the highest load that yields an overlap mµ ≥ 0.95. Simulation results are reported for both a single iteration and for 10 iterations. Figure 2 compares the capacity obtained with three modification strategies: minimal value deletion (see equation 2.13), random deletion (independent of the connections strength), and a clipping deletion. In clipping deletion, all weights with a magnitude below a threshold value are removed, and the remaining ones are assigned a ±1 value, according to their sign (see Figure 1b). Two sets of simulations are presented. The first set was performed with an arbitrary fixed threshold and the second with a threshold optimally tuned for each deletion level.3 Minimal-value deletion is indeed 3
In one-step simulations, the optimal threshold was determined according to equation
Synaptic Pruning in Development
1767
significantly better than the other deletion strategies, but in high-deletion levels, it is almost equaled by the clipping strategy. Figure 3 compares the compressed-deletion modification strategy (see equation 2.16) to random deletion, as a function of the fraction of the total synaptic strength that is deleted from the network. The results show that if a network must be subjected to synaptic deletion, minimal value deletion will minimize the damage, yet deletion reduces performance and is hence unfavorable. We now proceed to show that where the amount of synapses is restricted in the adult organism, an initial overgrowth of synapses followed by deletion is beneficial. Figure 4 compares the memory capacity of networks with the same number of synapses, but with a varying number of neurons. The smallest network (N = 800) is fully connected, while larger networks are pruned by minimal value deletion to end up with the same amount of synapses. The optimal deletion ratio is found around 80% deletion and improves capacity by 45% . This optimally pruned network, which has more neurons, can store three times more information than the fully connected network with the same number of synapses. When the threshold is suboptimal or the energy cost for neurons is nonnegligible, the optimum drifts to a deletion levels of 50% to 60%. The conclusion is that an organism that first overgrows synapses in a large network and then judiciously prunes them can store many more memories than another adult organism that uses the same synaptic resources but settles for the adult synaptic density in infancy. Until now, we have analyzed synaptic deletion of previously established synaptic matrices (storing a given set of memories). We next turn to simulate the continuous process of learning that is superimposed on the profile of synaptic density changes occurring during human development. These changes naturally define a time step equivalent to one year. Within each time step we store some memories and change the synaptic density following the human data. Synapses are incrementally added, increasing synaptic density until the age of 3 years. At the age of 5 years, synaptic pruning begins, lasting until puberty (see the dot-dashed line in Figure 5). Addition of new synapses is done at a constant rate, and synaptic efficacies are determined by the new memories stored in the network. The deletion of synapses is done according to the minimal value deletion strategy. The network is tested for recall of the stored memories twice: once, at the age of 3 years when synaptic density is at its peak, and again at an age of 15 years when synaptic elimination has already removed 40% of the synapses. Figure 5 traces the network’s performance during this experiment, measured by the retrieval acuity (final overlap with the cued memory obtained after one step of the dynamics). It superimposes the synaptic density (dot-
A.9, and in 10-step simulations, the optimal threshold was found numerically to maximize the network’s performance.
1768
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
Figure 2: Capacity of a network with different synaptic modification strategies as a function of the synaptic deletion level. The left column shows results of the low-activity model with a fixed threshold; the results with optimal neural threshold (i.e., threshold that is varied optimally with the deletion level) are shown in the right column. Both analytical and simulation results of singlestep and multiple-step dynamics are presented, showing a qualitatively similar behavior.
Synaptic Pruning in Development
1769
Figure 3: Capacity of a network with different synaptic modification strategies as a function of the fraction of the total synaptic strength removed from the network. The left column shows results of the low-activity model with fixed threshold; results with optimal threshold are shown in the right column.
1770
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
Figure 4: Capacity of networks with different number of neurons but the same total number of synapses as a function of network connectivity. The networks are pruned according to minimal value deletion to keep the total number of synapses (k) constant. Simulation parameters are k = 8002 , p = 0.1, and T is kept optimal.
dashed line) and memory performance data. Two observations should be noted. First is the inverse temporal gradient in the recall performance of memories stored during the synaptic pruning phase. That is, there is a deterioration in the performance of the teen-age network as it recalls more recent childhood memories (see the decline in the dashed line). The second is the marked difference between the ability of the infant network (the solid line) and the teen-age network (the dashed line) to recall memories stored at early childhood. Older networks totally fail to recall any memory before the age of 3 or 4 years, manifesting “childhood amnesia.” 4 Discussion We have analyzed the effect of modifying Hebbian synapses in an optimal way that maximizes memory performance while keeping constant the overall number or total strength of the synapses. The optimal functions found for these criteria use only local information about the synaptic strength, do not depend on the activity level of the network, and are not affected by initial noise in the synaptic matrix. Moreover, they are exactly the same functions in a large family of associative memory networks. We have shown that under a restricted number of synapses, the opti-
Synaptic Pruning in Development
1771
Figure 5: Memory retrieval as a function of storage period. The figure displays both synaptic density and memory performance data. At each time step (“year”), m memories are stored in the network, and the network’s connectivity is changed following human data (dot-dashed line). The network is tested for retrieval twice: in an early (“infant”) stage when network connectivity has reached its peak (solid line), and in a later (“teenage”) phase after more memories have been stored in the network (dashed line). In each such test, we look back in time and examine how well the network currently retrieves memories from its past, as a function of their storage time. Network parameters are N = 800, m = 10, and p = 0.1. The threshold is kept fixed at T = (1/2 − p)p(1 − p).
mal local modification function of a given Hebbian matrix is to delete the small weights and maintain the values of the remaining connections. Under restricted total synaptic strength, the optimal synaptic modification is to delete the small weights and linearly reduce the strength of the remaining synapses. Our results predict that during the elimination phase in the brain synapses undergo weight-dependent pruning in a way that deletes the weak synapses (one should expect that both constraints described above may concomitantly apply in reality, with some relative weighting that would determine the exact desired synaptic modification function). As we have shown, synaptic deletion cannot improve performance of a given network. What then is its role? Several computational answers have been suggested. Some have hypothesized that synaptic elimination can improve network performance, but we prove this argument incorrect in several associative memory models. Others have claimed that the brain can be viewed as a cascade of filters that can be modeled by feedforward networks
1772
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
models (Sharger & Johnson, 1995). In these models it is known that a reduction in the amount of free parameters may improve the ability of the network to generalize if the size of the network is too large (Reed, 1993). This explanation holds when the complexity of the problem is unknown at the time the networks are created (and therefore cannot be preprogrammed genetically) and applies to networks that should generalize well. Another possible argument for justifying synaptic deletion arises if synaptic values are assumed to have ±1 values only (as in the clipping function described above). Under such an assumption (as can be observed in Figure 2), maximal performance is obtained at nonzero deletion levels. However, the biological plausibility of univalued synapses is in doubt. Our proposal is that synaptic overgrowth and deletion emerge because synaptic resources must be scrupulously utilized due to metabolic energy consumption constraints. If we have to use a restricted amount of synapses in the adult, better performance is achieved if the synapses are first overgrown and then cleverly pruned after more memories are stored. The optimally pruned network is not advantageous over the undeleted network (which has many more synapses), but over all other networks, with the same total number of synapses. It should be noted, however, that our results pertain to associative memory networks and that synaptic elimination probably plays different roles in other neural systems. For example, it was shown that in the primary visual cortex, synaptic elimination determines the architecture and the function of the network (Miller, Keller, & Stryker, 1989). Interestingly, although rising from different basic principles, recent studies have found that in the neuromuscular junction, synapses are pruned according to their initial synaptic strength, deleting the weaker synapses (Frank, 1997; Colman, Nabekura, & Lichtman, 1997). In biological networks, synaptic growth and deletion occur in parallel with memory storage. As shown in Figure 5, the implementation of a minimal-value pruning strategy during such process yields two cognitive predictions: one for the rising phase of synaptic density curve and the other for the descending phase. At the descending phase of synaptic density, an inverse temporal gradient is observed. That is, as long as synapses are eliminated, remote memories are easier to recall than recently stored memories (dashed curve in Figure 5). The reason for this inverse gradient is the continuous change in network connectivity: earlier memories are stored in a highly connected network, while memories stored later are engraved into a sparser network. The early memories take a prominent role in determining the synapses pruned by the minimal value algorithm and therefore are only slightly damaged by the synaptic deletion. The more recent memories are engraved in an already deleted network, and hence have little influence on determining which synapses are deleted. From the point of view of recent memories, the network undergoes random deletion. However, adding accumulative noise to the network or assuming synaptic decay damages remote memory retrieval more than recent ones. Therefore, the model pre-
Synaptic Pruning in Development
1773
dicts that the plot of human memory retrieval as a function of storage time within the synaptic elimination period should have a U-shaped form. Interestingly, such a result can be observed in previous studies of long-term memory (Sheingold & Tenney, 1982) but was unnoticed before. A comparison of retrieval quality of early memories by the teenager network versus the infant network (see Figure 5 at the age of 2 years) shows a sharp change in the retrieval quality of early memories. This is reminiscent of the infantile amnesia phenomenon, which is the inability of the adult to recall events from infancy that he could previously recall (Bachevalier, Brickson, & Hagger, 1993; Markievwicz, Kucharski, & Spear, 1986). In our model, this pattern arises from the fact that earlier memories are stored in sparsely connected networks (that is, embedded in less synapses) and hence are more sensitive to the noise known to accumulate in the network as additional memories are stored in it. This scenario may provide a networklevel perspective to infantile amnesia, complementing the previous theories suggesting that maturation of memory-related structures such as the hippocampus is responsible for the amnesia (Nadel, 1986). Synaptic elimination is a broad phenomenon found throughout different brain structures and is not restricted to associative memory areas. We believe that our explanation may be generalized to other network models. For example, feedforward Hebbian projections between consecutive networks share similar properties with a single step of synchronous dynamics of associative memory networks analyzed here. There is also some evidence that synaptic growth followed by deletion is not limited to the developmental stage, but may have a more general scope and can be found in adults (Greenough, Black, & Wallace, 1987). These interesting open issues await further studies in the future. Appendix: Signal-to-Noise Ratio in a Low-Activity Model A.1 Field Moments. The network is initialized with activity p and overµ lap m0 with memory µ. Let ² = P(Xi = 0|ξi = 1) (which implies an initial (1−p−²) overlap of m0 = (1−p) ). We write £ ¤ E( fi |ξi ) = NP(ξj = 1)E g(Wij )|ξj = 1 £ ¤ + NP(ξj = 0)E g(Wij )|ξj = 0 . The first term is calculated as follows: £ ¤ NP(ξj = 1)E g(Wij )|ξj = 1 ! Ã Z (1 − p)(ξi − p) dWij ≈ = Np(1 − ²) g(Wij )φ Wij − √ p(1 − p) M
(A.1)
1774
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
"
# (1 − p)(ξi − p) 0 φ (Wij ) dWij = ≈ Np(1 − ²) g(Wij ) φ(Wij ) − √ p(1 − p) M ¤ £ ¤ (1 − p)(ξi − p) £ E zg(z) = = Np(1 − ²)E g(z) + Np(1 − ²) √ p(1 − p) M £ ¤ £ ¤ (ξi − p) (A.2) = NpE g(z) + Np(1 − p − ²) √ E zg(z) . p(1 − p) M Z
A similar calculation for the second term, where g is antisymmetric, yields E( fi |ξi ) = Np(1 − p − ²)
£ ¤ (ξi − p) √ E zg(z) − T. p(1 − p) M
(A.3)
The variance is calculated following V( fi |ξi ) = NE(g2 (Wij )Xj2 ) − NE2 (g(Wij )Xj ) + N(N − 1)Cov(g(Wij )Xj , g(Wik )Xk ),
(A.4)
in a similar manner to yield E( fi |ξi ) p = V( fi |ξi )
£ ¤ − p)E zg(z) − T q . £ ¤ NpE g2 (z)
√N mo (ξi M
(A.5)
A.2 The Overlap Equation. Given the overlap m0 between the network’s initial state and a pattern ξ µ , we calculate m1 , the overlap in the next step, by 1 N(1 − p)P(ξi = 1)P(Xi = 1|ξi = 1) Np(1 − p) 1 NpP(ξi = 0)P(Xi = 1|ξi = 0) = − Np(1 − p) = P(Xi = 1|ξi = 1) − P(Xi = 1|ξi = 0) = Ã ! Ã ! E( fi |ξi ) E( fi |ξi ) |ξi = 1 − 8 p |ξi = 0 , =8 p V( fi |ξi ) V( fi |ξi )
m1 =
(A.6)
where 8 = 1−8∗ is the standard gaussian cumulative distribution function. A.3 Optimal Threshold. In order to find the threshold that maximizes the overlap, we differentiate m1 (see equation A.6) with respect to T, ∂m1 = ∂T
· ¸ E( f |ξ ) E( f |ξ ) ∂ 8( √ i i |ξi = 1) − 8( √ i i |ξi = 0) V( fi |ξi )
V( fi |ξi )
∂T
= 0,
(A.7)
Synaptic Pruning in Development
1775
which yields £ ¤ £ ¤ √N mo (0 − p)E zg(z) − T − p)E zg(z) − T = φ M q (A.8) q £ ¤ £ ¤ 2 NpE g (z) NpE g2 (z)
φ
√N mo (1 M
and N T= √ M
µ
¶
1 −p 2
£ ¤ m0 E zg(z) .
(A.9)
Using the optimal threshold in equation A.5 yields r S/N =
N 1 m0 √ ρ(gt (z), z). M p
(A.10)
Similarly to the case of Hopfield model, the S/N of the neuron i can be expressed as a product of independent factors: the load M/N, the deletion strategy g, the activity level p, and the activity of the neuron ξi . A.4 Derivation of the Compressed Deletion. Using the Lagrange method, we have Z
µZ
∞
∞
¶ 2
zg(z)φ(z)dz − γ1 g (z)φ(z)dz − c1 −∞ µZ ∞ ¶ |g(z)|φ(z)dz − c2 − γ2 −∞ µZ ∞ ¶ Z ∞ |z||g(z)|φ(z)dz − γ1 |g(z)|2 φ(z)dz − c1 = −∞ −∞ µZ ∞ ¶ |g(z)|φ(z)dz − c2 , − γ2
−∞
(A.11)
−∞
which is approximated by X i
|zi ||gi |φ(zi )−γ1
à X i
! |gi |
2
φ(zi )−c01
−γ2
à X
! |gi |φ(zi )−c02
.
(A.12)
i
Assuming g(z) to be piece-wise equicontinuous and equating to zero the
1776
Gal Chechik, Isaac Meilijson, and Eytan Ruppin
derivative with respect to |gi | we obtain |zi |φ(zi ) − γ1 2|gi |φ(zi ) − γ2 φ(zi ) = 0 or |g(z)| =
1 (|z| − γ2 ) , 2γ1
(A.13)
(A.14)
from where z−t 0 gt (z) = z+t
when when when
z>t |z| < t z < −t.
(A.15)
Received June 25, 1997; accepted November 14, 1997.
LETTER
Communicated by Bruno Olshausen
Spatial Decorrelation in Orientation-Selective Cortical Cells Alexander Dimitrov Jack D. Cowan Department of Mathematics, University of Chicago, Chicago, IL 60637, USA
We propose a model for the lateral connectivity of orientation-selective cells in the visual cortex. We study the properties of the input signal to the visual cortex and find new statistical structures that have not been processed in the retino-geniculate pathway. Using the idea that the system performs redundancy reduction of the incoming signals, we derive the lateral connectivity that will achieve this for a set of orientation-selective local circuits, as well as the complete spatial structure of a network composed of such circuits. We compare the results with various physiological measurements.

1 Introduction

In recent years much work has been done to determine how the structure of the visual system reflects properties of images (Atick & Redlich, 1990, 1992; Attneave, 1954; Barlow, 1961, 1989). Based on the statistics of natural scenes measured by Field (1987), which were later extended by Ruderman and Bialek (1993), Atick and Redlich (1990, 1992) developed Barlow's (1961) hypothesis, that one of the tasks of early vision is to reduce the redundancy of input signals, into a theory whose results agree qualitatively with numerous physiological and psychophysical experiments. Their theory was strengthened by research suggesting that such structures can develop via simple correlation-based learning mechanisms (Atick & Redlich, 1993; Dong, 1994). Atick and Li (1994) later used arguments about signal compression to conclude that further higher-order redundancy reduction of the luminosity field in the visual system is unlikely.

In this article, we apply similar ideas to a different input signal, which is readily available to the system and whose statistical properties are neglected in the analysis of the luminosity signal. We note that after the application of the retinal "mexican hat" filter, the most obvious salient features left in the signal are sharp changes in luminosity, that is, object contours. Atick and Redlich (1992) showed that the retinal filter is optimized to decorrelate the incoming luminosity signal based on its two-point correlation function. Contours, as we show here, have correlations that are very different from the luminosity autocorrelation of natural images (Field, 1987), for which the retinal filter is optimal. We know that this
signal is projected to a set of overlapping orientation-sensitive units in V1 for each distinct retinal position, thereby introducing new redundancy in the representation. Thus, the necessity for compression and the use of factorial codes arises once again.

Since object contours are defined by sharp changes in the luminosity field, we can use a derivative operation to pick up the pertinent structure. Indeed, if we look at the gradient of the luminosity as a vector field, its magnitude at a point is proportional to the change of luminosity, so that a large magnitude signals the possible presence of a discontinuity in the luminosity profile. Moreover, in two dimensions, the direction of the gradient vector is perpendicular to the orientation of the possible local edge, whose presence is given by the magnitude. These properties define a one-to-one correspondence between large gradients and local edges.

The structure of the network we use for processing this signal reflects what is known about the structure of V1. We select as our system a layer of orientation-sensitive cells that are laterally connected to one another, each receiving oriented input from the previous layer. We assume that each unit receives as input the absolute value of the directional derivative of the luminosity signal along the preferred visuotopic axis of the cell. The natural development of units with oriented receptive fields, approximating directional derivatives, has been proposed in many theoretical works (Linsker, 1986; Hancock, Baddeley, & Smith, 1992; Olshausen & Field, 1996; Bell & Sejnowski, 1997), and of course there is ample experimental evidence for such units (Hubel & Wiesel, 1961). Our assumption implies that locally, the input to a cell is proportional to |cos θ|, where θ is the angle between the unit's preferred orientation and the local gradient (edge). Thus, each unit receives a broadly tuned signal, with a half-width at half-height of approximately 60 degrees. With this feedforward structure, the assumption that the system acts to decorrelate such inputs suggests a way to calculate the lateral connections that will perform this task. The calculation, and a further study of the statistical properties of the input, are the topics of this article.

2 Mathematical Model

We introduce a general model for a laterally connected network with a feedforward input V(x), which is attributed to signals from the environment. We shall consider particular forms of V later. There are long-range correlations in the inputs to the network due to both the statistical structure of natural images and the structure of the input. The simplest are captured in the two-point correlation matrix R(x1, x2) = ⟨V(x1) ∘ V(x2)⟩, where the averaging is done across images. The operation ∘ is the appropriate operation for the averaged object: ordinary multiplication when V is a scalar, and a direct or outer product when V is an N-dimensional vector field, giving an N × N matrix (tensor) field. In this case R is a block matrix, with an N × N matrix at each spatial position (x1, x2).
We formulate the problem in terms of a recurrent kernel W, so that O = V + W ∗ O.
(2.1)
This equation describes the linearized steady state of the dynamical system Ȯ = −O + σ[W ∗ O + V], which is one form of the Wilson-Cowan equations (Wilson & Cowan, 1973). The above recurrent system is a Fredholm equation of the second type, and it has a solution for O, provided V is not an eigenfunction of W, in the form O = (δ − W)⁻¹ ∗ V = K ∗ V. This suggests that there is an equivalent feedforward system with a transfer function K = (δ − W)⁻¹, and we can consider only such systems. The corresponding feedforward system is a linear system that acts on the input V(x) to produce an output O(x) = (K · V)(x) ≡ ∫ K(x, y) · V(y) dy.

If we use Barlow's redundancy reduction hypothesis (Barlow, 1961, 1989), this system should decorrelate the output signal. This is achieved by requiring that

$$\delta(x_1 - x_2) \sim \langle O(x_1) \circ O(x_2)\rangle = \langle (K\cdot V)(x_1) \circ (K\cdot V)(x_2)\rangle \sim \iint K(x_1, y_1)\,\langle V(y_1) \circ V(y_2)\rangle\,K^T(y_2, x_2)\,dy_1\,dy_2 \sim K\cdot R\cdot K^T. \qquad (2.2)$$

The aim is to solve equation 2.2 for K. Obviously, this is equivalent to Kᵀ · K ∼ R⁻¹ (assuming K and R are nonsingular), which has a solution

$$K \sim R^{-1/2}, \qquad (2.3)$$

unique up to a unitary transformation. Following Atick and Redlich (1992), we fix a solution by requiring that it have the reflection symmetry found in the input signal. The corresponding recurrent filter is then

$$W = \delta - K^{-1} = \delta - \rho R^{1/2}. \qquad (2.4)$$
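As a concrete illustration, here is a minimal numerical sketch of equations 2.3 and 2.4, assuming a small symmetric positive-definite correlation matrix R has already been estimated; the toy matrix below is purely illustrative.

```python
import numpy as np

def decorrelating_filters(R, rho=1.0):
    """Return K = rho * R^(-1/2) (the symmetric, reflection-invariant root)
    and the recurrent kernel W = I - K^(-1)."""
    w, U = np.linalg.eigh(R)                       # R = U diag(w) U^T
    K = rho * (U @ np.diag(w ** -0.5) @ U.T)       # feedforward equivalent (eq. 2.3)
    W = np.eye(len(R)) - np.linalg.inv(K)          # recurrent kernel (eq. 2.4)
    return K, W

# Toy correlation matrix for three units with nearest-neighbor correlations.
R = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
K, W = decorrelating_filters(R)
print(np.round(K @ R @ K.T, 6))                    # ~ identity: outputs decorrelated
```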
In order to compare our calculations to existing cortical structures, we must include the effects of noise on the system. We need to recognize that this is a different system, which is described by O1 = V + Nv + W ∗ M ∗ (O1 + No ),
(2.5)
where Nv is the input noise and No is the individual unit noise in the recurrently connected layer. Similarly to Atick and Redlich (1990, 1992), we can modify the decorrelation kernel W derived from equation 2.1 to W ∗ M. The form of the correction M, which minimizes the effects of noise in the system, is obtained by minimizing the distance between the states
of equation 2.5 and the noise-free system of equation 2.1. If we define χ²(M) = ⟨(O − O₁)²⟩ = Tr⟨(O − O₁)(O − O₁)ᵀ⟩ as the distance function, the solution to ∂χ²(M)/∂M = 0 will give us the appropriate kernel:

$$W ∗ M = \delta - (R + N_v^2 + N_o^2) ∗ (\rho R^{1/2} + N_o^2)^{-1}. \qquad (2.6)$$

This has the correct asymptotics as Nv and No approach zero. The filter behaves well for large No, turning mostly into an identity transformation (interpretation: if your neighbors are unreliable, do not trust them). It cannot handle well large Nv and reaches −∞ proportionally to Nv². This is plausible, since it is expected that the feedforward filter, responsible for providing the input signal, has decreased its noise level.

It is advantageous to write this expression in measurable variables. When working with the filter, we have at our disposal Vm = V + Nv, which generates Rm = R + Nv², since V and Nv are not correlated. In these terms,

$$W ∗ M = \delta - (R_m + N_o^2) ∗ \left((R_m - N_v^2)^{1/2} + N_o\right)^{-1}. \qquad (2.7)$$
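A sketch of the noise-corrected kernel of equation 2.7 in discrete (matrix) form, assuming scalar noise variances Nv and No and a measured correlation matrix Rm; the numbers below are illustrative only.

```python
import numpy as np

def sqrtm_sym(A):
    """Symmetric matrix square root via eigendecomposition."""
    w, U = np.linalg.eigh(A)
    return U @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ U.T

def noise_corrected_kernel(Rm, Nv, No):
    eye = np.eye(len(Rm))
    signal_sqrt = sqrtm_sym(Rm - Nv**2 * eye)      # (Rm - Nv^2)^(1/2)
    return eye - (Rm + No**2 * eye) @ np.linalg.inv(signal_sqrt + No * eye)

Rm = np.array([[1.2, 0.5],
               [0.5, 1.2]])
print(noise_corrected_kernel(Rm, Nv=0.3, No=0.2))
```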
Equation 2.5 may be used to model redundancy reduction with lateral interactions for any sensory modality. Here we apply it to describe a layer of the visual cortex. In the equation, O models the activity of a cortical layer with complex orientation-selective cells (e.g., V1 layer 2 in primates), W specifies its lateral connectivity, and V is the effective input to this layer from the previous layer (V1 layer 4) (see Figure 3). Throughout the rest of the article, we shall use

$$V(\theta, x) = \left|\frac{d}{dn_\theta} L(x)\right|. \qquad (2.8)$$
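A minimal sketch of this input signal: the magnitude of the directional derivative of the log-contrast image along a preferred orientation, using the 2 × 2 discrete derivative kernels introduced in section 3.1. The random image is a stand-in for a natural scene, and the function names are ours.

```python
import numpy as np

def derivatives(L):
    # partial_x1 = (1/2)[[-1, 1], [-1, 1]],  partial_x2 = (1/2)[[-1, -1], [1, 1]]
    gx = 0.5 * (L[:-1, 1:] + L[1:, 1:] - L[:-1, :-1] - L[1:, :-1])
    gy = 0.5 * (L[1:, :-1] + L[1:, 1:] - L[:-1, :-1] - L[:-1, 1:])
    return gx, gy

def oriented_input(L, theta):
    gx, gy = derivatives(L)
    return np.abs(np.cos(theta) * gx + np.sin(theta) * gy)   # V(theta, x), eq. 2.8

I = np.random.rand(64, 64) + 0.1                   # toy positive luminosity image
L = np.log(I / I.mean())                           # log-contrast, L = ln(I / Io)
V = oriented_input(L, theta=np.pi / 4)
```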
This assumption is based on empirical evidence (Schiller, Finlay, & Volman, 1976; Celebrini, Thorpe, Trotter, & Imbert, 1993) of cells that are selective to the orientation, but not the contrast, of a local contour or grating. This effect can be achieved, for example, by summing the output of a set of on- and off-center lateral geniculate nucleus (LGN) cells along the preferred orientation, where the nonlinearity that we use is already present. It captures just that part of the known properties of cortical complex cells pertinent to our discussion of lateral connections.

3 Results

The structure of the neuronal circuits described in the previous section reflects the structure of the signal that they process. Thus, we need to study the statistical structure of contours in order to uncover the optimal system that decorrelates them.
Figure 1: (A) Local autocorrelation function. (B) Local autocorrelation power spectrum.
3.1 Correlation Structure of the Input Signal. Let us consider the correlation structure of contours in natural images as represented by the input signal (see equation 2.8). In all quantities we use the log-contrast function L(x) = ln(I(x)/Io) and its derivatives (Ruderman, 1994), where I(x) is the luminosity intensity. This formulation provides intrinsic units of measurement. Using the luminosity intensity I(x) for the calculations yields substantially similar results. All averaging was done on the set of images by Letraset (1993): about 100 different pictures, with 10 to 20 256² samples taken from each picture. We used a discrete approximation of the derivative operation by convolving the image with

$$\partial_{x_1} \equiv \frac{1}{2}\begin{pmatrix} -1 & 1 \\ -1 & 1 \end{pmatrix} \quad\text{and}\quad \partial_{x_2} \equiv \frac{1}{2}\begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix}.$$

We first consider the local correlation structure of contours. The correlation function R(θ − θ′) ≡ ⟨Vθ(x)Vθ′(x)⟩ depends on only the relative angle between preferred orientations (see Figure 1A). This allows us to calculate the autocorrelation matrix easily from its Fourier series. Indeed, in Fourier space, R(f) is just the square of the power spectrum of the underlying signal. Its power spectrum exhibits an interesting scaling relation, shown in Figure 1B, wherein R(η) ∼ η^m with m ≈ −4.3 (solid line).

Next we consider the full correlation structure of contours. The correlation function R(θ₁, θ₂, x − y) = ⟨V(x, θ₁)V(y, θ₂)⟩ is spatially translation invariant, but there is an interaction between the spatial and angular parts, so it is no longer translation invariant in angles.
Figure 2: The autocorrelation function of a set with eight orientations. The set is indexed by the preferred angles θi, θj in units of π/8, and each Rij has spatial structure, represented as a 32 × 32 square. The horizontal labels denote the preferred angle of the central unit, and the vertical labels denote the preferred angle of the unit to which the central one is correlated. Dark represents high correlation; light is low correlation.
The form of the autocorrelator implies that certain contour structures (X-crossings, T-crossings) are rare in the set of images (note the weaker correlations at cross-orientations) and are most likely processed by higher levels, if at all (recall end stopping and virtual contours). A discrete version of this structure can be seen in Figure 2. The correlation matrix R(θ₁, θ₂, x) is approximately rotationally equivariant: Ω_α[R(θ₁, θ₂, x)] ≡ R(θ₁ − α, θ₂ − α, Ω_{−α}[x]) = R(θ₁, θ₂, x), where Ω_α[R] is the action of the rotation group on the correlation field and Ω_{−α}[x] is the action of the rotation group on the spatial domain of R.
Figure 3: Diagrammatic representation of the hypercolumn model.
Because of the space-angle interaction, a simple picture of the power spectrum cannot be constructed.

3.2 A Model of a Hypercolumn. Let us now consider the implications of model and signal structure for the connectivity between units in a single cortical hypercolumn. We use the classic definition of a hypercolumn: the set of all orientation-selective units over a common patch in visual space (Hubel & Wiesel, 1961), generally considered to occupy about a mm² of cortical surface in primates (see Figure 3). We do not include binocular effects.

Applying the theory in this case is rather simple because of the symmetries present in the input signal (see section 3.1). We assume (see equation 2.8, V(θ, xo)) that orientation-selective cells receive as input from the previous layer the magnitude of the directional derivative in their preferred orientation. By virtue of the computation performed, the outputs of the orientation-selective units are correlated. The local correlation function R(θ − θ′) is approximately homogeneous, depending on the relative angle |θ − θ′| only (see Figure 1) and not on the spatial location xo of the hypercolumn. Knowing the local correlations, we can find a linear filter that decorrelates the outputs after it is applied. Following Atick and Redlich (1990, 1992), this filter is the solution to

$$(K ∗ R ∗ K)(x) = \rho^2\,\delta(x). \qquad (3.1)$$

Since R is translation invariant in angle space, we can apply Fourier analysis and solve the equation

$$K(f) \cdot R(f) \cdot K(f)^\dagger = \rho^2, \qquad (3.2)$$

the solution to which is K(f) = ρ R(f)^{−1/2}, unique up to a unitary transformation. The corresponding recurrent filter has certain scaling properties, due to the properties of the autocorrelator. Indeed, Figure 1B suggests that the recurrent filter behaves as W(f) ∼ 1 − |f|^{−2.15}, so that in angle variables W(θ) ≈ δ(θ) − exp(−|θ|), since F[δ(θ) − exp(−|θ|)] = 1 − 1/(1 + f²) ≈ 1 − f^{−2}, which is close to our result, W(f) ∼ 1 − |f|^{−2.15}. This structure is mapped spatially in the cortical layer so that the connectivity in angle translates to spatial connectivity in cortical coordinates. If we consider a cortical linear zone, which translates to mapping our modeled hypercolumn approximately along a straight line, the connectivity we calculate is proportional to δ(θ) − exp(−|x|). This is consistent with the data compiled by Sholl (1956).

Figure 4: Local recurrent connectivity in the presence of noise. The connection strength W depends only on the relative angle θ between units. (A) High signal-to-noise ratio. (B) Low signal-to-noise ratio.

If we include noise in the calculation according to equation 2.6, we obtain a filter that depends on the signal-to-noise ratio of the input level. We model the noise process here as a set of independent noise processes for each unit, with (Nv)i being the input noise and (No)i the output noise for unit i. The result for S/No ∼ 3 is shown in Figure 4A. We observe the broadening of the central connections, caused by the need to average local results in order to overcome the noise. Above a certain level of noise, the connectivity changes from inhibitory to purely excitatory (see Figure 4B).
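A numerical sketch of this angular decorrelation calculation, assuming a toy exponential correlation function on a ring of orientation-selective units in place of the measured R(θ − θ′); all constants are illustrative.

```python
import numpy as np

n = 64
theta = np.arange(n) * 2 * np.pi / n               # lag angles around the ring
d = np.minimum(theta, 2 * np.pi - theta)           # circular distance from zero lag
R = np.exp(-d / 0.5)                               # toy angular correlation function

Rf = np.fft.fft(R).real                            # spectrum is real for symmetric R
Rf = np.maximum(Rf, 1e-9)                          # guard against nonpositive modes
rho = 1.0
Kf = rho * Rf ** -0.5                              # K(f) = rho R(f)^(-1/2), eq. 3.2
Wf = 1.0 - 1.0 / Kf                                # W = delta - K^(-1) in Fourier space
W = np.fft.ifft(Wf).real                           # recurrent connectivity over angle
print(np.round(W[:4], 3))                          # self term followed by near neighbors
```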
With this filter, we can directly compare calculations obtained from applying it to a simulated oriented grating with physiological measurements of the orientation selectivity of cells in the cortex. The results of such comparisons are presented in Figure 5, in which we plot the activity of orientation-selective cells in arbitrary units versus stimulus angle in degrees. We see very good matches with the experimental results of several independent experiments (Celebrini et al., 1993; Schiller et al., 1976; Orban, 1984). We expect some discrepancies, such as in Figures 5D and 5F, which can be attributed to the threshold nature of real neural units. Thus, we can use the model to classify physiologically distinct cells by the value of the No parameter that describes them. Indeed, since this parameter models the intrinsic noise of a neural unit, we expect it to differ across populations.

For comparison we have also included in Figure 5A a plot of the input signal (–) and an attempt to fit the data with a simpler curve like cos(θ) (...) on the same domain. We observe that the output tuning is much sharper than the input to our system. We also note that the cos(θ) function is a poor description of the actual tuning curve because of the variable widths observed in experiments, which cos cannot match without introducing overshoots at the ends (had we tried to do that by using a higher frequency).

There are, of course, many models that can explain the behavior of a single hypercolumn. Our model is mostly useful as an exercise in performing the redundancy reduction calculation for the generic recurrent architecture. What is much more interesting is its application in a spatially extended model, which we consider next.

3.3 A Model of a Hypercolumn Network. We can perform a similar analysis of how the spatial statistical structure of contours in natural images is reflected in a model of lateral cortical connectivity. We assume a set of orientation-selective units with inputs V(θ, x) (see equation 2.8). The correlation function R(θ, θ′, x − y) is approximately spatially translation invariant. The structure of the correlation matrix depends on both the correlations between contours and the architecture of the detectors. The fact that the output units calculate |dL(x)/dn_θ| creates many local correlations between neighboring units. Any nonlocal structure in the detector set is due to a similar structure present in the correlations between contours (see section 3.1).

The structure of the translation-invariant correlation matrix R(x) is shown in Figure 2. This can be interpreted as the correlation between the input to the center hypercolumn and the inputs to other hypercolumns. Because of the symmetries present, it suffices to show only the structure of the kernels W0j(x) for j = 0, 1, 2, 3, since the rest of the kernel can be obtained from these parts by symmetry.
Figure 5: Comparison with experimental data. The activity of orientation-selective cells in arbitrary units is plotted against stimulus angle in degrees. Experimental points are denoted with circles, calculated results with a solid line. The variation in the forms of the tuning curves could be accounted for by selecting different noise levels in our noise model. (A) Data from a complex cell from Orban (1984) and fit for Nv = 0.3, No = 0.45. Also included here for comparison are the input signal (–) and a fit to (1 − |cos(θ + π/2)|). (B) Data from cell CAJ4 in Celebrini et al. (1993) and fit for Nv = 0.1, No = 0.2. (C) Data from cell CAK2 in Celebrini et al. (1993) and fit for Nv = 0.35, No = 0.1. (D) Data from a simple cell from Orban (1984) and fit for Nv = 1.0, No = 0.45. (E) Data from a simple cell in Schiller et al. (1976) and fit for Nv = 0.06, No = 0.001. (F) Data from a simple cell in Schiller et al. (1976) and fit for Nv = 15.0, No = 0.01.
Applying the expression for an optimal filter in the presence of noise (see equation 2.7) leads to the result shown in Figure 6. Since the filter is also assumed to be translation invariant, the pictures can be interpreted as the connectivity of the center hypercolumn to the rest of the network. This is seen to be strongest at the preferred orientation and weakest in the two adjacent bands, which represent connections to units with a perpendicular preferred orientation. As suggested by the analysis in section 2, the connectivity depends on the relative level of noise. For high signal contrast/low noise levels, the connectivity is mostly inhibitory (see Figures 6 and 7A). It changes to facilitory for low contrast/high noise signals (see Figure 7B).

This model is too simple to compare connectivity strength with physiological findings directly. We made an attempt to relate the calculated con-
nectivity to some measurements of orientation tuning of lateral connections (Bosking, Zhang, Schofield, & Fitzpatrick, 1997), with the assumption that the number of synaptic boutons measured by Bosking et al. is proportional to the strength that we calculate. We used the local activity tuning curves from the same work to obtain the two free parameters in our model and then applied their protocol to the spatially extended case. The results, shown in Figure 8, agree at least qualitatively with the experimental data.

The connectivity we calculate is consistent with various physiological and psychophysical findings (Hirsch & Gilbert, 1991; Kisvárday, Martin, Freund, Magloczky, Whitteridge, & Somogyi, 1986; McGuire, Gilbert, Wiesel, & Rivlin, 1991; Fitzpatrick, 1996; Polat & Norcia, 1996; Kitano, Niiyama, Kasamatsu, Sutter, & Norcia, 1994) in the anisotropy of connections and the signal-dependent changes in interaction we find necessary for optimal processing. The linear decrease in interaction with contrast that Polat and Norcia (1996) observe is consistent with our calculation in equation 2.6 for low recurrent noise and low contrast. Unfortunately, the authors stopped their experiment just before observing inhibitory interactions. The inhibitory long-range interactions we calculate for high-contrast stimuli were shown to occur in cats by Kitano et al. (1994) and Hirsch and Gilbert (1991). Our main result, that activity in area V1 is on average uncorrelated, was confirmed for a particular class of stimuli (Firen, Eckhorn, & Reitboeck, 1996; Gawne, Kajer, Hertz, & Richmond, 1996), though the signals used there were not exactly "natural." We study more details of this structure in a related article (Mundel, Dimitrov, & Cowan, 1997) and find it very useful for explaining several well-known visual effects.

4 Discussion and Conclusions

We have shown that properties of orientation-selective cells in the visual cortex can be partially described by linear systems analysis. Using this, we obtain results that are in good agreement with physiological and anatomical data from single-cell recordings and imaging. We can use the parameters of the model to classify differing cells in the visual cortex.

We achieved this by using a recurrent network as the underlying model. This was chosen for several reasons. First, we tried to give the model biological plausibility, and recurrency is well established at the cortical level. Another related heuristic argument is that although there exists a feedforward network with equivalent properties, as shown in section 2, such a network would require an additional layer of cells, while the recurrent model allows for both feedforward processing (the input to our model) and manipulation of its output (the decorrelation procedure in our model).

As we show in section 3.3, the optimal connectivity structure depends on the level of noise in the system and can produce inhibitory as well as
facilitory connections. This seems to contradict several physiological observations suggesting that most of the long-range lateral connections in the cortex are excitatory (Hirsch & Gilbert, 1991). It should be noted, however, that at least 20% of the excitatory fibers end on inhibitory interneurons (Kisvárday et al., 1986; McGuire et al., 1991), which then synapse with the soma of neighboring orientation-selective cells, providing an opportunity to influence them much more strongly than an excitatory synapse on a distal dendrite (Hirsch & Gilbert, 1991; Kitano et al., 1994). It was suggested by Ben-Yishai, Bar-Or, and Sompolinsky (1995) and Mundel et al. (1997) that neither excitation nor inhibition per se, but their balance, is responsible for the intricacy of cortical operations. This model tries to shed some light on the reason for such behavior: in this case, the change of signal fidelity, which necessitates modification in the cortical circuitry. Similar arguments can be presented reconciling other suggestions for the function of lateral cortical connectivity (feature linking, contour continuation, endstopping, etc.). This model does not preclude any of these functions. As was shown by several authors (van Vreeswijk, Abbott, & Ermentrout, 1994; Gerstner, van Hemmen, & Cowan, 1996), the cortex can achieve synchronization not only through excitatory connections but also via inhibitory ones. In fact, phase locking through inhibitory mechanisms is more robust.

Our work is based on previous suggestions relating the second-order statistics of the visual environment to the structure of the visual pathway. It was thought before (Atick & Li, 1994) that this particular relation can describe only early visual pathways but is insufficient to account for the structure of the striate cortex. We show here and in a related work (Mundel et al., 1997) that the structures suggested by the redundancy reduction procedure are sufficient to describe many properties of the visual cortex, thus strengthening the possibility that this is a basic organizing principle for the visual system, and one should anticipate its appearance in higher regions. Our work suggests that lateral connections may be used to perform decorrelation on any set of features calculated by the feedforward pathway from the previous layer. Recent work by Penev and Atick (1996) and Olshausen and Field (1996) suggests an alternate interpretation that uses sparse coding. Imposing topographic representation and sparse coding in addition to the condition for least reconstruction error leads to both localized feedforward connections and lateral connections, which sparsify and effectively decorrelate the output.
Figure 6: Facing page. The connectivity function of a set with eight discrete detectors. The kernels are indexed in units of π/8. Only the connections of the 0th central unit to units 0, 1, 2, and 3 are shown, since the rest can be recovered by the symmetries described in the article. One can observe on the contour plots the nonisotropic connectivity that depends on the relative angle between preferred directions of connected units.
Figure 7: Lateral interactions between units with the same orientation preference for high-contrast (A) and low-contrast (B) signals.
What is even more intriguing is the possibility that this method can account for the structure of other sensory pathways and cortices. We know, for example, that the somatosensory pathway and cortex are similar to the visual pathway and cortex, because of the similar environments that they encounter (luminosity, edges, and textures have analogies in somesthesia). Similar analogies may be expected for the auditory pathway. We expect even better results if we consider a more realistic nonlinear model for the neural units. In fact, this improves tremendously the information-processing abilities of a bounded system, since it captures higher-order correlations in the signal and allows for true minimization of the mutual information in the system rather than just an upper bound of it, as we do here. Very promising results in this direction have been recently described by Bell and Sejnowski (1997), Lin, Grier, and Cowan (1997), and Olshausen and Field (1996), and we intend to consider their implications for our model.
Figure 8: Bouton distribution (+) and calculated average interaction strength (–) between units.
Acknowledgments

This work was supported in part by a McDonnell Foundation grant. We wish to thank Trevor Mundel, Dawei Dong, Dan Ruderman, Bruno Olshausen, and the two anonymous reviewers for many helpful comments and suggestions.

References

Atick, J. J., & Li, Z. (1994). Towards a theory of the striate cortex. Neural Computation, 6, 127–146.
Atick, J. J., & Redlich, N. N. (1990). Towards a theory of early visual processing. Neural Computation, 2, 308–320.
Atick, J. J., & Redlich, N. N. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–210.
Atick, J. J., & Redlich, N. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5, 45–60.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61, 183–193.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communications. Cambridge, MA: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bell, A. T., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proceedings of the National Academy of Sciences, USA, 92(9), 3844–3888.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6), 2112–2127.
Celebrini, S., Thorpe, S., Trotter, Y., & Imbert, M. (1993). Dynamics of orientation coding in area V1 of the awake primate. Visual Neuroscience, 10, 811–825.
Dong, D. (1994). Associative decorrelation dynamics: A theory of self-organization and optimization in feedback networks. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 925–932). Cambridge, MA: MIT Press.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am., 4, 2379–2394.
Firen, A., Eckhorn, R., & Reitboeck, H. J. (1996). Fast oscillations in V1 of awake monkey: Synchronization depends on cortical distance, on angle between preferred orientation and on stimulus orientation. Poster at the Society for Neuroscience 26th Annual Meeting, Washington, DC.
Fitzpatrick, D. (1996). The functional organization of local circuits in visual cortex: Insights from the study of tree shrew striate cortex. Cerebral Cortex, 6(3), 329–341.
Gawne, T. J., Kajer, T. W., Hertz, J. A., & Richmond, B. J. (1996). Adjacent visual cortical complex cells share about 20% of their stimulus-related information. Cerebral Cortex, 6, 482–489.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Computation, 8(8), 1653–1676.
Hancock, P., Baddeley, R., & Smith, L. S. (1992). The principal components of natural images. Network: Computations in Neural Systems, 3, 61–70.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat's visual cortex. Journal of Neuroscience, 11(6), 1800–1809.
Hubel, D., & Wiesel, T. (1961). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology (London), 195, 215–243.
Kisvárday, Z. F., Martin, A. C., Freund, T. F., Magloczky, Z., Whitteridge, D., & Somogyi, P. (1986). Synaptic targets of HRP-filled layer III pyramidal cells in the cat striate cortex. Experimental Brain Research, 64, 541–552.
Kitano, M., Niiyama, K., Kasamatsu, T., Sutter, E. E., & Norcia, A. M. (1994). Retinotopic and nonretinotopic field potentials in cat visual cortex. Visual Neuroscience, 11, 953–977.
Letraset. (1993). Having fun [CD-ROM]. (See http://www.letraset.com.)
Lin, J. K., Grier, D. G., & Cowan, J. D. (1997). Faithful representation of separable distributions. Neural Computation, 9(6), 1303–1318.
Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. USA, Neurobiology, 83, 8390–8394.
McGuire, B., Gilbert, C. D., Wiesel, T. N., & Rivlin, P. K. (1991). Target morphology and specificity of horizontal connections of pyramidal neurons in the primate visual cortex. Journal of Comparative Neurology, 305(6), 370–392.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Simple model for cortical orientation selectivity. In J. Bower (Ed.), Computational neuroscience: Trends in research. New York: Plenum.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Orban, G. A. (1984). Neuronal operations in the visual cortex. Berlin: Springer-Verlag.
Penev, P. S., & Atick, J. J. (1996). Local feature analysis: A general statistical theory for object representation. Network: Computations in Neural Systems, 7(3), 477–500.
Polat, U., & Norcia, A. M. (1996). Neurophysiological evidence for contrast dependent long-range facilitation and suppression in the human visual cortex. Vision Research, 36(14), 2099–2109.
Ruderman, D. L. (1994). The statistics of natural images. Network: Computations in Neural Systems, 5, 517–548.
Ruderman, D. L., & Bialek, W. (1993). Statistics of natural images: Scaling in the woods. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6. San Mateo, CA: Morgan Kaufmann.
Schiller, P., Finlay, B., & Volman, S. (1976). Quantitative studies of single-cell properties in monkey striate cortex. II. Orientation specificity and ocular dominance. J. Neurophysiol., 39(6), 1320–1333.
Sholl, D. A. (1956). The organization of the cerebral cortex. London: Methuen.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. Journal of Computational Neuroscience, 1, 313–321.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.

Received July 29, 1996; accepted February 9, 1998.
LETTER
Communicated by Erkki Oja
Receptive Field Formation in Natural Scene Environments: Comparison of Single-Cell Learning Rules Brian S. Blais N. Intrator H. Shouval Leon N. Cooper Physics Department and Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, U.S.A.
We study several statistically and biologically motivated learning rules using the same visual environment, one made up of natural scenes, and the same single-cell neuronal architecture. This allows us to concentrate on the feature extraction and neuronal coding properties of these rules. Included in these rules are kurtosis and skewness maximization, the quadratic form of the Bienenstock-Cooper-Munro (BCM) learning rule, and single-cell independent component analysis. Using a structure removal method, we demonstrate that receptive fields developed using these rules depend on a small portion of the distribution. We find that the quadratic form of the BCM rule behaves in a manner similar to a kurtosis maximization rule when the distribution contains kurtotic directions, although the BCM modification equations are computationally simpler.

1 Introduction

Recently several learning rules that develop simple cell-like receptive fields in a natural image environment have been proposed (Law & Cooper, 1994; Olshausen & Field, 1996; Bell & Sejnowski, 1997). The details of these rules are different, as is their computational reasoning; however, all depend on statistics of order higher than two, and all produce sparse distributions. In a sparse distribution, most of the mass of the distribution is concentrated around zero, and the rest of the distribution extends much farther out. In other words, a neuron that has a sparse response responds strongly to a small subset of patterns in the input environment and weakly to all others. Bimodal distributions, for which the mode at zero has significantly more mass than the nonzero mode, and exponential distributions are examples of sparse distributions, whereas gaussian and uniform distributions are not considered sparse.

It is known that many projections of the distribution of natural images have long-tailed, or exponential, distributions (Daugman, 1988; Field, 1994). It has been argued that local linear transformations such as Gabor filters or center-surround produce exponential-tailed histograms
(Ruderman, 1994). Reasons given vary from the specific arrangements of the Fourier phases of natural images (Field, 1994) to the existence of edges. Since the exponential distribution is optimal from the viewpoint of information theory under the assumption of positive and fixed average activity (Ruderman, 1994; Levy & Baxter, 1996; Intrator, 1996; Baddeley, 1996), it is a natural candidate for detailed study in conjunction with neuronal learning rules.

In what follows we investigate several specific modification functions that have the general properties of BCM synaptic modification functions (Bienenstock, Cooper, & Munro, 1982) and study their feature extraction properties in a natural scene environment. BCM synaptic modification functions are characterized by a negative region for small postsynaptic depolarization, a positive region for large postsynaptic depolarization, and a threshold that moves and switches between the Hebbian and anti-Hebbian regions. Several of the rules we consider are derived from standard statistical measures (Kendall & Stuart, 1977), such as skewness and kurtosis, based on polynomial moments. We compare these with the quadratic form of BCM (Intrator & Cooper, 1992), though this is not the only form that could be used. By subjecting all of the learning rules to the same input statistics and retina/lateral geniculate nucleus (LGN) preprocessing, and by studying in detail the single-neuron case, we eliminate possible network-lateral interaction effects and can examine the properties of the learning rules themselves.

We start with a motivation for the learning rules used in this study and then present the initial results. We then explore some of the similarities and differences between the rules and the receptive fields they form. Finally, we introduce a procedure for directly measuring the sparsity of the representation a neuron learns; this gives us another way to compare the learning rules and a more quantitative measure of the concept of sparse representations.

2 Motivation

We use two methods to motivate the use of the particular rules. One comes from projection pursuit (Friedman, 1987), where we use an energy function to find directions in which the projections of the data are nongaussian (see Huber, 1985, for a review); the other is independent component analysis (ICA) (Comon, 1994), where one seeks directions in which the projections are statistically independent. These methods are related, but they provide two different approaches to this work.

2.1 Exploratory Projection Pursuit and Feature Extraction. Diaconis and Freedman (1984) show that for most high-dimensional clouds (of points), most low-dimensional projections are approximately gaussian. This finding suggests that important information in the data is conveyed in those directions whose single-dimensional projected distribution is far from gaussian. There is, however, some indication (Zetzsche, 1997) that for natural images,
random local projections yield somewhat longer-tailed distributions than gaussian. We can still justify this approach, because interesting structure can still be found in nonrandom directions that yield projections farther from gaussian.

Intrator (1990) has shown that a BCM neuron can find structure in the input distribution that exhibits deviation from gaussian distribution in the form of multimodality in the projected distributions. This type of deviation, which is measured by the first three moments of the distribution, is particularly useful for finding clusters in high-dimensional data through the search for multimodality in the projected distribution rather than in the original high-dimensional space. It is thus useful for classification or recognition tasks. In the natural scene environment, however, the structure does not seem to be contained in clusters: projection indices that seek clusters never find them. In this work we show that the BCM neuron can still find interesting structure in nonclustered data.

The most common measures for deviation from gaussian distribution are skewness and kurtosis, which are functions of the first three and four moments of the distribution, respectively. Rules based on these statistical measures satisfy the BCM conditions proposed by Bienenstock et al. (1982), including a threshold-based stabilization. The details of these rules and some of the qualitative features of the stabilization are different, however. Some of these differences are seemingly important, while others seem not to affect the results significantly. In addition, there are some learning rules, such as the ICA rule of Bell and Sejnowski (1997) and the sparse coding algorithm of Olshausen and Field (1996), which have been used with natural scene inputs to produce oriented receptive fields. We do not include these in our comparison because their learning is not based on the activity and weights of a single neuron, and they would thus detract from our immediate goal of comparing rules with the same input structure and neuronal architecture.

2.2 Independent Component Analysis. Recently it has been claimed that the independent components of natural scenes are the edges found in simple cells (Bell & Sejnowski, 1997). This was achieved through the maximization of the mutual entropy of a set of mixed signals. Others (Hyvarinen & Oja, 1997) have claimed that maximizing kurtosis, with the proper constraints, can also lead to the separation of mixed signals into independent components. This alternate connection between kurtosis and receptive fields leads us into a discussion of ICA.

ICA is a statistical signal processing technique whose goal is to express a set of random variables as linear combinations of statistically independent variables. We observe k scalar variables (d1, d2, . . . , dk)ᵀ ≡ d, which are assumed to be linear combinations of n unknown statistically independent variables (s1, s2, . . . , sn)ᵀ. We can express this mixing of the sources s as

$$\mathbf{d} = A\mathbf{s}, \qquad (2.1)$$
where A is an unknown k × n mixing matrix. The problem of ICA is to estimate both the mixing matrix A and the sources s using only the observation of the mixtures di . Using the feature extraction properties of ICA, the columns of A represent features, and si represent the amplitude of each feature in the observed mixtures d. These are the features in which we are interested. In order to perform ICA, we first make a linear transformation of the observed mixtures c = Md.
(2.2)
These linearly transformed variables would be the outputs of the neurons in a neural network implementation, and M, the unmixing matrix or matrix of features, would be the weights. Two recent methods for performing ICA (Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996) involve maximizing the entropy of a nonlinear function of the transformed mixtures, σ(c), and minimizing the mutual information of σ(c) with respect to the transformation matrix, M, so that the components of σ(c) are independent. These methods are, by their definition, multineuron algorithms and therefore do not fit well into the framework of this study.

The search for independent components relies on the fact that a linear mixture of two nongaussian distributions will become more gaussian than either of them. Thus, by seeking projections c = (d · m) that maximize deviations from gaussian distribution, we recover the original (independent) signals. This explains the connection of ICA to the framework of exploratory projection pursuit (Friedman & Tukey, 1974; Friedman, 1987). In particular it holds for the kurtosis projection index, since a linear mixture will be less kurtotic than its original components. Kurtosis and skewness have also been used for ICA as approximations of the negative entropy (Jones & Sibson, 1987). It remains to be seen whether the basic assumption used in ICA, that the signals are made up of independent sources, is valid. The fact that different ICA algorithms, such as kurtosis and skewness maximization, yield different receptive fields could be an indication that the assumption is not completely valid.

3 Synaptic Modification Rules

In this section we outline the derivation of the learning rules in this study, using either the method from projection pursuit or ICA. Neural activity is assumed to be a positive quantity, so for biological plausibility, we denote by c the rectified activity σ(d · m) and assume that the sigmoid is a smooth, monotone function with a positive output (a slight negative output is also allowed); σ′ denotes the derivative of the sigmoid. The rectification is required for all rules that depend on odd moments because these vanish in a symmetric distribution such as that produced by natural scenes. We also
demonstrate later that the rectification makes little difference for learning rules that depend on even moments. We study the following measures:

Skewness 1. This measures the deviation from symmetry (Kendall & Stuart, 1977, for review) and is of the form

$$S_1 = E[c^3]/E^{1.5}[c^2]. \qquad (3.1)$$

A maximization of this measure via gradient ascent gives

$$\nabla S_1 = \frac{1}{\Theta_M^{1.5}}\,E\big[c\,(c - E[c^3]/E[c^2])\,\sigma'\,\mathbf{d}\big] = \frac{1}{\Theta_M^{1.5}}\,E\big[c\,(c - E[c^3]/\Theta_M)\,\sigma'\,\mathbf{d}\big], \qquad (3.2)$$

where Θ_M is defined as E[c²].

Skewness 2. Another skewness measure is given by

$$S_2 = E[c^3] - E^{1.5}[c^2]. \qquad (3.3)$$
This measure requires a stabilization mechanism, because it is not invariant under constant multiples of the activity c. We stabilize the rule by requiring that the vector of weights, which is denoted by m, has a fixed norm, say ‖m‖ = 1. The gradient of this measure is

$$\nabla S_2 = 3E\big[(c^2 - c\sqrt{E[c^2]}\,)\,\sigma'\,\mathbf{d}\big] = 3E\big[c\,(c - \sqrt{\Theta_M})\,\sigma'\,\mathbf{d}\big], \qquad (3.4)$$

subject to the constraint ‖m‖ = 1.

Kurtosis 1. Kurtosis measures deviation from gaussian distribution mainly in the tails of the distribution. It has the form

$$K_1 = E[c^4]/E^2[c^2] - 3. \qquad (3.5)$$

This measure has a gradient of the form

$$\nabla K_1 = \frac{1}{\Theta_M^2}\,E\big[c\,(c^2 - E[c^4]/E[c^2])\,\sigma'\,\mathbf{d}\big] = \frac{1}{\Theta_M^2}\,E\big[c\,(c^2 - E[c^4]/\Theta_M)\,\sigma'\,\mathbf{d}\big]. \qquad (3.6)$$
Kurtosis 2. As before, there is a similar form that requires some stabilization:

$$K_2 = E[c^4] - 3E^2[c^2]. \qquad (3.7)$$

This measure has a gradient of the form

$$\nabla K_2 = 4E\big[(c^3 - 3cE[c^2])\,\sigma'\,\mathbf{d}\big] = 4E\big[c\,(c^2 - 3\Theta_M)\,\sigma'\,\mathbf{d}\big], \qquad \|\mathbf{m}\| = 1. \qquad (3.8)$$
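For concreteness, here is a small helper computing the four projection indices of equations 3.1 through 3.8 from a sample of rectified activities c. The Laplacian toy sample below stands in for the projections of natural-image patches and is purely an assumption.

```python
import numpy as np

def projection_indices(c):
    m2, m3, m4 = (np.mean(c ** k) for k in (2, 3, 4))
    return {
        "S1": m3 / m2 ** 1.5,       # multiplicative skewness (eq. 3.1)
        "S2": m3 - m2 ** 1.5,       # additive skewness (eq. 3.3)
        "K1": m4 / m2 ** 2 - 3,     # multiplicative kurtosis (eq. 3.5)
        "K2": m4 - 3 * m2 ** 2,     # additive kurtosis (eq. 3.7)
    }

rng = np.random.default_rng(2)
print(projection_indices(np.abs(rng.laplace(size=100_000))))  # sparse, rectified sample
```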
In all the above, the maximization of the measure can be used as a goal for projection seeking. The variable c can be thought of as a (nonlinear) projection of the input distribution onto a certain vector of weights, and the maximization then defines a learning rule for this vector of weights. The multiplicative forms of both kurtosis and skewness do not require an extra stabilization constraint.

Kurtosis 2 and ICA. It has been shown (Hyvarinen & Oja, 1996) that kurtosis, defined as K₂ = E[c⁴] − 3E²[c²], can be used for ICA. This can be seen by using the property of this kurtosis measure, K₂(x₁ + x₂) = K₂(x₁) + K₂(x₂) for independent variables, and defining z = Aᵀm. We then get

$$K_2(\mathbf{m}\cdot\mathbf{d}) \equiv K_2(\mathbf{m}^T\mathbf{d}) = K_2(\mathbf{m}^T A\mathbf{s}) = K_2(\mathbf{z}^T\mathbf{s}) = \sum_{j=1}^{n} z_j^4\,K_2(s_j). \qquad (3.9)$$

The extremal points of equation 3.9 with respect to z under the constraint E[(m · d)²] = 1 occur when one component zⱼ of z is ±1 and all the rest are zero (Delfosse & Loubaton, 1995). In other words, finding the extremal points of kurtosis leads to projections where m · d ≡ mᵀd = zᵀs equals, up to a sign, a single component sⱼ of s. Thus, finding the extrema of kurtosis of the projections enables the estimation of the independent components individually, rather than all at once, as is done by other ICA rules. A full ICA code could be developed by introducing a lateral inhibition network, for example, but we restrict ourselves to the single-neuron case here for simplicity.

Maximizing K₂ under the constraint E[(m · d)²] = 1, and defining the covariance matrix of the inputs C = E[ddᵀ], yields the following learning rule:

$$\mathbf{m} = \frac{2}{\lambda}\left(C^{-1}E\big[\mathbf{d}(\mathbf{m}\cdot\mathbf{d})^3\big] - 3\mathbf{m}\right). \qquad (3.10)$$
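A rough sketch of the fixed-point iteration suggested by equation 3.10, under the usual simplifications: the inputs are whitened so that C⁻¹ drops out, and the constraint E[(m · d)²] = 1 is enforced by renormalizing m at each step. The Laplacian sources and the recovery check are illustrative assumptions, not details taken from this article.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 5000, 4
S = rng.laplace(size=(n, k))                       # kurtotic independent sources
A = rng.normal(size=(k, k))                        # unknown mixing matrix
D = S @ A.T                                        # observed mixtures, d = As

C = np.cov(D, rowvar=False)                        # C = E[d d^T]
w, U = np.linalg.eigh(C)
D = D @ (U @ np.diag(w ** -0.5) @ U.T)             # whiten so that cov(D) ~ I

m = rng.normal(size=k)
m /= np.linalg.norm(m)
for _ in range(100):
    c = D @ m
    m = (D * c[:, None] ** 3).mean(axis=0) - 3 * m # E[d (m.d)^3] - 3m
    m /= np.linalg.norm(m)                         # project back onto E[(m.d)^2] = 1

corrs = [np.corrcoef(S[:, j], D @ m)[0, 1] for j in range(k)]
print(np.round(corrs, 2))                          # one entry near +/-1: a single source
```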
This equation leads to an iterative fixed-point algorithm of the kind sketched above, which converges very quickly and works for both single-cell and network implementations (Hyvarinen & Oja, 1996).

Quadratic BCM. The quadratic BCM (QBCM) measure as given in Intrator and Cooper (1992) is of the form

$$Q_{BCM} = \frac{1}{3}E[c^3] - \frac{1}{4}E^2[c^2]. \qquad (3.11)$$
Maximizing this form using gradient ascent gives the learning rule

$$\nabla Q_{BCM} = E\big[(c^2 - cE[c^2])\,\sigma'\,\mathbf{d}\big] = E\big[c\,(c - \Theta_M)\,\sigma'\,\mathbf{d}\big]. \qquad (3.12)$$

Unlike the measures S2 and K2 above, the QBCM rule does not require any additional stabilization. This turns out to be an important property, since additional information can then be transmitted using the resulting norm of the weight vector m (Intrator, 1996).

4 Methods

We use 13 × 13 circular patches from 12 images of natural scenes as the visual environment. Two different types of preprocessing of the images are used for each of the learning rules. The first is a difference of gaussians (DOG) filter, which is commonly used to model the processing done in the retina (Law & Cooper, 1994). The second is a whitening filter, used to eliminate the second-order correlations (Oja, 1995; Bell & Sejnowski, 1995). Whitening the data in this way allows one to use learning rules that are dependent on higher moments of the data but are particularly sensitive to the second moment.

At each iteration of the learning, a patch is taken from the preprocessed (either DOGed or whitened) images and presented to the neuron. The moments of the output, c, are calculated iteratively using

$$E[c^n(t)] = \frac{1}{\tau}\int_{-\infty}^{t} c^n(t')\,e^{-(t-t')/\tau}\,dt'.$$

In the cases where the learning rule is underconstrained (i.e., K2 and S2), we also normalize the weights at each iteration. For Oja's fixed-point algorithm, the learning was done in batches of 1000 patterns over which the expectation values were computed; the covariance matrix, however, was calculated over the entire set of input patterns.
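A minimal sketch of the resulting single-neuron QBCM training loop, combining the stochastic form of equation 3.12 with the iterative moment estimate above. The random input "patches," the particular smooth rectifier, and the learning parameters are illustrative stand-ins, not the paper's actual pipeline (which uses DOGed or whitened natural-image patches).

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs = 13 * 13                                 # 13 x 13 circular patch, flattened
m = rng.normal(scale=0.01, size=n_inputs)          # synaptic weight vector
theta_M = 1.0                                      # moving threshold, Theta_M ~ E[c^2]
eta, tau = 1e-4, 100.0                             # learning rate and averaging time

def sigma(x):
    # Smooth, monotone rectifier; allows a slight negative output, as in the text.
    return np.logaddexp(0.0, x) - np.log(2.0)

def sigma_prime(x):
    return 1.0 / (1.0 + np.exp(-x))

for t in range(100000):
    d = rng.standard_normal(n_inputs)              # stand-in for an image patch
    u = m @ d
    c = sigma(u)
    theta_M += (c**2 - theta_M) / tau              # running estimate of E[c^2]
    m += eta * c * (c - theta_M) * sigma_prime(u) * d   # QBCM update (eq. 3.12)
```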
5 Results

5.1 Receptive Fields. The resulting receptive fields (RFs) are shown in Figures 1 and 2 for both the DOGed and whitened images, respectively. Each RF shown was achieved using different random initial conditions for the weights. Every learning rule developed oriented RFs, though some were more sensitive to the preprocessing than others. The additive versions of kurtosis and skewness, K2 and S2, respectively, developed significantly different RFs in the whitened environment compared with the DOGed environment: the RFs in the whitened environment had higher spatial frequency and sampled from more orientations than the RFs in the DOGed environment. This behavior, as well as the resemblance of the RFs in the DOGed environment to those obtained from PCA (Shouval & Liu, 1996), suggests that these measures have a strong dependence on the second moment. The multiplicative versions of kurtosis and skewness, K1 and S1 respectively, as well as QBCM, sampled from many orientations regardless of the preprocessing. The multiplicative skewness rule, S1, gives RFs with lower spatial frequencies than either QBCM or the multiplicative kurtosis rule; this difference also disappears with the whitened inputs, which implies that the spatial frequency of the RFs is related to the strength of the dependence of the learning rule on the second moment. Example RFs using Oja's fixed-point ICA algorithm are also shown in Figure 2; not surprisingly, they look qualitatively similar to those found using the stochastic maximization of additive kurtosis, K2.

The log of the output distributions for all of the rules has the double linear form, which implies a double exponential distribution. This distribution is one that we would consider sparse, but it would be difficult to compare the sparseness of the distributions merely on the appearance of the output distribution alone. In order to determine the sparseness of the code, we introduce a method for measuring it directly.

5.2 Structure Removal: Sensitivity to Outliers. Learning rules that are dependent on large polynomial moments, such as QBCM and kurtosis, tend to be sensitive to the tails of the distribution. This property implies that neurons are highly responsive and sensitive to the outliers, and consequently leads to a sparse coding of the input signal. Oversensitivity to outliers is considered to be undesirable in the statistical literature. However, in the case of a sparse code, the outliers, or the rare and interesting events, are what is important. The degree to which the neurons form a sparse code determines how much of the input distribution is required for maintaining the RF. This can be done in a straightforward and systematic fashion. The procedure involves simply eliminating from the environment those patterns for which the neuron responds strongly. An example RF and some of the patterns that give that neuron strong responses are shown in Figure 3. These patterns tend to be the high-contrast edges and are thus the structure found in the image. The percentage of patterns that needs to be removed in order to cause a change in the RF gives a direct measure of the sparsity of the coding.
Figure 1: Receptive fields using DOGed image input obtained from learning rules maximizing (from top to bottom) the QBCM objective function, kurtosis (multiplicative), kurtosis (additive), skewness (multiplicative), and skewness (additive). Shown are five examples (left to right) from each learning rule, as well as the log of the normalized output distribution, before the application of the rectifying sigmoid.
The process of training a neuron, eliminating the patterns that yield high responses, and retraining can be done recursively to remove structure sequentially from the input environment and to pick out the most salient features in the environment. The results are shown in Figure 4. For QBCM and kurtosis, one need only eliminate less than one-half of a percent of the input patterns to change the RF significantly. The changes that one can observe are orientation, phase, and spatial frequency changes. This is a very small percentage of the environment, which suggests that the neuron is coding the information in a very sparse manner. For the skewness maximization rule, more than 5% of the patterns are needed to alter the RF properties, which implies a far less sparse coding.
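The recursive procedure is compact enough to summarize in code. In the sketch below, train is a stand-in for any of the learning rules above, and all names are hypothetical; for rules without the rectifying sigmoid, both tails of the response distribution would have to be removed:

import numpy as np

def structure_removal(train, patches, m0, frac=0.01, rounds=5):
    """Recursively train, delete the strongest-response patterns, retrain."""
    m, env, rfs = m0, patches, []
    for _ in range(rounds):
        m = train(env, m)                    # (re)train in current environment
        rfs.append(m.copy())
        c = env @ m                          # responses to every pattern
        cutoff = np.quantile(c, 1.0 - frac)  # top `frac` of responses
        env = env[c < cutoff]                # eliminate high-response patterns
    return rfs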
Figure 2: Receptive fields using whitened image input, obtained from learning rules maximizing (from top to bottom) the QBCM objective function, kurtosis (multiplicative), kurtosis (additive), skewness (multiplicative), skewness (additive), and Oja's ICA rule based on the additive kurtosis measure. Shown are five examples (left to right) from each learning rule, as well as the log of the normalized output distribution, before the application of the rectifying sigmoid.
To make this more precise, we introduce a normalized difference measure between two different RFs. If we take two weight vectors, m_1 and m_2, then the normalized difference between them is defined as

D \equiv \frac{1}{4} \left\| \frac{m_1 - \bar{m}_1}{\| m_1 \|} - \frac{m_2 - \bar{m}_2}{\| m_2 \|} \right\|^2   (5.1)
  = \frac{1}{2} (1 - \cos \alpha),   (5.2)

where \alpha is the angle between the two vectors and \bar{m}_i is the mean of the elements of the vector i.
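In code, the measure takes only a few lines (a sketch; note that the identity with equation 5.2 holds exactly when the norms are taken after mean subtraction):

import numpy as np

def rf_difference(m1, m2):
    """Normalized difference D between two RFs (equations 5.1 and 5.2)."""
    u = m1 - m1.mean()
    v = m2 - m2.mean()
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return 0.25 * np.sum((u - v) ** 2)  # equals (1 - cos(alpha)) / 2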
Figure 3: Patterns that yield high responses of a model neuron. The example receptive field is shown on the left. Some of the patterns that yield the strongest one-half percent of responses are labeled on the image on the right. These patterns are primarily the high-contrast edges.
This measure is not sensitive to scale differences, because the vectors are divided by their norms, and it is not sensitive to scalar offset differences, because the means of the vectors are subtracted. The measure has a value of zero for identical vectors and a maximum value of one for orthogonal vectors. Figure 5 shows the normalized difference as a function of the percentage of patterns eliminated for the different learning rules. Differences can be seen with as little as a tenth of a percent removed, but only changes of around a half-percent or above are visible as significant orientation, phase, or spatial frequency changes. Although both skewness and QBCM depend primarily on the third moment, QBCM behaves more like kurtosis with regard to projections from natural images. Similar changes occur for both the BCM and kurtosis learning rules and most likely occur under other rules that seek kurtotic projections. It is important to note, however, that patterns must be eliminated from both sides of the distribution for any rule that does not use the rectifying sigmoid, because the strong negative responses carry as much structure as the strong positive ones. Such responses are not biologically plausible, so they would not be part of the encoding process in real neurons. It is also interesting to observe that the RF found after structure removal initially has the same orientation but a different spatial phase or possibly a different position.
Figure 4: Receptive fields resulting from structure removal using the QBCM rule and the rules maximizing the multiplicative forms of kurtosis and skewness. The RF on the far left for each rule was obtained in the normal input environment. The next RF to the right was obtained in a reduced input environment from which the patterns yielding the strongest 1% of responses of the RF to the left were deleted. This process was continued for each RF from left to right, yielding a final removal of about 5% of the input patterns.
Once enough input patterns are removed, the RF becomes oriented in a different direction. If the process were continued, all of the orientations and spatial locations would be obtained. An objection may be made that the RFs formed are caused almost entirely by the application of the rectifying sigmoid. For odd-powered learning rules, the sigmoid is necessary to obtain oriented RFs because the distributions are approximately symmetric. This sigmoid is not needed for rules dependent on only the even-powered moments, such as kurtosis. Figure 6 demonstrates that the removal of the sigmoid and the removal of the mean from the moment calculations do not substantially affect the resulting RFs of the kurtosis rules. The choice of 13 × 13 pixel RFs was biologically motivated and computationally less demanding than the formation of larger RFs. Figure 7 shows some 21 × 21 pixel RFs; it is clear that little difference is observed.
[Figure 5 here: the normalized difference in RF (0 to 0.3) plotted against the percentage of patterns removed, on a logarithmic scale from 10^{-2} to 10^{1}, with curves for BCM, Kurtosis 1, and Skew 1.]
Figure 5: Normalized difference between RFs as a function of the percentage deleted in structure removal. The RFs were normalized and made zero mean in order to neglect magnitude and additive constant changes. The maximum possible value of the difference is 1.
One may notice that the length of the RFs is larger than the lengths of the RFs found using Olshausen and Field's or Bell and Sejnowski's algorithms. It is likely that events causing such elongated RFs are very rare and thus lead to higher kurtosis. When additional constraints are imposed, such as finding a complete code, one ends up with less specialized and, thus, less elongated RFs.

6 Discussion

This study compares several learning rules that have some statistical or biological motivation, or both. (For a related study discussing projection pursuit and BCM, see Press and Lee, 1997.) We have used natural scenes to gain more insight into the statistics underlying natural images. There are several outcomes from this study:

• All of the rules used found kurtotic distributions. This should not come as a surprise, because there are suggestions that a large family of linear filters can find kurtotic distributions (Ruderman, 1994).
Figure 6: Receptive fields using DOGed image input, obtained from learning rules maximizing (from top to bottom) multiplicative-form kurtosis with rectified outputs, nonrectified outputs, and nonrectified outputs with centered moments, respectively, and additive-form kurtosis with rectified outputs, nonrectified outputs, and nonrectified outputs with centered moments, respectively. Shown are five examples (left to right) from each learning rule and the corresponding output distribution.
• The single-cell ICA rule we considered used the additive form of kurtosis, K2, as a measure of deviation from gaussian distributions and achieved RFs qualitatively similar to those of the other rules discussed.
Figure 7: RFs using DOGed image input, obtained from the QBCM learning rule and the rule maximizing the multiplicative form of kurtosis.
• The QBCM rule and the multiplicative version of kurtosis are less sensitive to the second moments of the distribution and produce oriented RFs even when the data are not whitened. This is clear from the results on DOG-processed versus whitened inputs. The reduced sensitivity follows from the built-in second-order normalization that these rules have: kurtosis via division and BCM via subtraction. The additive version of kurtosis is sensitive to the second moment and produces oriented RFs only after sphering the data (Friedman, 1987; Field, 1994).

• Both QBCM and kurtosis are sensitive to the tails of the distribution. In fact, the RF changes upon elimination of the upper 1/2% portion of the distribution (see Figure 4). The change in RF is gradual: at first, removal of some of the inputs results in RFs that have the same orientation but a different phase; once more patterns from the upper portion of the distribution are removed, different RF orientations are found. This finding gives some indication of the kind of inputs the cell is most selective to (values below its highest 99% selectivity); these are inputs with the same orientation but different phase (different locality of RF). The sensitivity to small portions of the distribution represents the other side of the coin of sparse coding. It should be studied further, as it may reflect some fundamental instability of the kurtotic approaches.

• The skewness rules can also find oriented RFs. Their sensitivity to the upper parts of the distribution is not as dramatic, and thus the RFs do not change much when a few percent of the upper distribution are removed.

• Kurtotic rules can find high kurtosis in either symmetric or rectified distributions. This is not the case for the QBCM rule, which requires rectified distributions.
• The QBCM learning rule, which has been advocated as a projection index for finding multimodality in high-dimensional distributions, can find projections emphasizing high kurtosis when no cluster structure is present in the data. We have preliminary indications that the converse is not true: the kurtosis measure does not perform well under distributions that are bi- or multimodal. This will be shown elsewhere.

Acknowledgments

This work was supported by the Office of Naval Research, the DANA Foundation, and the National Science Foundation.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Baddeley, R. J. (1996). An efficient code in V1? Nature (News and Views), 381, 560–561.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximisation approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Daugman, J. G. (1988). Complete discrete 2D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on ASSP, 36, 1169–1179.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Diaconis, P., & Freedman, D. (1984). Asymptotics of graphical projection pursuit. Annals of Statistics, 12, 793–815.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23, 881–889.
Huber, P. J. (1985). Projection pursuit (with discussion). Annals of Statistics, 13, 435–475.
Hyvarinen, A., & Oja, E. (1996). A fast fixed-point algorithm for independent component analysis. Int. Journal of Neural Systems, 7(6), 671–687.
Hyvarinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.
Intrator, N. (1990). A neural network for feature extraction. In D. S. Touretzky & R. P. Lippmann (Eds.), Advances in neural information processing systems, 2 (pp. 719–726). San Mateo, CA: Morgan Kaufmann.
Intrator, N. (1996). Neuronal goals: Efficient coding and coincidence detection. In S. Amari, L. Xu, L. W. Chan, I. King, & K. S. Leung (Eds.), Proceedings of ICONIP Hong Kong: Progress in Neural Information Processing (Vol. 1, pp. 29–34). Berlin: Springer-Verlag.
Intrator, N., & Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5, 3–17.
Jones, M. C., & Sibson, R. (1987). What is projection pursuit? (with discussion). J. Roy. Statist. Soc., Ser. A, 150, 1–36.
Kendall, M., & Stuart, A. (1977). The advanced theory of statistics. New York: Macmillan.
Law, C., & Cooper, L. (1994). Formation of receptive fields according to the BCM theory in realistic visual environments. Proceedings of the National Academy of Sciences, 91, 7797–7801.
Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Computation, 8, 531–543.
Oja, E. (1995). The nonlinear PCA learning rule and signal separation—mathematical analysis (Tech. Rep. No. A26). Helsinki: Helsinki University, CS and Inf. Sci. Lab.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Press, W., & Lee, C. W. (1997). Searching for optimal value codes: Projection pursuit analysis of the statistical structure in natural scenes. In The neurobiology of computation: Proceedings of the Fifth Annual Computation and Neural Systems Conference (pp. 771–777). New York: Plenum.
Ruderman, D. L. (1994). The statistics of natural images. Network, 5(4), 517–548.
Shouval, H., & Liu, Y. (1996). Principal component neurons in a realistic visual environment. Network, 7(3), 501–515.
Zetzsche, C. (1997). Intrinsic dimensionality: How biological mechanisms exploit the statistical structure of natural images. Talk given at the Natural Scene Statistics Meeting, Hancock, MA.

Received July 22, 1997; accepted March 2, 1998.
LETTER
Communicated by Gary Cottrell
Neural Feature Abstraction from Judgments of Similarity Michael D. Lee Communications Division, Defence Science and Technology Organisation, Salisbury, South Australia, 5108
The common neural network modeling practice of representing the elements of a task domain in terms of a set of features lacks justification if the features are derived through some form of ad hoc preabstraction. By examining a featural similarity model related to established multidimensional scaling techniques, a neural network is developed that generates features from similarity data and attaches weights to these features. The network performs a constrained search of a continuous solution space to determine the features and uses a previously developed regularization technique to minimize the number of features it derives. The network is demonstrated on artificial data, from which it recovers known features and weights, and on two real data sets involving the similarity of a set of geometric shapes and the abstract conceptual similarities of the 10 Arabic numerals. On the basis of these results, the relationship between the multidimensional scaling approach adopted by the network and an alternative additive clustering approach to feature extraction is discussed. 1 Introduction Many neural network models, particularly cognitive models, rely on a set of preabstracted features to describe the elements of their task domain. Shanks’s (1991) disease diagnosis model, for example, characterizes patients in terms of symptoms such as “sore gums,” “dizziness,” and “puffy eyes,” and the ARTMAP model of mushroom classification (Carpenter, Grossberg, & Reynolds, 1991) represents each mushroom as a bit string indicating the presence or absence of “fishy odor,” “conical cap shape,” and a range of other features. Unfortunately, the apparent simplicity and directness of this approach belies its limitations. In terms of the relationship between featural representation and cognitive modeling success, Smolensky (1987, cited by Smolensky, 1988) notes that “a poor representation will often doom the model to failure, and an excessively generous representation may essentially solve the problem in advance” (p. 69). Hertz, Krogh, and Palmer (1991, p. 144) make the same point in a more general context by demonstrating that the most primitive neural network can learn to solve the most difficult problem Neural Computation 10, 1815–1830 (1998)
© 1998 by the Commonwealth of Australia
given an appropriate set of preprocessors. Clearly, the act, on the part of a modeler, of abstracting a set of relevant features from a domain constitutes a powerful form of preprocessing and can make an inappropriately large contribution to modeling success. For this reason, Komatsu (1992) argues that processes of featural selection often "bear the bulk of the explanatory burden" (p. 514). Brooks (1991) identifies the problem more bluntly: "This abstraction is the essence of intelligence and the hard part of the problems being solved" (p. 143). Rather than ad hoc extraction, therefore, what is required is a principled means by which a set of relevant features from a task domain can be explicitly and objectively generated.

One way in which this might be achieved is suggested by a consideration of neural network models such as ALCOVE (Kruschke, 1992), which are founded on representations constructed by the multidimensional scaling of measures of psychological similarity (see Shepard, 1980, for an overview). There is significant theoretical and empirical impetus for accepting the veracity of multidimensionally scaled representations as models of human mental structure (Shepard, 1987a; see also Edelman & Intrator, 1997; Shepard, 1987b, 1994) but, most importantly, "The multidimensional scaling representation constrains the model. . . . We are not allowed to assume any input representation that happens to be convenient" (Kruschke, 1993, p. 5).

Of particular interest here is ALCOVE's employment of featural binary representations in modeling human performance on the seminal classification task explored by Shepard, Hovland, and Jenkins (1961). This task involves a stimulus set consisting of geometric shapes naturally characterized in terms of a number of dimensions, such as shape, color, and size, which allow only binary distinctions (e.g., square versus circle, dark versus light, and large versus small). Within this approach, as depicted in Figure 1, each element may be represented by a vertex of a K-dimensional (hyper)cube, where the various coordinate axes correspond to the featural dimensions of the stimulus set. As is also shown in Figure 1, these featural dimensions are subject to processes of selective attention, which serve to weight the various dimensions. Following the empirical success of Nosofsky's (1984, 1986) generalized context model, and the theoretical results reported by Shepard (1987a), ALCOVE models the similarity between any two elements as an exponential decay function of the city-block distance between the relevant two vertices, as follows:

\hat{s}_{ij} = \exp\left( -\sigma \sum_{k=1}^{K} w_k \left| p_k^i - p_k^j \right| \right),   (1.1)
where p_k^i is a binary variable indicating the value of element i with regard to dimension k, w_k is the (nonnegative) attentional weight assigned to dimension k, and σ is a parameter controlling the resolution or specificity of the representation. This resolution parameter is required in ALCOVE because
Figure 1: Geometric interpretation of the featural representation employed in ALCOVE.
the weights are constrained to sum to unity to model the capacity limitations of attentional processes. From a computational perspective, however, if the weights are not subject to this constraint, the resolution is implicitly defined by the norm of the derived weights, and the need for its explicit inclusion as a parameter is removed. Therefore, in the representational context provided by ALCOVE, the task of feature abstraction for neural networks is one of determining the minimum number of binary variables p_k^i and weights w_k that satisfactorily account for a given similarity matrix, under the similarity model (see equation 1.1). Unfortunately, applications of ALCOVE and the generalized context model that require featural representations do not explicitly undertake the multidimensional scaling analysis on which all other applications are founded. Instead, the intuitively obvious form for the binary variables is simply assumed, and only the attentional weights are learned. Presumably, this shortcoming results from the fact that traditional multidimensional scaling algorithms cannot be constrained to produce binary representations of the type depicted in Figure 1. The aim of this article, therefore, is to describe and evaluate a neural network model designed to generate discrete featural representations from judgments of similarity, using the empirically successful and theoretically well-motivated similarity model of equation 1.1.
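As a concrete reference point for what follows, the similarity model of equation 1.1 amounts to the following (a sketch with illustrative names):

import numpy as np

def similarity(p_i, p_j, w, sigma=1.0):
    """Estimated similarity of equation 1.1.
    p_i, p_j: binary feature vectors; w: nonnegative attentional weights."""
    return np.exp(-sigma * np.sum(w * np.abs(p_i - p_j)))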
Figure 2: The architecture and nomenclature of the feature abstraction network.
2 Formulation of the Model

The architecture and nomenclature of the network are shown in Figure 2. On each trial, a random element, m, is presented by activating the mth input unit and leaving all other input units inactive. This presentation, through the connection weight matrix C, passes the value c_km to the kth internal representation unit. The activation function employed in these units is a parameterized sigmoid, f_α, which results in the final internal representation value:

p_k^m = f_\alpha(c_{km}) = \frac{1}{1 + e^{-\alpha c_{km}}}.   (2.1)
The estimated similarity of every element to the presented element is then calculated according to equation 1.1 and stored in the appropriate unit in the current similarity layer. Under the geometric interpretation depicted in Figure 1, the vector of activation values across the internal representation layer, [p_1^m, . . . , p_K^m], corresponds to the current representational point for the presented stimulus. In this context, the means by which the current similarities are generated from the internal representation may be conceived in terms of a radial basis function linkage between the two layers (Poggio & Girosi, 1990; Lowe, 1995). The estimated similarity, ŝ_i, between element i and presented element m given by equation 1.1 corresponds to the evaluation of the basis function g(D) = exp(−D), where D is the saliency-weighted city-block distance between [p_1^i, . . . , p_K^i] and [p_1^m, . . . , p_K^m].
The role of the parameter α, which is increased by an amount Δα after each trial, is also readily understood in geometric terms. When α = 0, the value of each p_k^m is one-half, independent of c_km. As α increases, however, p_k^m becomes dependent on c_km, through a sigmoid relationship that gradually changes from a gentle near-linear slope toward a progressively sharper approximation to a step function. Initially, a presented element is internally represented in the center of the (hyper)cube illustrated in Figure 1. As α begins to increase, this representational point is able to assume essentially any position within the (hyper)cube, but as α further increases, the point is forced toward the vertices, since each of its dimensional values becomes limited toward being either 0 or 1. Thus, the continual increase of α effectively constrains the exploration of a continuous solution space in the search for optimal discrete featural solutions. In this sense, the approach is similar to that employed in the Hopfield-style optimization network developed by Mehta and Fulop (1993) and the family of multidimensional scaling techniques described by Schneider (1992).¹

¹ Schneider's (1992) application of the parameterized sigmoid involves establishing a continuum between metric and nonmetric multidimensional scaling approaches and does not consider the issue of developing featural, rather than continuous, representations.

The learning rules that guide the modification of the internal representations operate on the weights in C and the weights w_k, and are derived by applying gradient descent optimization principles to an error measure consisting of two terms. The first term, called the similarity error, assesses the difference between the estimated pattern of interelement similarities and the corresponding target similarities, and is given by

E_s = \frac{1}{2} \sum_{j=1}^{N} (\hat{s}_j - s_j)^2.   (2.2)
The second term, called the weight error, is employed to distinguish between the plethora of featural bases that could potentially provide a satisfactory account for a given similarity matrix. In particular, the requirement that a minimal number of features be extracted from the similarity ratings means that if two prospective solutions fare equally well in minimizing the similarity error, then the solution with the fewest significantly nonzero weights is to be preferred. One way in which the network can be encouraged to derive such solutions is through the addition of a regularizing or penalty term (see Ash & Cottrell, 1995, for an overview), which effectively embodies prior assumptions made about the distribution of weights (MacKay, 1992). A reasonable set of prior assumptions in this instance is that if a large number of potential features are considered, most should be given no weight, while the significant remaining weights are uniformly distributed. The weight error therefore adopts the form of a previously developed regularizing term (Weigend, Rumelhart, & Huberman, 1991), which essentially embodies these assumptions (see Nowlan & Hinton, 1992, p. 476), and is given by

E_w = \sum_{k=1}^{K} \frac{w_k^2}{\varepsilon^2 + w_k^2},   (2.3)
where ε is a scaling parameter. The total error measure is simply the sum of the similarity and weight error measures, with a factor λ that controls the relative emphasis given to regularization:

E = E_s + \lambda E_w.   (2.4)
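A sketch of the error of equations 2.2 through 2.4 (names are illustrative):

import numpy as np

def total_error(s_hat, s, w, lam, eps=0.01):
    """Equation 2.4: similarity error plus weighted weight error."""
    E_s = 0.5 * np.sum((s_hat - s) ** 2)  # equation 2.2
    E_w = np.sum(w**2 / (eps**2 + w**2))  # equation 2.3
    return E_s + lam * E_w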
Having defined the total error, the required partial derivatives for the specification of the learning rules are

\frac{\partial E_s}{\partial c_{im}} = \sum_{j=1}^{N} \frac{\partial E_s}{\partial \hat{s}_j} \times \frac{\partial \hat{s}_j}{\partial p_i^m} \times \frac{\partial p_i^m}{\partial c_{im}}
 = \sum_{j=1}^{N} (\hat{s}_j - s_j) \left( -w_i \, \mathrm{sgn}(p_i^m - p_i^j) \right) \left( \alpha p_i^m (1 - p_i^m) \right)   (2.5)

\frac{\partial E_s}{\partial w_i} = \sum_{j=1}^{N} \frac{\partial E_s}{\partial \hat{s}_j} \times \frac{\partial \hat{s}_j}{\partial w_i}
 = \sum_{j=1}^{N} (\hat{s}_j - s_j) \left( -\left| p_i^m - p_i^j \right| \right)   (2.6)

\frac{\partial E_w}{\partial w_i} = \frac{2 \varepsilon^2 w_i}{(\varepsilon^2 + w_i^2)^2},   (2.7)

which result in

c_{im}^{new} = c_{im}^{old} + \lambda_c w_i \alpha p_i^m (1 - p_i^m) \sum_{j=1}^{N} (\hat{s}_j - s_j) \, \mathrm{sgn}(p_i^m - p_i^j)   (2.8)

w_i^{new} = w_i^{old} + \lambda_w \sum_{j=1}^{N} (\hat{s}_j - s_j) \left| p_i^m - p_i^j \right| - \frac{2 \lambda \varepsilon^2 w_i}{(\varepsilon^2 + w_i^2)^2},   (2.9)

where λ_c and λ_w are learning-rate parameters.
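A single trial of the network can then be sketched as follows, applying equations 2.1, 1.1, 2.8, and 2.9 as printed; the matrix layout (features by elements) and all names are assumptions of this illustration, and the adaptive tuning of λ described below is omitted:

import numpy as np

def trial(C, w, m, s_target, alpha, lam, l_c=0.05, l_w=0.05, eps=0.01):
    """One presentation of element m to the feature abstraction network.
    C: K x N connection weights; w: K feature weights;
    s_target: target similarities of all N elements to element m."""
    P = 1.0 / (1.0 + np.exp(-alpha * C))   # internal representations, eq. 2.1
    dist = np.abs(P[:, [m]] - P)           # |p_i^m - p_i^j|, K x N
    s_hat = np.exp(-(w @ dist))            # estimated similarities, eq. 1.1
    err = s_hat - s_target
    p_m = P[:, m].copy()
    # equation 2.8: update the incoming weights of the presented element
    C[:, m] += l_c * w * alpha * p_m * (1 - p_m) * (np.sign(P[:, [m]] - P) @ err)
    # equation 2.9: similarity term plus weight-elimination penalty
    w = w + l_w * (dist @ err) - 2 * lam * eps**2 * w / (eps**2 + w**2) ** 2
    return C, np.clip(w, 0.0, None)        # attentional weights kept nonnegative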
Through the appropriate setting of ε, the functional form of equation 2.3 tends to prevent the same feature being duplicated in a derived solution. That is, by making ε sufficiently small, a solution containing two features with identical membership patterns has a greater error than a solution where only one such feature exists (see Weigend, Rumelhart, & Huberman, 1991, Figure 1c). This property is particularly important in eliminating potential features with exactly opposite patterns of element membership (the poles in Hamming space), which redundantly represent the same information. Meanwhile, the choice of weight error permits the direct adoption of the successful adaptive tuning techniques for the λ parameter detailed by Weigend, Rumelhart, and Huberman (1991). Specifically, λ is initially set to zero and is increased by Δλ whenever the total error measure falls, is decreased by Δλ whenever the total error rises but remains less than an exponentially weighted average error, and is set to 0.9λ when the total error rises and is above this average.

3 Demonstrations of the Model

A first test of the network involved artificially generating a similarity matrix by the similarity model (see equation 1.1) using known features and attentional weights, and examining the network's ability to recover this information. Four elements and three features were employed, with the feature vectors for the elements being [1 1 1], [1 1 0], [0 0 1], and [1 0 0], and the weights for the features being 0.5, 0.3, and 0.2. Accordingly, the network consisted of four units in each of the input, current similarity, and target similarity layers. A value of 10 was arbitrarily chosen to be sufficient to overestimate the number of features that ultimately would be derived, and 10 units were placed in the internal representation layer. Each connection weight in C was independently set to a random number chosen uniformly from the interval [−0.5, 0.5], and the 10 weights were all initialized to 1/10. The parameter values used were λ_c = λ_w = Δα = 0.05, ε = 0.01, and Δλ = 0.001.

The patterns of change of the base 10 logarithm of the similarity error and the weight values across three separate simulations of the model are shown in Figures 3a through 3c. In all cases, the similarity error declines to a small value that is maintained, and 3 of the potential 10 features are retained, assuming the appropriate weights of 0.5, 0.3, and 0.2. Furthermore, an examination of the internal representation of the four elements in terms of these three features revealed that the original patterns of membership were recovered by the network. Across a total of 50 such simulations, 44 (88%) successful recoveries were achieved. Unsuccessful simulations were characterized by the retention of too few, if any, of the available featural dimensions, presumably because the regime for the modification of λ allowed an overly large emphasis to be given to regularization at some stage during the network's evolution. In any case, Figure 3a is characteristic of the model's behavior in the
Figure 3: Model behavior in reconstructing known features and weights. (a– c) Ten potential features. (d) Twenty potential features. (e) Five potential features. (f) Different weights.
approximately half of these successful simulations, in that stability is achieved within a few hundred trials. Figures 3b and 3c are representative of the model's performance in the remaining half, which required several thousand trials before a final representation was derived.²

² Although not implemented, the behavior of the model in all of the simulations reported here suggests that an appropriate stopping rule could be formulated in terms of the stability of the similarity error measure. The development of such a rule is a worthwhile topic for future research.

The behavior depicted in Figure 3c is particularly noteworthy for its demonstration of the regularization component of the learning rules removing a significantly weighted, but redundant, featural dimension. Figures 3d and 3e depict typical model behavior when the original number of features available to the network is changed to 20 and 5, respectively. Once again, only 3 features are ultimately retained, and appropriate weights and feature vectors are recovered. Finally, Figure 3f gives an example of the model's behavior when the predetermined weights were set to 1.5, 0.9, and 0.6, corresponding, in terms of the similarity model in equation 1.1, to changing the resolution σ from 1 to 3. Other simulations, involving different numbers of objects and different feature vectors, also suggest that the model is relatively successful in recovering known features and weights, particularly if multiple trials are employed. It is important to acknowledge, however, that the computational performance of the network is significantly affected by increases in the number of objects or potential features and may be unsuitable for particularly complicated task domains involving on the order of hundreds of objects or features.

A second test of the network involved similarity data reported by Glushko (1975, Table 1b) on ratings of the "goodness" of the 17 geometric patterns shown in Figure 4. The network constructed to recover features from this similarity matrix contained 17 units in the input, current similarity, and target similarity layers and 10 internal representation units, and was otherwise identical to the previous network. Figures 5a and 5b display the behavior of the network over 1000 trials for two separate simulations, which are typical of a total of 10 conducted. In all cases, two features with weights significantly greater than the others were extracted by the network, and an examination of the learned internal representations revealed that these two features partitioned the shapes into three sets: S1 = {1, 2}, S2 = {3, 4, 5, 6, 7, 8, 9, 11}, and S3 = {10, 12, 13, 14, 15, 16, 17}. This was achieved, for example, in Figure 5a by one feature making a binary distinction between the shapes {1, 2} and {3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17}, while the second feature partitioned shapes {1, 2, 3, 4, 5, 6, 7, 8, 9, 11} from {10, 12, 13, 14, 15, 16, 17}.

A reasonably satisfying explanation of these results is that the features correspond to symmetry properties of the geometric shapes. Both members of S1 are invariant under rotations of 90, 180, and 270 degrees, while
Figure 4: The 17 geometric patterns used by Glushko (1975).
Figure 5: Model behavior in abstracting features from the geometric shape similarity data.
the shapes in S2, with the exception of 11, are invariant under a rotation of 180 degrees. Meanwhile, all of the shapes in S3, with the exception of 10, vary under all of these rotations. Effectively, therefore, the two features divide the 17 elements into three sets on the basis of their symmetry properties. This result is encouraging, since the notion that symmetry plays a fundamental role in the perception of the “goodness” of shapes is one with
both a longstanding and accepted currency within psychology (Attneave, 1954; Garner, 1974; Leyton, 1992; Palmer, 1991; Shepard, 1994). Moreover, in terms of the misclassification of elements 10 and 11, it is comforting to note that previous analysis of the same similarity data (Borg & Lingoes, 1987, pp. 206–208; see also Glushko, 1975, p. 161) encountered precisely the same difficulty, suggesting that responsibility may well lie with the data rather than the feature extraction network. A final test of the network involved similarity data relating to the 10 Arabic numerals 0, 1, . . . , 9, generated by pooling judgments of their abstract conceptual similarity, obtained by Shepard, Kilpatrick, and Cunningham (1975), across three conditions of stimulus presentation. Figure 6 shows the behavior of the model across four separate simulations, using the same parameter values, and again with 10 potential feature units in the internal representation layer. In all cases, one feature is derived with a significantly larger weight (ranging from 0.77 to 0.93), with a succession of five or six features with smaller weights also being extracted. An examination of the learned internal representation reveals that the feature with greatest weight separates the numerals 1 through 9 from 0, effectively corresponding to the distinction between so-called counting and noncounting numbers. In all four cases, the feature associated with the second greatest weight (ranging from 0.42 to 0.44) distinguished the smaller numbers 0 through 5 from the larger numbers 6 through 9, and the feature associated with the third greatest weight (ranging from 0.31 to 0.36) generally partitioned powers of two—2, 4, 8—from the remaining numbers. The only exception in this regard was one instance in which 1 (correctly) and 0 (incorrectly) were treated as powers of two. The remaining features with nonnegligible weights in Figure 6 were less consistently extracted and were generally less amenable to meaningful interpretation, although a feature distinguishing odd from even, or doing so with one error, was produced in three of the four simulations.
4 Discussion

The defining basis of the feature abstraction network is the similarity model (see equation 1.1), which essentially constitutes a discrete or featural variant of the multidimensional scaling approach to measuring similarity. It is therefore interesting to compare the network's performance with established additive clustering techniques for data representation (Shepard, 1980; Shepard & Arabie, 1979; Tenenbaum, 1996), which attempt to identify underlying featural structures using an alternative similarity model. Under the additive clustering approach, a given measure of the similarity, s_ij, of elements i and j in the task domain is considered to have arisen from a weighted summation of the (potentially overlapping) classes to which both
Figure 6: Model behavior in abstracting features from the Arabic numerals similarity data.
belong and can therefore be estimated as follows:

\hat{s}_{ij} = \sum_{k=1}^{K} w_k p_k^i p_k^j + c,   (4.1)

where c is a constant.
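For comparison with the sketch of equation 1.1 given earlier, the additive clustering estimate is simply a weighted count of shared classes (illustrative names):

import numpy as np

def additive_similarity(p_i, p_j, w, c=0.0):
    """Equation 4.1: weighted sum over classes to which both elements belong."""
    return np.sum(w * p_i * p_j) + c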
Clearly, the fundamental difference between the two approaches resides in the interpretation of the featural coding given by the binary variables. The additive clustering model ascribes a weighted measure of similarity to two objects if and only if both belong to a particular class, which formally corresponds to both having a featural value of +1. Under the distance-based multidimensional scaling approach, however, any featural comparison affords some degree of similarity between any two given objects—but having the same featural value, whether 0 or +1, serves to increase this degree. In this sense, additive clustering seeks features that describe local class structures existing within the stimulus domain, while
the multidimensional scaling approach adopted here seeks features representing global binary distinctions that can be drawn across the stimulus domain. In some domains, the additive clustering approach may imply less than parsimonious class structures. For example, if a feature such as gender happened to be important in similarity determination, it does not seem sensible to argue that i and j being male promotes similarity, but a and b not being male (i.e., being female) does not. Presumably, in these circumstances, the single nominal class gender would need to be modeled as the two classes, male and female. Indeed, previously reported additive clustering analyses (Shepard & Arabie, 1979; Tenenbaum, 1996) of the numeral similarity data employed above provide concrete examples of this phenomenon. Rather than establishing a single binary distinction between large and small numbers, for instance, these techniques typically recover two features: one corresponding to small numbers and the other corresponding to large numbers. The countering benefit of this representational inefficiency, however, resides in additive clustering's ability to capture small, salient subsets of stimuli. Within the numeral domain, for example, additive clustering analyses generate a feature comprising 3, 6, and 9, corresponding to the abstract concept "multiples of three." It is unlikely that the feature extraction network described here would be able to generate this distinction, since the remaining numerals—0, 1, 2, 4, 5, 7, and 8—do not seem to have a meaningful conceptual interpretation.

As confirming preliminary evidence that the difference between the network developed here and additive clustering techniques resides in their different underlying similarity models, rather than in the different optimization approaches they adopt, it is worth reporting the results of a reconfiguration of the feature extraction network. By replacing equation 1.1 with 4.1, and rederiving the learning rule (see equation 2.8) accordingly, the current optimization approach was applied to perform additive clustering in the numeral domain. The four features extracted with greatest weight were given by the sets {2, 4, 8} (weight 0.46), {0, 1, 2} (weight 0.36), {6, 7, 8, 9} (weight 0.36), and {3, 6, 9} (weight 0.23). These features and weights bear close comparison to those reported by Tenenbaum (1996, Table 1): {2, 4, 8} (weight 0.44), {0, 1, 2} (weight 0.35), {3, 6, 9} (weight 0.33), and {6, 7, 8, 9} (weight 0.30). In practice, Tenenbaum's (1996) technique might perhaps be preferred because its rigorous and principled foundation in the statistical field of multiple-cause learning affords possibilities, such as introducing prior biases on class sizes, not explicitly available in the present approach. The computational intensity of Tenenbaum's (1996) method—which involves 100 to 200 expectation-maximization iterations, with 10 to 100 cycles of recorded and 20 to 200 cycles of unrecorded Gibbs sampling within each iteration—suggests, however, that the current approach may provide a relatively more efficient means of producing comparable results. What the results of the reconfigured network certainly demonstrate is that the
important theoretical difference between the feature extraction network developed here and additive clustering approaches to feature extraction lies in the different similarity models they adopt.

Ultimately, the utility of the feature extraction network depends on the relationship between the target similarity values and the similarity model given in equation 1.1. If the similarity structures existing within a domain of interest are amenable to characterization in terms of a number of global binary distinctions, then the network provides an effective and principled technique for abstracting a parsimonious set of features and weights to describe the elements of the domain.

Acknowledgments

I thank Josh Tenenbaum for supplying the numeral similarity data and Gary Cottrell for helpful comments on an earlier version of the article.

References

Ash, T., & Cottrell, G. (1995). Topology-modifying neural network algorithms. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 990–993). Cambridge, MA: MIT Press.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61, 183–193.
Borg, I., & Lingoes, J. (1987). Multidimensional similarity structure analysis. New York: Springer-Verlag.
Brooks, R. A. (1991). Intelligence without representation. Artificial Intelligence, 47, 139–159.
Carpenter, G. A., Grossberg, S., & Reynolds, J. H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565–588.
Edelman, S., & Intrator, N. (1997). Learning as extraction of low-dimensional representations. In R. L. Goldstone, P. G. Schyns, & D. L. Medin (Eds.), The psychology of learning and motivation: Mechanisms of perceptual learning. New York: Academic Press.
Garner, W. R. (1974). The processing of information and structure. Potomac, MD: Erlbaum.
Glushko, R. J. (1975). Pattern goodness and redundancy revisited: Multidimensional scaling and hierarchical clustering analyses. Perception and Psychophysics, 17(2), 158–162.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computing. Redwood City, CA: Addison-Wesley.
Komatsu, L. K. (1992). Recent views of conceptual structure. Psychological Bulletin, 112(3), 500–526.
Kruschke, J. K. (1992). ALCOVE: An exemplar-based connectionist model of category learning. Psychological Review, 99(1), 22–44.
Kruschke, J. K. (1993). Human category learning: Implications for backpropagation models. Connection Science, 5, 3–36.
Leyton, M. (1992). Symmetry, causality, mind. Cambridge, MA: MIT Press.
Lowe, D. (1995). Radial basis function networks. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 779–782). Cambridge, MA: MIT Press.
MacKay, D. J. C. (1992). A practical Bayesian framework for backpropagation networks. Neural Computation, 4, 448–472.
Mehta, S., & Fulop, L. (1993). An analog neural network to solve the Hamiltonian cycle problem. Neural Networks, 6, 869–881.
Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, and Cognition, 10(1), 104–114.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115(1), 39–57.
Nowlan, S. J., & Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4, 473–493.
Palmer, S. E. (1991). Goodness, Gestalt, groups, and Garner. In J. R. Pomerantz & G. L. Lockhead (Eds.), The perception of structure: Essays in honor of Wendell R. Garner (pp. 23–39). Washington, DC: American Psychological Association.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78(9), 1481–1497.
Schneider, R. B. (1992). A uniform approach to multidimensional scaling. Journal of Classification, 9, 257–273.
Shanks, D. R. (1991). Categorization by a connectionist network. Journal of Experimental Psychology: Learning, Memory and Cognition, 17(3), 433–443.
Shepard, R. N. (1980). Multidimensional scaling, tree-fitting, and clustering. Science, 210, 390–398.
Shepard, R. N. (1987a). Toward a universal law of generalization for psychological science. Science, 237, 1317–1323.
Shepard, R. N. (1987b). Evolution of a mesh between principles of the mind and regularities of the world. In J. Dupré (Ed.), The latest on the best: Essays on evolution and optimality (pp. 251–275). Cambridge, MA: MIT Press.
Shepard, R. N. (1994). Perceptual-cognitive universals as reflections of the world. Psychonomic Bulletin & Review, 1(1), 2–28.
Shepard, R. N., & Arabie, P. (1979). Additive clustering: Representations of similarities as combinations of discrete overlapping properties. Psychological Review, 86, 87–123.
Shepard, R. N., Hovland, C. L., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs, 75(13) (Whole No. 517).
Shepard, R. N., Kilpatrick, D. W., & Cunningham, J. P. (1975). The internal representation of numbers. Cognitive Psychology, 7, 82–138.
Smolensky, P. (1988). Putting together connectionism—again. Behavioral and Brain Sciences, 11, 59–74.
Tenenbaum, J. B. (1996). Learning the structure of similarity. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 875–882). San Mateo, CA: Morgan Kaufmann.

Received April 24, 1997; accepted October 8, 1997.
LETTER
Communicated by Mike Casey
Classification of Temporal Patterns in Dynamic Biological Networks Patrick D. Roberts Neurological Sciences Institute, Oregon Health Sciences University, Portland, OR 97209, U.S.A.
A general method is presented to classify temporal patterns generated by rhythmic biological networks when synaptic connections and cellular properties are known. The method is discrete in nature and relies on algebraic properties of state transitions and graph theory. Elements of the set of rhythms generated by a network are compared using a metric that quantifies the functional differences among them. The rhythms are then classified according to their location in a metric space. Examples are given, and biological implications are discussed.

1 Introduction

The development of theoretical tools is essential for a thorough understanding of complex biological systems. Networks of interconnected neurons are systems that can display very complex dynamics. Much of the literature concerning neural networks focuses on the strengths of synaptic connections (Hebb, 1949) when analyzing network dynamics. The emphasis on synaptic currents tends to downplay other membrane properties of the neurons that make important contributions to the activity of biological networks (Llinás, 1988; Maynard, 1972; Mulloney & Selverston, 1974). In recent years it has become clear that the modulation of cellular properties is as important to the behavior of a biological network as the efficacy of synaptic connections. In the ongoing behavior of an animal, neural modulation by hormones or neuropeptides is crucial to the maintenance and control of neuronal activity (Harris-Warrick, Nagy, & Nusbaum, 1992). A formalism designed to aid in our understanding of biological neural systems should take into account such modulation for a thorough description of behavior in specific applications.

Recent experimental studies of central pattern generators have revealed that the rhythmic output of small neural circuits can switch between elements of a restricted group of stable patterns (Dickinson & Moulins, 1992). In fact, the idea of a pattern generator as an anatomically distinct unit has given way to considerations of functional multiple-pattern generators that are dynamically sculpted out of a fixed anatomical network (Getting, 1989). These dynamic biological networks appear to be controlled by both diffuse

Neural Computation 10, 1831–1846 (1998)  © 1998 Massachusetts Institute of Technology
application of neuromodulators and direct synaptic stimulation of target cells. Many modeling studies designed to explore the behavior of small neural networks have concentrated on the membrane conductances (Hodgkin & Huxley, 1952) of the component neurons. In these models, the details of multiple membrane currents in the component neurons are simulated using coupled differential equations. Although these efforts reveal the behavior of a network under perturbations of system parameters, the computational overhead of these models forbids a full classification of rhythmic output. One may reduce the computational load while maintaining the essence of relevant biological mechanisms by discretizing the details of membrane conductances.

The logical complexity of discrete networks was first analyzed by McCulloch and Pitts (1942) in their study of neural networks. These early studies have since been expanded into a general study of discrete automata following technological advances in digital computers. A large class of discrete systems involves cellular automata (Wolfram, 1986; Kauffman, 1993), which have been shown to generalize to discrete neural network models (Garzon, 1990). Methods to analyze these systems have proved useful in the study of general complex systems (Weisbuch, 1991), but such methods must be adapted to the idiosyncrasies of multiple-pattern generators.

Deterministic models, such as boolean automata, have been used to make specific predictions about the relation between a network's architecture and its output. Classifications of networks have been made using these methods based on symmetries (Glass, 1975a) and the output dynamics (Wuensche & Lesser, 1992) of deterministic networks. However, deterministic systems do not include the multiple mechanisms that are needed to predict all possible output patterns of a given biological network. In order to scan the range of possibilities, one must study less tractable, nondeterministic networks. In nondeterministic automata, each state may make a transition to more than one final state. This implies that the mechanisms do not uniquely determine the outcome following any point in the dynamics of the system. In terms of symbolic dynamics, these systems can be described by shifts of finite type (Lind & Marcus, 1995).

Methods from automata theory have been applied to continuous-state dynamical systems in order to analyze their computational complexity (Crutchfield, 1994). Such methods have recently been applied to neural network models, yielding insights into the mechanisms implied by their input-output behavior (Casey, 1996), and to study the computations of biologically plausible networks (Kentridge, 1994). However, the previous research has proved difficult for experimentalists to use when they wish to know what temporal patterns a specific dynamic biological network is capable of generating. Although discrete methods have been applied to study central pattern generators (Caianiello & Ricciardi, 1967; Glass & Young, 1979; Huerta, 1996), a method to classify rhythmic
behavior in terms of relations between different temporal sequences is a necessary addition to the theoretician's toolbox.

The purpose of this article is to introduce two new tools for the study of multiple-pattern generators. First, cellular properties of constituent neurons are expressed as discrete transitions on equal footing with synaptically induced transitions. Second, similarities between rhythmic patterns are quantified in order to classify the possible patterns generated by a given network with known synaptic connectivity and cellular properties. The set of rhythmic patterns forms a metric space where functionally similar patterns appear in clusters, and each cluster defines a functional mode of the system. Significant biological mechanisms that differentiate clusters can then be investigated to help understand how the network navigates through behavioral options.

In the next section we make the necessary definitions for this approach and present the scheme for identifying rhythmic patterns. The following section investigates the properties of the space of rhythmic patterns, followed by a section with examples of dynamic biological networks. We conclude with a discussion of some open mathematical questions and biological implications of this approach.

2 Neural States and Transition Graphs

Central pattern generators are often found in motor circuits that require sustained bursts of action potentials to control muscle activity. Many of the neurons that participate in pattern generation exhibit plateau potentials—long depolarized states that arise from a bistable membrane potential (Hartline, 1987). In the language of Rinzel (1987), the methods developed here focus on the slow mechanisms involved in the generation of rhythmic behavior. Individual action potentials are considered to have only a secondary effect on pattern generation.

2.1 Neural States. We will describe the output state of our model neurons in terms of standard McCulloch-Pitts neural units (McCulloch & Pitts, 1942), c_n, where n = 1, . . . , N, and N is the number of neural units under consideration. Associated with each neuron is an indicator of the membrane potential that takes its values in a binary state space, c̃_n ∈ Z_2, where here the excited state (c̃_n = 1) means that the neuron is firing a burst of action potentials. A network N is a set of N two-state neurons augmented with synaptic connections and cellular properties: N = {c_1, c_2, . . . , c_N; S, C}, where S is a set of synaptic connections and C is a set of cellular properties. Both S and C are formalized as transitions between states, and the elements of S sometimes take on additional parameters that denote their efficacy. A neural state (or configuration; Botelho & Garzon, 1991) is defined by Getting (1989) as the spatial distribution of activity within the network at any given moment in
time. For example, if at time t neuron c_1 is firing a burst of action potentials, c_2 is silent, c_3 is silent, and so forth, then the neural state will be represented as ν(t) = [c̃_1 c̃_2 c̃_3 . . . c̃_N](t) = [100 . . .](t). The set of time-independent neural states (ignoring the time component) is denoted by V = {ν_1, . . . , ν_h}, where h = 2^N. We have given the neural states a subscript to identify them without an explicit reference to time. The cellular properties and synaptic connections provide the mechanisms (M) of the network that induce transitions between neural states. For each element of S and C there corresponds a collection of ordered pairs of neural states that represent transitions, {M(ν′_1|ν), M(ν′_2|ν), . . .}. In each transition, M(ν′|ν), the initial state is given by ν, the final state is ν′, and the mechanism that accounts for the transition is labeled by M.

2.2 Transition Graphs. The full set of transitions generated by all elements of S and C in the network N is denoted by E. Together with the set of neural states V, the transitions define a directed graph G(V, E) that represents the dynamics of the network (Carré, 1979). In the graphical representation, the set V contains the vertices, and the set E contains the edges. Since there is typically more than one outgoing edge from each vertex in G(V, E), the system is nondeterministic and can be thought of as a Markov chain where the precise values of the nonvanishing probabilities are not specified (sometimes called a topological Markov chain; Lind & Marcus, 1995).

We now define useful cellular properties and their associated transitions. Cellular properties are used here in the sense that there are transitions that individual neurons can undergo independent of external influences arising from synaptic connections. Depending on the conductance properties of a biological membrane, individual neurons can spontaneously terminate a plateau, remain tonically active, or oscillate between active and inactive states. The cellular property called plateau termination is interpreted as follows: A neuron (c_n) that can terminate a plateau, but cannot activate from a resting state without external influence, will contribute exactly one transition to the set E, C^PT_n(ν′|ν) ∈ E, where c̃′_n = 0 if c̃_n = 1.

Two other cellular properties, tonic activity and endogenous oscillation, will be interpreted analogously as transitions. The first transition, C^TA_n(ν′|ν), works in the opposite direction as plateau termination; if the neuron c_n is inactive, then it becomes active. The second cellular property, C^EO_n(ν′|ν), results in one of two transitions, depending on the initial state of neuron c_n: if c̃_n = 1, then c̃′_n = 0; otherwise c̃′_n = 1.

The information about the synaptic connectivity of the network adds more transitions to the set E. Suppose there is a synaptic connection between neurons c_m and c_n in the network N = {. . . , c_m, . . . , c_n, . . . ; S, C}, where c_n is the postsynaptic neuron. A synaptic transition associated with an inhibitory chemical synapse in S is denoted as S^I_mn(ν′|ν), where the activity of the presynaptic neuron can silence the activity of the postsynaptic neuron so
that $\tilde{c}'_n = 0$ if $\tilde{c}_n = 1$ and $\tilde{c}_m = 1$. The transition associated with an excitatory chemical synapse, $S^E_{mn}(\nu'|\nu)$, is defined in an analogous manner, except that the postsynaptic neuron is excited from an inactive state. Electrical synaptic connections cause the neurons to equalize their membrane potential, so they either excite or inhibit, depending on the state of the presynaptic neuron. If the synapse is a gap junction, then either neuron can take the role of postsynaptic and presynaptic neuron. Suppose that $\tilde{c}_n = 1$ and $\tilde{c}_m = 0$. Then two transitions are associated with this connection: one with final state $\tilde{c}'_n = 1$ and $\tilde{c}'_m = 1$, and the other with final state $\tilde{c}'_n = 0$ and $\tilde{c}'_m = 0$.

Important subgraphs of $G(V, E)$ for the study of rhythmic behavior result from the elimination of edges that do not satisfy certain constraints. For instance, if the network is in a state where one of the neurons is heavily inhibited by synaptic currents, then it is biologically implausible that the neuron would undergo a transition from a silent state to firing a burst of action potentials. One may introduce thresholds for the neurons of the network and eliminate those transitions that violate a rule based on the sum of ionic currents in each neuron (McCulloch & Pitts, 1942). Let $i_n \in \mathbb{Z}$ be a postsynaptic current due to activity of neuron $n$, where $i_n > 0$ for an excitatory and $i_n < 0$ for an inhibitory postsynaptic current. Assign a value $C$ to each type of transition such that $C > 0$ if the transition represents a neuron changing from an inactive to an active state and $C < 0$ otherwise. A transition is eliminated by the synaptic constraint if
$$C + \sum_n i_n \geq \theta, \quad \text{for } C < 0, \qquad \text{or} \qquad C + \sum_n i_n \leq -\theta, \quad \text{for } C > 0, \tag{2.1}$$
where $\theta$ is a threshold and the sum is over all active presynaptic neurons.
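To make these definitions concrete, the following minimal Python sketch builds the transition sets and applies the synaptic constraint of equation 2.1. The state encoding, function names, and current table are illustrative assumptions made for this presentation, not the article's own software (an implementation by the author is available by ftp; see the discussion).

```python
# A minimal sketch, assuming states are encoded as tuples of 0/1; all names
# here are illustrative, not taken from the article's software.
from itertools import product

def plateau_termination(n, N):
    """C^PT_n: from every state with neuron n active, add the 1 -> 0 transition."""
    return {(nu, nu[:n] + (0,) + nu[n+1:])
            for nu in product((0, 1), repeat=N) if nu[n] == 1}

def tonic_activity(n, N):
    """C^TA_n: from every state with neuron n inactive, add the 0 -> 1 transition."""
    return {(nu, nu[:n] + (1,) + nu[n+1:])
            for nu in product((0, 1), repeat=N) if nu[n] == 0}

def endogenous_oscillation(n, N):
    """C^EO_n: neuron n may flip its state in either direction."""
    return plateau_termination(n, N) | tonic_activity(n, N)

def excitatory_synapse(m, n, N):
    """S^E_mn: an active presynaptic neuron m can excite an inactive neuron n."""
    return {(nu, nu[:n] + (1,) + nu[n+1:])
            for nu in product((0, 1), repeat=N) if nu[m] == 1 and nu[n] == 0}

def inhibitory_synapse(m, n, N):
    """S^I_mn: an active presynaptic neuron m can silence an active neuron n."""
    return {(nu, nu[:n] + (0,) + nu[n+1:])
            for nu in product((0, 1), repeat=N) if nu[m] == 1 and nu[n] == 1}

def survives_constraint(nu, nu2, n, currents, theta=0):
    """Synaptic constraint (equation 2.1) for the neuron n that changes state.
    currents[(m, n)] is the postsynaptic current i_n contributed by presynaptic
    neuron m (> 0 excitatory, < 0 inhibitory); the sum runs over active
    presynaptic neurons. C = +1 for a 0 -> 1 transition, C = -1 for 1 -> 0."""
    C = 1 if nu2[n] > nu[n] else -1
    i_sum = sum(currents.get((m, n), 0) for m, act in enumerate(nu) if act == 1)
    eliminated = (C < 0 and C + i_sum >= theta) or (C > 0 and C + i_sum <= -theta)
    return not eliminated
```

The full edge set E is then simply the union of these transition sets over all elements of S and C, with the constraint applied edge by edge.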
3 Rhythms and Rhythm Space

We will concentrate on the functional output of a given anatomical network of distinguishable neurons, because dynamic biological networks tend to reconfigure themselves to alter their output. The emphasis on functional output is achieved by defining a rhythm as a cycle through the transition graph, $G(V, E)$, in which each neuron participating in the functional network changes state exactly twice and the states of all other neurons in the anatomical network are held fixed. This definition contains the simplification that makes our analysis possible, and there are several plausible reasons why allowing only two state changes for each neuron is not too restrictive. Double bursts within a cycle of central pattern generators appear to be quite uncommon, and in the cases reported in the literature (see Miller, 1987), it can be argued that the observed double burst is actually a long plateau that has
its spikes suppressed by inhibitory input during the middle of the plateau. A good reason for allowing only one burst for each neuron per cycle is that plateau mechanisms are slow (Rinzel, 1987), so that small neural networks complete a full cycle before any neuron has recovered from its last burst. Larger networks may have a tendency to break up into smaller functional subnetworks, but this issue requires more detailed investigation.

In order to count the maximum possible number of rhythms generated by a network of a given size $N$, let us consider the extreme case where each state shares an edge with exactly $N$ other states. This would be the transition graph of a network composed of $N$ endogenous oscillators. Each rhythm of the network is a cycle of $2N$ transitions. To count these rhythms, note that the number of orderings of these transitions is equal to the number of permutations of $2N$ elements, $(2N)!$. Dividing out the redundancy by rotations through each cycle yields $(2N-1)!$ rhythms. A classification of cyclic dynamics on transition graphs corresponding to networks has been carried out previously (Glass, 1977). Every cyclic path is associated with a coordinate sequence (Gilbert, 1958), that is, the sequence of neurons that change state at each time step. According to this scheme, any two cycles that have the same coordinate sequence are equivalent by a symmetry of the transition graph. In dynamic biological networks, the assignment of cellular properties to individual neurons breaks this symmetry, so another method must be devised for classification.

3.1 Distances Between Rhythms. The functional similarity relevant to many neural systems is the sequence of bursts generated by the composite neurons. Comparisons between rhythms will be accomplished by introducing a distance function onto the set of rhythms to quantify the functional differences between them. Two rhythms are defined to be neighbors if their coordinate sequences differ by the transposition of adjacent elements. For instance, if a rhythm contains the following sequence of neural states,
$$\cdots [\ldots \tilde{c}_m \ldots \tilde{c}_n \ldots][\ldots \tilde{c}'_m \ldots \tilde{c}_n \ldots][\ldots \tilde{c}'_m \ldots \tilde{c}'_n \ldots] \cdots, \tag{3.1}$$
then its coordinate sequence contains the elements $(\ldots mn \ldots)$. A neighboring rhythm shares all the states in the cycle but one,
$$\cdots [\ldots \tilde{c}_m \ldots \tilde{c}_n \ldots][\ldots \tilde{c}_m \ldots \tilde{c}'_n \ldots][\ldots \tilde{c}'_m \ldots \tilde{c}'_n \ldots] \cdots, \tag{3.2}$$
and has a coordinate sequence with two adjacent elements transposed, $(\ldots nm \ldots)$. Since these two rhythms differ by only one neural state, it is natural to consider them as neighbors in the set of rhythms. Our definition of distance is as follows: the distance between rhythms $r_1$ and $r_2$ is the minimum number of adjacent transpositions of coordinate sequence elements needed to transform $r_1$ into $r_2$. This operation is symmetric with respect to the rhythms and satisfies the triangle inequality, so it qualifies as a distance. Thus, we
may define a rhythm space, $\mathcal{R}$, as the set of rhythms $R$ together with the metric $d: R \times R \to \mathbb{Z}^+ \cup \{0\}$ defined above. Note that this definition of distance differs from the Hamming distance (Hamming, 1986), which measures the overlap between two strings of binary numbers; the distance function introduced here counts adjacent transpositions applied to the neural states that do not overlap. Two neighboring rhythms have a strong functional similarity because most of the activation sequence is preserved under translation by one step in rhythm space. This observation is particularly important when considering networks that consist of motor neurons or a network that drives motor neurons. The sequence of neuronal activation will translate into a sequence of muscle contractions. Two rhythms that lie far apart in rhythm space will correspond to very different movement patterns, and neighboring rhythms will generate similar movements.

4 Rhythmic Examples

Examples of rhythm-generating networks are presented in this section to illustrate the concepts of rhythm space. The first two examples are simple oscillator networks that generate all possible rhythms when no constraints are applied to the transition graphs; they demonstrate the complex geometric structure of rhythm space. The last example is from a well-known biological system (Getting, 1989) that has given rise to modern conceptual approaches to central pattern generators. The network is of interest here because it contains multicomponent synapses, and our analysis reveals the specific components that are necessary for the generation of viable rhythms.

4.1 Oscillator Networks. The simplest rhythm-generating network to be investigated here is an abstract two-neuron network, for which the maximum number of rhythms is $(2N-1)! = 6$. An example of such a network would be two endogenous oscillators connected by excitatory chemical synapses, $\mathcal{N} = \{c_1, c_2; \{S^E_{1,2}, S^E_{2,1}\}, \{C^{EO}_1, C^{EO}_2\}\}$, as shown at the top of Figure 1A. The transition graph is shown in the middle of Figure 1A. The six rhythms are represented by the following sequences of states, each followed by its associated coordinate sequence:
$$\begin{array}{ll}
r_1: [10][00][01][00], \ (1221) & r_2: [11][10][00][01], \ (2121) \\
r_3: [11][01][00][01], \ (1221) & r_4: [10][11][01][00], \ (2121) \\
r_5: [11][10][11][01], \ (2211) & r_6: [10][11][10][00], \ (2211).
\end{array} \tag{4.1}$$
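The distance between rhythms such as those listed in equation 4.1 can be computed mechanically. In the sketch below (an illustration under the stated definitions, not the article's implementation), a rhythm is a cyclic tuple of states, an adjacent transposition swaps two consecutive transitions involving different neurons (which replaces the single intermediate state), and the distance is found by breadth-first search.

```python
# Illustrative sketch: rhythms as cyclic tuples of states; distances by BFS
# over adjacent transpositions. Not the article's software.
from collections import deque

def canonical(states):
    """Remove the rotational freedom by picking the least rotation."""
    return min(states[i:] + states[:i] for i in range(len(states)))

def changed_neuron(s1, s2):
    diff = [n for n in range(len(s1)) if s1[n] != s2[n]]
    assert len(diff) == 1          # each transition changes exactly one neuron
    return diff[0]

def neighbors(states):
    """Rhythms one adjacent transposition away: swapping the order of two
    consecutive transitions that involve different neurons replaces the
    single state between them."""
    L = len(states)
    for k in range(L):
        s0, s1, s2 = states[k], states[(k + 1) % L], states[(k + 2) % L]
        a, b = changed_neuron(s0, s1), changed_neuron(s1, s2)
        if a != b:
            mid = list(s0); mid[b] = s2[b]
            out = list(states); out[(k + 1) % L] = tuple(mid)
            yield canonical(tuple(out))

def distance(ra, rb):
    """Minimum number of adjacent transpositions transforming ra into rb."""
    start, goal = canonical(ra), canonical(rb)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        r, d = queue.popleft()
        if r == goal:
            return d
        for nb in neighbors(r):
            if nb not in seen:
                seen.add(nb); queue.append((nb, d + 1))

# r1, r2, and r4 of equation 4.1; both distances should equal 1.
r1 = ((1, 0), (0, 0), (0, 1), (0, 0))
r2 = ((1, 1), (1, 0), (0, 0), (0, 1))
r4 = ((1, 0), (1, 1), (0, 1), (0, 0))
print(distance(r1, r4), distance(r1, r2))   # -> 1 1
```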
Although rhythms are cyclical, we have adopted the convention of writing the rhythm beginning with the transition in which the first neuron changes state from 0 to 1. The structure of r1 tells us that it has two neighbors because there are only two transpositions of adjacent coordinate sequence
Figure 1: Two neurons. (A) Top: A network of two neurons connected by excitatory chemical synapses, where each neuron is an endogenous oscillator. Middle: The transition graph associated with the network. Paths through the transition graph that represent two rhythms generated by this network are shown. If the transition that silences neuron 1 in $r_1$ is transposed with an adjacent transition that excites neuron 2, then rhythm $r_1$ is changed into rhythm $r_2$. Bottom: The rhythm space of the network. (B) Top: The same anatomical network as in A, but here the neurons exhibit plateau termination. Middle: The transition graph for this network includes the only cyclic path that represents the rhythm $r_5$. Bottom: The rhythm space contains only one point.
elements that transform the rhythm. The transposition of the first coordinate sequence element with the second transforms r1 into r4 . Thus, the distance from r1 to r4 is d(r1 , r4 ) = 1. These two rhythms are superimposed on the transition graph of Figure 1A. The transposition of the last two coordinate sequence elements yields r2 . Continuing in this manner, one is able to map
out the rhythm space as shown at the bottom of Figure 1A. It is interesting to note that $r_2$ and $r_4$ are oriented, in that they make a loop through the states in opposite directions, while the other four rhythms are nonoriented, since they double back onto themselves. Thus, the distance from $r_2$ to $r_4$ must be greater than one, since the transformation must pass through a nonoriented rhythm in order to switch the orientation. A similar network that generates only one rhythm is shown in Figure 1B for comparison. Here the cellular properties have been changed from endogenous oscillation to plateau termination, so that there is no mechanism that can excite either neuron if the neural state is [00]. Thus, the rhythm space consists of a single rhythm. These two examples show how changes in the cellular properties can change the potential behavior of anatomically equivalent networks.

The distances between rhythms generated by a two-neuron network are never great enough to illustrate the differences between rhythms separated by a large distance in rhythm space. Connecting a third oscillating neuron with gap junctions to our example network demonstrates the complexity of rhythm space. Cycles on the transition graph of a three-neuron network may be depicted by cyclical paths on cubes (Glass, 1975a, 1975b), as shown in Figure 2A. The figure shows an arbitrary cluster of rhythms in the space of rhythms generated by three oscillating neurons. Each line between cubes represents a distance of one, so that the rhythms form a contiguous set in rhythm space. The figure is organized so that the rhythms on each row are members of dynamical equivalence classes determined by symmetries of the cube (Glass, 1975a). In the classification scheme presented here, rhythms are considered similar if they are near neighbors in rhythm space, as shown in the figure. The full rhythm space for three oscillating neurons contains 120 rhythms and exhibits a complicated topology, with several loops and interconnections between dynamical equivalence classes.

Since the paths on the transition graph are defined to be cycles of length $2N$, the rhythms may also be represented by hexagons (or $2N$-gons for $N$-neuron networks). Each hexagon in Figure 2B represents a symmetry class of rhythms, where the inscribed solid lines connect coordinate sequence elements involving the same neuron. Beside each hexagon is a representative example of the class from Figure 2A, and the letters surrounding the hexagons correspond to coordinate sequence elements of each example rhythm. The symmetry classes extend laterally to form loops in rhythm space through neighboring members of adjacent symmetry classes. The number of members in each symmetry class is given beside the corresponding hexagon and is computed by counting the symmetries of the inscribed figure in each hexagon, modulo rotations.

4.2 Multicomponent Synapses in a Biological Network. The escape reflex of the marine mollusk Tritonia diomedea is a swimming response generated by a rhythmic neural network (see Getting, 1989, for review). Swim-
Figure 2: The structure of rhythm space. (A) Each cube displays a cycle that represents a rhythm of the three-oscillator network described in the text. Solid lines between cubes represent a distance of one in rhythm space. (B) A hexagonal representation of rhythms and the associated path on the cube. Numbers quantify the members in each dynamical equivalence class.
ming consists of alternating dorsal and ventral flexions correlated with bursts of activity in two motor neuron pools. The alternating bursts of motor neurons are driven by a premotor central pattern generator (Dorsett, Willows, & Hoyle, 1976), which consists of three neuronal types interconnected with both inhibitory and excitatory chemical synapses. An interesting aspect of the pattern generator is that it contains multicomponent synapses (Getting, 1983): synapses that generate both excitatory and inhibitory postsynaptic potentials on different time courses (see Figure 3A). There are three populations of premotor interneurons: the dorsal swim interneurons (DSI) that drive the dorsal motor pool, the ventral swim interneurons (VSI) that drive the ventral motor pool, and C2 interneurons that aid in generating a functionally appropriate rhythmic pattern. For a viable swim response, the DSI and VSI must fire out of phase with each other
Figure 3: Rhythms of the swim response network. (A) The pattern-generating network studied in the text. Filled circles are inhibitory synapses, and T-bars are excitatory synapses. Mixed synapses indicate multiple components. (B) Five rhythms generated by three different choices of synaptic components. Connecting lines represent a distance of one in rhythm space. (C) The same rhythms as B represented by cycles on cubes.
during some portion of the swim cycle. Otherwise the dorsal and ventral muscles will simply co-contract, immobilizing the mollusk. An external source drives the DSI neuron to tonic excitability, thus initiating the swimming response. There are no identified pacemaker neurons; thus the pattern is generated completely by the synaptic interactions.

The properties of this network can be investigated using the methods of this article by carrying out the analysis on the circuit diagram in Figure 3A. The network will be represented by $\mathcal{N}_{Tritonia} = \{DSI, VSI, C2; S, \{C^{TA}_{DSI}, C^{PT}_{VSI}, C^{PT}_{C2}\}\}$. There are two approaches to take when dealing with the multicomponent synapses. First, both excitatory and inhibitory synapses may be included to represent the multicomponent synapse,
$$S = \{S^I_{DSI,VSI}, S^I_{C2,DSI}, S^I_{C2,VSI}, S^I_{VSI,DSI}, S^E_{DSI,VSI}, S^E_{C2,DSI}, S^E_{C2,VSI}, S^E_{DSI,C2}\}, \tag{4.2}$$
where in $S^I_{M,N}$, $M$ is the presynaptic neuron and $N$ is the postsynaptic neuron. Alternatively, several analyses may be run with different combinations of a single synapse representing each multicomponent synapse, to determine which components are necessary for the pattern generation:
$$S(a_1, a_2, a_3) = \{S^{a_1}_{DSI,VSI}, S^{a_2}_{C2,DSI}, S^{a_3}_{C2,VSI}, S^I_{VSI,DSI}, S^E_{DSI,C2}\}, \tag{4.3}$$
where $a_i = I$ or $E$. The first approach generates a large, contiguous cluster of 44 rhythms, which can be compared to the experimental results (Getting, 1983). Only 1 or 2 of the rhythms represent the sequence of neural state transitions that are observed in the biological network, and in 9 rhythms VSI fires only in phase with DSI, leading to inappropriate motor behavior. One may try to reduce the number of rhythms by applying the synaptic constraint (see equation 2.1), but if the threshold is set at $\theta = 0$, then no rhythms survive the constraint, and with $\theta = 1$, all 44 of the rhythms survive. The reason for this all-or-nothing result from the constraint is that our method does not take into account the time courses of the multicomponent synapses. The constraint counts the currents of all synaptic conductances simultaneously, thus eliminating more than is realistic when the threshold is set low. Otherwise, all of the synaptic currents can act without regard to temporal ordering to generate many spurious rhythms that do not follow from the synaptic time courses of the biological network. In order to tease out the important components of the synaptic dynamics, we take the second approach of using a single synaptic connection to represent each multicomponent synapse. A systematic study of the network with synaptic connections $S(a_1, a_2, a_3)$ reveals that only a limited set of com-
ponents are necessary for rhythm generation. With no synaptic constraints applied, $S(I, E, I)$ and $S(I, I, I)$ generate no rhythms. Under the synaptic constraint with $\theta = 0$, three of the synapse sets generate rhythms. These are $S(E, I, E)$, $S(E, I, I)$, and $S(I, I, E)$, which generate the rhythms shown in Figure 3B. The rhythms generated by $S(E, I, E)$ are numbered 1, 2, and 3; the network defined by $S(E, I, I)$ generates 1, 4, and 5; and $S(I, I, E)$ generates number 3. The rhythm most consistent with neurophysiological recordings (Getting, 1983) is number 2; rhythm 3 may be consistent with the last cycle of the escape response. Taken together these rhythms form a single cluster in rhythm space, showing that all three connectivity sets generate functionally similar rhythms. Yet on closer inspection, it becomes clear that rhythm 5 does not allow DSI and VSI to fire out of phase with each other. Note that this rhythm lies farthest from the rhythms consistent with physiological recordings. This example shows the functional meaning of the measure used in rhythm space. The rhythms that are within a distance of 1 or 2 of the rhythm that represents normal activity of the swim cycle are still able to generate a viable escape response. Farther away, the system generates rhythms that are functionally distant in the sense that the viable response is not adequately performed. This analysis tells us that certain phases of the synaptic responses are more important than others for generating viable rhythms.
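The systematic study over the single-component sets of equation 4.3 amounts to a loop over the eight sign choices. The sketch below illustrates the enumeration, reusing the transition builders from the sketch in section 2; the neuron indexing is an assumption, and the extraction of rhythms (searching the induced graph, with the cellular properties included, for six-step cycles in which each neuron changes state exactly twice) proceeds with the machinery described above.

```python
# Illustrative enumeration of the synapse sets S(a1, a2, a3) of equation 4.3,
# reusing excitatory_synapse and inhibitory_synapse from the earlier sketch.
# The indexing of the three interneuron pools is an assumption.
from itertools import product

DSI, VSI, C2 = 0, 1, 2

def synapse_set(a1, a2, a3, N=3):
    kind = {'I': inhibitory_synapse, 'E': excitatory_synapse}
    return (kind[a1](DSI, VSI, N) | kind[a2](C2, DSI, N) | kind[a3](C2, VSI, N)
            | inhibitory_synapse(VSI, DSI, N) | excitatory_synapse(DSI, C2, N))

for a1, a2, a3 in product('IE', repeat=3):
    edges = synapse_set(a1, a2, a3)
    print(f"S({a1},{a2},{a3}): {len(edges)} candidate synaptic transitions")
```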
5 Discussion

The main objective of this work is to fashion tools that are useful in the study of biological neural networks that exhibit complicated behavior. Rather than using the continuous approach of dynamical systems, discrete methods have been chosen so that a classification theorem for rhythmic networks begins to emerge. This will help to fill the gap between simulation studies of biological networks and a global understanding of the systems. The lack of classification theorems for nonlinear dynamical systems in higher dimensions leaves one to probe the parameter space of a conductance-based simulation to gain an overall mapping of the expected behavior. The intent here is not to develop another method to analyze coupled oscillators, but to develop a way of understanding the behavior of complex automata with a rule base that can be tailored to biological problems. The result uncovers some interesting mathematical questions as well as opens the door to some potentially useful biological applications. A software implementation of the concepts presented here is available by anonymous ftp at reed.edu/reed/users/proberts.

An important mathematical question relates to the structure of rhythm space. We have been unable to derive a simple formula to measure the distance between two rhythms as described above. The calculations have been done in an iterative fashion by finding the nearest neighbors of the rhythms of interest and continuing until a region of rhythm space large enough to contain all of the rhythms is mapped out. Since the number of elements in rhythm space increases as $(2N-1)!$, this method becomes impractical for any but the smallest networks. Another open question is the greatest possible distance between two rhythms, in terms of the number of neurons. Rhythm space appears to have a rich and regular structure, and more research is needed to resolve these issues.

Once a network has been analyzed to classify the rhythms so that their positions in rhythm space have been determined, it might be useful to rank the rhythms in terms of probabilities, from the most likely to the least. At every state from which more than one transition is possible, a probability may be assigned to each transition, dependent on cellular and synaptic factors. Such an assignment has already been introduced in the constraints, where we eliminated transitions by effectively deeming them "impossible." This approach could be made more precise by taking the product of the transition probabilities in a rhythm to compute the relative probability of each rhythm. Such a ranking would be useful for predicting observations in real biological networks, for estimating the changes that need to be made to alter the output sequence, and for anticipating what rhythmic changes can be expected as certain transition probabilities are varied.

The introduction of probabilities can also help to compare rhythms that involve different numbers of neurons. In its existing form, rhythm space is useful only for comparing rhythms involving the same number of neurons. This is due to the restriction that in a rhythm, each neuron changes state exactly twice. One could allow a rhythm to be expanded with subrhythms involving subsets of the neurons considered in the main rhythm. Such an expansion would convert an $N$-rhythm into an $(N+N')$-rhythm, where $N' \leq N$ is the number of neurons in the subrhythm. Due to the large number of possible expansions, the introduction of probabilities would help to choose only the most prominent rhythms for comparison. We expect that after considerations of each neuron's burst length and recovery period, there is a certain optimum rhythmic period that would suppress most of the generated rhythms.

As parameters such as transition probabilities are introduced into the formalism, one moves away from formal classification and into biological modeling. To be sure, a rhythm as defined above is not what one observes in recordings of neuronal activity in dynamic biological networks. In order to convert rhythms into a form that can be compared with data, the time courses of the transition mechanisms must be considered. Each time step of a rhythm lasts as long as the time course of the next transition. Varying degrees of precision can be introduced until the converted rhythm best matches observations within experimental error. An application to the stomatogastric ganglion (Johnson & Hooper, 1992) using this approach has led to experimental predictions (Roberts, 1997) revealing the existence of mechanisms that would not be obvious from the study of a single rhythm. Thus, the formalism presented here not only treads on some rich mathemat-
ical territory, but can aid our understanding of the mechanisms involved in dynamic biological networks.

Acknowledgments

I thank Gin McCollum and Jan Holly for discussions and many helpful suggestions and John Taylor for pointing out the methods of symbolic dynamics. In addition, several ideas presented here were inspired by participation in the Santa Fe Institute's Summer School on Complex Systems, 1994. This research was supported by National Institutes of Health grant R01-DC02482 awarded to G. McCollum.

References

Botelho, F., & Garzon, M. (1991). On dynamical properties of neural networks. Complex Systems, 5, 401–413.
Caianiello, E. R., & Ricciardi, L. M. (1967). Reverberations and control of neural networks. Kybernetik, 4, 33–40.
Carré, B. (1979). Graphs and networks. Oxford: Oxford University Press.
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Comp., 8, 1135–1178.
Crutchfield, J. P. (1994). The calculi of emergence: Computation, dynamics, and induction. Physica D, 75, 11–54.
Dickinson, P. S., & Moulins, M. (1992). Interactions and combinations between different networks in the stomatogastric nervous system. In R. M. Harris-Warrick, E. Marder, A. I. Selverston, & M. Moulins (Eds.), Dynamic biological networks (pp. 139–160). Cambridge, MA: MIT Press.
Dorsett, P. A., Willows, A. O. D., & Hoyle, G. (1976). The neuronal basis of behavior in Tritonia: IV. The central origin of a fixed action pattern demonstrated in the isolated brain. J. Neurobiol., 4, 287–300.
Garzon, M. (1990). Cellular automata and discrete neural networks. Physica D, 45, 431–440.
Getting, P. (1983). Mechanisms of pattern generation underlying swimming in Tritonia: II. Network reconstruction. J. Neurophysiol., 49, 1017–1035.
Getting, P. (1989). Emerging principles governing the operation of neural networks. Ann. Rev. Neurosci., 12, 185–204.
Gilbert, E. N. (1958). Grey codes and paths on the n-cube. Bell Syst. Tech. J., 37, 815–826.
Glass, L. (1975a). Classification of biological networks by their qualitative dynamics. J. Theor. Biol., 54, 85–107.
Glass, L. (1975b). Combinatorial and topological methods in nonlinear chemical kinetics. J. Chem. Phys., 63, 1325–1335.
Glass, L. (1977). Combinatorial aspects of dynamics in biological systems. In U. Landman (Ed.), Statistical mechanics and statistical methods in theory and applications (pp. 585–611). New York: Plenum.
Glass, L., & Young, R. (1979). Structure and dynamics of neural network oscillators. Brain Res., 179, 207–218.
Hamming, R. W. (1986). Coding and information theory. Englewood Cliffs, NJ: Prentice-Hall.
Harris-Warrick, R. M., Nagy, F., & Nusbaum, M. P. (1992). Neuromodulation of the stomatogastric networks by identified neurons and transmitters. In R. M. Harris-Warrick, E. Marder, A. I. Selverston, & M. Moulins (Eds.), Dynamic biological networks (pp. 87–137). Cambridge, MA: MIT Press.
Hartline, D. K. (1987). Plateau potential. In G. Adelman (Ed.), Encyclopedia of neuroscience (pp. 955–956). Boston: Birkhauser.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 108, 37–77.
Huerta, R. (1996). A finite automata model of spiking-bursting neurons. Int. J. Bifurcation and Chaos, 4, 705–714.
Johnson, B. R., & Hooper, S. L. (1992). Overview of the stomatogastric nervous system. In R. M. Harris-Warrick, E. Marder, A. I. Selverston, & M. Moulins (Eds.), Dynamic biological networks (pp. 1–30). Cambridge, MA: MIT Press.
Kauffman, S. A. (1993). The origins of order. Oxford: Oxford University Press.
Kentridge, R. W. (1994). Symbols, neurons, soap-bubbles and the neural computation underlying cognition. Minds and Machines, 4, 439–449.
Lind, D., & Marcus, B. (1995). An introduction to symbolic dynamics and coding. Cambridge: Cambridge University Press.
Llinás, R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242, 1654–1664.
Maynard, D. M. (1972). Simpler networks. Ann. N.Y. Acad. Sci., 193, 59–72.
McCulloch, W. S., & Pitts, W. (1942). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5, 115–133.
Miller, J. P. (1987). Pyloric mechanisms. In A. I. Selverston & M. Moulins (Eds.), The crustacean stomatogastric system (pp. 109–136). Berlin: Springer-Verlag.
Mulloney, B., & Selverston, A. I. (1974). Organization of the stomatogastric ganglion of the spiny lobster. J. Comp. Physiol., 91, 1–74.
Rinzel, J. (1987). A formal classification of bursting mechanisms in excitable systems. In E. Teramoto & M. Yamaguti (Eds.), Mathematical topics in population biology, morphogenesis and neurosciences (pp. 261–281). Berlin: Springer-Verlag.
Roberts, P. D. (1997). Classification of temporal patterns in the stomatogastric ganglion. Neurosci., 81, 281–296.
Weisbuch, G. (1991). Complex systems dynamics. Redwood City, CA: Addison-Wesley.
Wolfram, S. (1986). Theory and applications of cellular automata. Singapore: World Scientific.
Wuensche, A., & Lesser, M. J. (1992). The global dynamics of cellular automata. Reading, MA: Addison-Wesley.

Received April 10, 1997; accepted December 12, 1997.
LETTER
Communicated by Helge Ritter
Kernel-Based Equiprobabilistic Topographic Map Formation
Marc M. Van Hulle
K. U. Leuven, Laboratorium voor Neuro- en Psychofysiologie, Campus Gasthuisberg, B-3000 Leuven, Belgium
We introduce a new unsupervised competitive learning rule, the kernel-based maximum entropy learning rule (kMER), which performs equiprobabilistic topographic map formation in regular, fixed-topology lattices, for use with nonparametric density estimation as well as nonparametric regression analysis. The receptive fields of the formal neurons are overlapping radially symmetric kernels, compatible with radial basis functions (RBFs); but unlike other learning schemes, the radii of these kernels do not have to be chosen in an ad hoc manner: the radii are adapted to the local input density, together with the weight vectors that define the kernel centers, so as to produce maps of which the neurons have an equal probability to be active (equiprobabilistic maps). Both an “online” and a “batch” version of the learning rule are introduced, which are applied to nonparametric density estimation and regression, respectively. The application envisaged is blind source separation (BSS) from nonlinear, noisy mixtures.

1 Introduction

The self-organizing (feature) map (SOM) algorithm (Kohonen, 1982, 1995) is aimed at developing, in an unsupervised way, a mapping from an input space V onto an equal- or lower-dimensional discrete lattice with a regular and fixed topology.¹ It was originally conceived for nonparametric regression analysis, whereby the converged topographic map was intended to capture the principal dimensions of the input space (Ritter, Martinetz, & Schulten, 1992; Mulier & Cherkassky, 1995; Kohonen, 1995, p. 152). It has also been regarded as a discrete, nonparametric model of the input probability density, insofar as it was not intended to model the fine structure of the density distribution (Kohonen, 1995). Even with regard to this restriction, there is a more fundamental problem with density estimation, however. When the neighborhood function has vanished, the SOM algorithm converges toward a mapping that will minimize, on average, the distortion

¹ An undirected graph in which every nonborder vertex has the same, fixed number of incident edges and which usually appears in the form of an array with a rectangular or simplex topology.
Neural Computation 10, 1847–1871 (1998) © 1998 Massachusetts Institute of Technology
metric used for quantizing V space into N disjoint Voronoi partitionings (usually the mean squared error, MSE).² As a result, the weight density at convergence will not be a linear function of the input density (Ritter & Schulten, 1986; Ritter, 1991). Furthermore, the algorithm often yields neurons that are never active (“dead” units). In an attempt to remedy this problem and to produce maps in which the neurons have an equal probability to be active (equiprobabilistic maps), several researchers were inspired by Grossberg's idea (1976) of adding a “conscience” to frequently winning neurons, so that they feel “guilty” and reduce their winning rate (for references, see Van Hulle, 1997b). For example, in conscience learning (DeSieno, 1988), one records, for each neuron i, the number of times it has won the competition and adds a scaled version of this quantity (a bias, in fact) to the distance metric used in the nearest-neighbor rule. In a more recent attempt, the learning rate of each neuron is related to its activation probability (Bauer, Der, & Herrmann, 1996). Although adding a “conscience” may be a good heuristic to achieve such maps, it is no more than that, since equiprobable quantization is in general not equivalent to MSE minimization (Van Hulle & Martinez, 1993).

Recently we introduced a more direct way to density estimation by optimizing an information-theoretic criterion, instead of a distortion criterion, using an “online” learning rule called the (lattice-based) maximum entropy learning rule (MER) (Van Hulle, 1995, 1997a). We formally proved that in the one-dimensional case, MER yields an equiprobable quantization for any N (Van Hulle, 1995), and that in the multidimensional case, it yields a quantization that approximates an equiprobable one for large N (Van Hulle, 1997a, 1997c). A drawback of MER is that we cannot use it for nonparametric regression or dimensionality-reduction purposes, since the dimension of the lattice is required to match that of the input space in which it is developed. Furthermore, the definition of quantization region is too complicated to be practical in the higher-dimensional case.

In this article, we introduce a new unsupervised learning rule that performs equiprobabilistic topographic map formation in regular, fixed-topology lattices, for use with nonparametric density estimation as well as nonparametric regression analysis. The receptive fields (RFs) of the neurons are kernel based and compatible with the ones used in radial basis function (RBF) networks (Moody & Darken, 1988; Poggio & Girosi, 1990) but, unlike other learning schemes, the radii are adapted so as to model the local input density. The rule is computationally more efficient than MER, owing to its much simpler definition of the quantization region. We will introduce an “online” as well as a “batch” version of the learning rule, which we will call the kernel-based maximum entropy learning rule (kMER). The applica-
² In fact, the “batch” SOM algorithm is similar to the LBG algorithm (Linde, Buzo, & Gray, 1980) for building scalar and vector quantizers, except for the neighborhood function (Luttrell, 1991).
tion we will explore is blind source separation (BSS) from nonlinear, noisy mixtures.

2 Kernel-Based Maximum Entropy Learning Rule

Consider a k-dimensional lattice A, with a regular and fixed topology, in d-dimensional input space $V \subseteq \mathbb{R}^d$. Each neuron $i \in A$ has a weight vector $w_i \in V$ and a radius $\sigma_i$, which define a hyperspherical RF region $S_i$ centered at $w_i$, and a binary activation
$$\mathbb{1}_i(v) = \begin{cases} 1 & \text{if } v \in S_i \\ 0 & \text{if } v \notin S_i. \end{cases} \tag{2.1}$$
The RF centers $w_i$ are adapted so as to produce a topology-preserving mapping; the radii $\sigma_i$ are adapted so as to produce a lattice whose neurons have an equal probability to be active (an equiprobabilistic map), by balancing the frequencies of their supra- and subthreshold activations. In other words, the $w_i$'s and $\sigma_i$'s are adapted so as to achieve a topographic map that maximizes the (unconditional) information-theoretic entropy of its N binary neurons' suprathreshold activations.

As in the SOM algorithm, the critical factor in generating topology-preserving mappings is the use of a neighborhood function $\Lambda$ for updating the RF centers $w_i$, that is, a decreasing function of the minimum distance, in lattice-space coordinates, between the suprathreshold-activated neuron i and its neighboring neurons. As a result, neighboring neurons cooperate and specialize for similar input signals, and the lattice organizes into an orderly, topology-preserving state. Apart from a cooperative element, there is also a competitive element in the learning process of the $w_i$'s, since we want to fully “pave” the input space with our RFs according to the statistical properties of the input distribution. Since the RFs can overlap, and since their radii are not defined by the other neuron weights, as in the Moody and Darken approach, among others, we cannot simply rely on a winner-take-all scheme, as in the original SOM algorithm, or on common vertices in the lattice quadrilaterals, as in MER, to pave the input space even when the lattice is in a topology-preserving state. To solve this problem, we define a fuzzy code membership function $\Xi_i$ (Zadeh, 1965) (we first consider
Figure 1: Kernel-based equiprobabilistic topographic map formation. (A) Neuron i has a localized receptive field $K(v - w_i, \sigma_i)$, centered at $w_i$ in input space $V \subseteq \mathbb{R}^d$; the radius $\sigma_i$ defines the spherical RF region $S_i$. (B) Two neurons with overlapping RF regions $S_i$ and $S_j$; an input $v$ that falls in the region shared by both RFs leads to smaller updates of the corresponding RF centers $w_i$ and $w_j$.
“online” learning),
$$\Xi_i(v) = \frac{\mathbb{1}_i(v)}{\sum_{k \in A} \mathbb{1}_k(v)}, \quad \forall i \in A, \tag{2.2}$$
so that $0 \leq \Xi_i(v) \leq 1$ and $\sum_i \Xi_i(v) = 1$. We update the RF center $w_i$ proportionally to $\Xi_i$ and in the general direction of $v$ (see Figure 1B). Hence, inputs that are shared by different neurons (the shaded region in Figure 1B) will lead to smaller weight updates. As a result, there will be a competitive element in the learning process, since the RF centers will tend to be pulled apart by the unbalanced weight update strengths. Note that since in general the RFs will overlap, $\Xi_i$ is a measure of the probability that neuron i belongs to the subset of activated neurons, rather than just the probability that neuron i is active. By virtue of its interpretation, and the fact that it is a binary entity normalized with respect to the number of active neurons, $\Xi_i$ is quite different from the type of fuzzy membership function used in the fuzzy-clustering literature (see Rose, Gurewitz, & Fox, 1990), where each datum is associated in probability with each cluster. When the average total (cluster) assignment cost is a linear summation over all clusters, and no assumptions are made about the data distribution (that is, when the principle of maximum entropy is applied), the “fuzzy membership in clusters” appears in the form of the softmax function, a normalized exponential, normalized with respect to all clusters. This membership function definition has been adopted by Rose et al. (1990), when deriving their optimal vector quantizer, and more recently by Graepel, Burger, & Obermayer (1997), when deriving their soft topographic vector quantizer (STVQ).

Besides the kernel centers $w_i$, we also update the kernel radii $\sigma_i$. The idea is to adjust them in such a way that, at convergence, the probability for neuron i to be active is $P(\mathbb{1}_i(v) \neq 0) = \frac{\rho}{N}$, $\forall i$, with $\rho$ a scale factor. In this way, the kernel radii will be adapted to the local input densities: $\hat{p}(w_i) = \frac{\rho}{N \, \mathrm{Vol}(\sigma_i)}$, with $\hat{p}(.)$ the density estimate at the kernel center and $\mathrm{Vol}(.)$ the volume of the hypersphere, with radius $\sigma_i$, that constitutes the neuron's RF region. When $\rho = 1$, we strive toward mutually exclusive neural activations, given the restrictions set by the hyperspherical shape of the RF regions; when $\rho$ is larger, we in fact impose a smoothness constraint, in terms of the degree of overlap between the RF regions, on the topographic map that will be developed; and when $\rho < 1$ we obtain a low degree of RF overlap and a more sparsely covered input space. The major motivation for choosing $\rho > 1$ is that, in data-bound applications, there are simply not enough training data available to account for detailed density models. The obvious alternative is to apply regularization or smoothing by increasing the degree of overlap between the RF regions, in a manner similar to (variable) kernel smoothing (Silverman, 1992). In the opposite case, when $\rho < 1$, we obtain a more sparse but, in essence, (low-)equiprobabilistic representation of the input space, a property that is desired in “sparse-distributed” sensory coding
schemes (for a discussion, see Field, 1994): every neuron has the same but low probability of being active, and only a small number of neurons respond to any input sample.

The kMER can be formalized as follows, in “online” mode:
$$\Delta w_i = \eta \sum_{j \in A} \Lambda(i, j, \sigma_\Lambda) \, \Xi_j(v) \, \mathrm{Sgn}(v - w_i), \quad \forall i \in A, \tag{2.3}$$
with $\mathrm{Sgn}(.)$ the sign function taken componentwise, and $\sigma_\Lambda$ the neighborhood range (in lattice coordinates) of the neighborhood function $\Lambda(.)$, which is normalized to unity, $\sum_{j \in A} \Lambda(i, j, \sigma_\Lambda) = 1$; and
$$\Delta \sigma_i = \eta \left( \frac{\rho_r}{N} \, (1 - \mathbb{1}_i(v)) - \mathbb{1}_i(v) \right), \quad \forall i \in A, \tag{2.4}$$
with $\rho_r = \frac{\rho N}{N - \rho}$ (see Figure 1B). In “batch” mode, for a set of M input samples, we can first sum the $\Delta w_i$'s and the $\Delta \sigma_i$'s over all samples before actually performing the updates. In appendix A, we show that the average of kMER equations 2.3 and 2.4 converges, since a Lyapunov function exists on which gradient descent is performed.
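Equations 2.1 through 2.4 translate directly into code. The following NumPy sketch of one “online” step is an illustration under assumed conventions (a rectangular lattice given by its coordinate array, a gaussian neighborhood function); it is not the article's own implementation.

```python
# A minimal NumPy sketch of one "online" kMER update (equations 2.1-2.4).
import numpy as np

def kmer_step(v, W, sigma, grid, eta, sigma_lambda, rho):
    """v: (d,) input sample; W: (N, d) kernel centers; sigma: (N,) radii;
    grid: (N, k) lattice coordinates of the N neurons."""
    N = W.shape[0]
    code = (np.linalg.norm(v - W, axis=1) <= sigma).astype(float)   # eq. 2.1
    total = code.sum()
    xi = code / total if total > 0 else code                        # eq. 2.2
    # Gaussian neighborhood function Lambda(i, j, sigma_Lambda), evaluated in
    # lattice coordinates and normalized to unity over j.
    dl = np.linalg.norm(grid[:, None, :] - grid[None, :, :], axis=2)
    lam = np.exp(-dl**2 / (2.0 * sigma_lambda**2))
    lam /= lam.sum(axis=1, keepdims=True)
    W = W + eta * (lam @ xi)[:, None] * np.sign(v - W)              # eq. 2.3
    rho_r = rho * N / (N - rho)
    sigma = sigma + eta * (rho_r / N * (1.0 - code) - code)         # eq. 2.4
    return W, sigma

# Example: a 5 x 5 lattice trained on uniform 2-D inputs.
rng = np.random.default_rng(0)
grid = np.stack(np.meshgrid(np.arange(5.0), np.arange(5.0)), -1).reshape(-1, 2)
W, sigma = rng.uniform(0, 1, (25, 2)), rng.uniform(0, 0.1, 25)
for t in range(20000):
    W, sigma = kmer_step(rng.uniform(0, 1, 2), W, sigma, grid, 0.001, 2.5, 2.0)
```

At convergence, the fraction of samples for which each neuron's indicator fires should approach $\rho/N$, which can be checked empirically by averaging the indicator over a held-out sample set.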
3 Nonparametric Density Estimation

Since kMER is aimed at producing an equiprobable quantization, the converged map can be used for nonparametric density estimation purposes. Let $K(v - w_i, \sigma_i)$ be a radially symmetric kernel for neuron i, with center $w_i$ and radius $\sigma_i$. At convergence, the radius of each kernel will be such that $P(\mathbb{1}_i(v) \neq 0) = \frac{\rho}{N}$, $\forall i \in A$ (see appendix B). Hence, the probability density function underlying the input samples, $p(v)$, can be estimated as
$$\hat{p}(v) = \sum_{i=1}^{N} \frac{K(v - w_i, \rho_s \sigma_i)}{Z_i} = \sum_{i=1}^{N} \frac{\exp\left(-\frac{\|v - w_i\|^2}{2 (\rho_s \sigma_i)^2}\right)}{Z_i}, \tag{3.1}$$
when radially symmetric gaussian kernels are used, with $\rho_s$ a factor with which the radii $\sigma_i$ obtained with kMER can be scaled, and $Z_i$ a proper normalizing factor so that $\int_V \frac{K(.,.)}{Z_i} \, dv = \frac{1}{N}$. Note that we consider a separate scale factor $\rho_s$ for the kernels since, in this way, we put density estimation in a format traditionally used in variable kernel density estimation (Silverman, 1992). We will quantify the effects of $\rho_s$ and $\rho_r$, since eventually both control the degree of smoothing performed by the kernels in the density estimate, albeit in a different way.
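For gaussian kernels the estimate of equation 3.1 is straightforward to evaluate. The sketch below is illustrative; the normalization $Z_i = N \, (2\pi(\rho_s \sigma_i)^2)^{d/2}$ is chosen so that each kernel integrates to $1/N$ over $\mathbb{R}^d$, and the converged W and sigma from the previous sketch are assumed.

```python
# Illustrative evaluation of the density estimate of equation 3.1.
import numpy as np

def kmer_density(v, W, sigma, rho_s=2.0):
    N, d = W.shape
    s = rho_s * sigma                             # scaled radii rho_s * sigma_i
    sq = np.sum((v - W)**2, axis=1)               # ||v - w_i||^2
    Z = N * (2.0 * np.pi * s**2)**(d / 2.0)       # N times each kernel integral
    return float(np.sum(np.exp(-sq / (2.0 * s**2)) / Z))
```

Since each of the N terms integrates to $1/N$, the resulting estimate integrates to one over the input space.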
As an example, we reconsider the quadrimodal density function (see Figure 2A), which we have used before for comparing the density estimation performance of MER and the SOM algorithm (Van Hulle, 1997c). This function is obtained in the following way. The $(v_1, v_2)$ plane is divided into four equally sized quadrants. Within each quadrant, the density function is generated by considering two independent product distributions, one for each v dimension. The product distributions are generated by taking the product of two uniformly and independently distributed random numbers. The analytic equation of the two-dimensional product distribution is then $(-\log v_1)(-\log v_2)$ for $(v_1, v_2) \in [0, 1)^2$. The quadrimodal distribution is in turn obtained by choosing one quadrant over the other with equal probability. The distribution is unbounded and comprises four heavily skewed but disjunct modes separated by sharp transitions (discontinuities), which makes it difficult to quantize. The support of the distribution is bounded within the square $[-1, 1)^2$.

We further consider the same $N = 24 \times 24$ planar lattice, with a rectangular topology, and the same random weight initialization and set of input samples as in Van Hulle (1997c). In addition, we initialize the radii randomly by sampling homogeneously and independently the uniform distribution $[0, 0.1)$. We use a small, fixed learning rate $\eta$ and perform “online” learning.³ We take $\eta = 0.001$, use a gaussian neighborhood function, and decrease its range in the following way:
$$\sigma_\Lambda(t) = \sigma_{\Lambda 0} \exp\left(-2 \sigma_{\Lambda 0} \frac{t}{t_{max}}\right), \tag{3.2}$$
with t the present time step, $t_{max}$ the maximum number of time steps, and $\sigma_{\Lambda 0}$ the range spanned by the neighborhood function at $t = 0$; we take $t_{max} = 2{,}000{,}000$ and $\sigma_{\Lambda 0} = 12$.

3.1 Lattice and RF Dynamics. The evolution of the weights of the lattice neurons and their RF regions is shown in Figure 3, for $\rho_r = 2$. We observe that kMER achieves a topology-preserving mapping. The distribution of the neuron weights at convergence is clearly adapted to the input distribution. The radii are also well adapted to achieve an equiprobable quantization, since the standard deviation of the activation probabilities with respect to $\frac{1}{N}$ is $3.47 \times 10^{-4}$, when normalizing the probabilities so that their sum equals unity.

3.2 Density Estimation Performance. We proceed with these $w_i$'s and $\sigma_i$'s as indicated in equation 3.1 with $\rho_s = 2$.
1854
Marc M. Van Hulle
Figure 2: Two-dimensional product distribution (A) and nonparametric models obtained with kMER when the radii are adapted (B) and not adapted (C). The theoretical function (A) is unbounded and discontinuous and its support bounded by the unit square [−1, 1)2 . All functions shown are plotted in steps of 1 . pd = probability density. 25
Kernel-Based Equiprobabilistic Topographic Map Formation
1855
Figure 3: Temporal evolution of the neuron weights (left column) and the RF regions (right column) of a 24 × 24 lattice with a rectangular topology using the “online” version of kMER. The outer squares outline the support of the input density function shown in Figure 2A. The values given below the squares represent time.
1856
Marc M. Van Hulle
Table 1: Density Estimation Performance of kMER, with Adapted Radii and Fixed Radii. Note: The second column lists the mean squared error (MSE) performance for the (ρs , ρr ) combination indicated, and the third column lists the same but for the best combination found (MSEbest ); a dash (-) is used when the corresponding configuration does not apply. The fourth and fifth columns list 1MSE—the relative difference (in %) between the worst and best MSE values obtained in the indicated ρs or ρr range (between square brackets), when either ρs or ρr is allowed to vary. Rule MSE (ρs , ρr ) MSEbest (ρs , ρr ) 1MSE (ρs , 2) 1MSE (2, ρr ) kMER 5.76×10−2 (2,2) 5.75×10−2 (2.1,2) 22.9% ([1,5],2) 14.1% (2,[1,5]) – – kMER f ixed 9.88×10−2 (2,–) 8.22×10−2 (0.75,–) 35.1% ([0.75,5],2)
Table 2: Density Estimation Performance of the SOM and MER Rules and the Variable Kernel Density Estimation Technique (VK). Rule MER SOM VK M = 500 VK M = 2000 VK M = 5000 VK M = 10,000
MSE 1.09 × 10−1 1.17 × 10−1 7.12 × 10−2 5.92 × 10−2 5.57 × 10−2 5.35 × 10−2
distribution Figure 2A;4 the MSE obtained in this way equals 5.76 × 10−2 . (All quantitative results are summarized in Tables 1 and 2.) The resulting density estimate is shown in Figure 2B. Previously, we obtained for MER and the SOM algorithm, using the same training and test sets, and the same weight initialization and neighborhood function, inferior MSE results (see Table 2, and the thick and thin dot-dashed lines in Figure 4). The corresponding density estimates are shown in Figures 8B and 8C in Van Hulle (1997c). In addition, we also implemented the variable kernel method (VK; Breiman, Meisel, & Purcell, 1977), a standard technique for nonparametric density estimation that operates on fixed data sets. For VK, we took the overall degree of smoothing h = M0.30.2 (Silverman, 1992, p. 45) and the sensitivity parameter α = 12 (Breiman et al., 1977); the pilot estimate was √ determined by using the (kth-) nearest neighbor technique, with k = M. The results are listed in Table 2 as a function of the sample size M, whereby the smaller data sets we used are subsets of the larger ones. The density estimate obtained for M = 5000 is shown in Fig-
4 Instead of the MSE, we could have used the Kullback-Leibler distance to quantify the density estimation performance.
Kernel-Based Equiprobabilistic Topographic Map Formation
1857
Figure 4: Mean squared error (MSE) performance of the kMER, SOM, and MER rules. The thick and thin continuous lines represent the MSE plotted as a function of the scale factor ρs for kMER when the radii are adapted (scale factor ρr = 2) and when they are kept fixed, respectively. The thick and thin dashed lines represent the theoretically expected MSE performance, in the case of an infinite number of neurons, for kMER when the radii are adapted and when they are kept fixed, respectively. The horizontal thick and thin dot-dashed lines represent the MSE performance of MER and SOM, which do not depend on ρs but are plotted as a reference.
ure 8D in Van Hulle (1997c). As M increases, the MSE performance of VK gradually exceeds that of any of the other techniques listed since with VK, a kernel is placed at every input sample (hence, N ≡ M here). On the other hand, for the same reason, the computations rapidly get out of hand since, for each estimate b p(v), on the order of M2 kernel function evaluations have to be performed, whereas for kMER, only on the order of N2 evaluations have to be performed, with N ¿ M. In addition, since with VK the density estimate can be built only after the complete training set is available, online learning is not feasible. 3.3 Effect of Scale Factors ρs and ρr . We first determine the MSE performance as a function of the scale factor ρs ; the result is shown in Figure 4 (thick continuous line). As ρs approaches zero, the MSE performance rapidly deteriorates since the individual kernels increasingly will show through in the final density estimate. In the opposite case, when ρs becomes large, the density estimate becomes increasingly smoothened and thus less capable of reflecting the fine structure of the individual modes present in the original
1858
Marc M. Van Hulle
density function. The best MSE result for kMER is obtained for ρs = 2.1 and ρr = 2, that is, when “optimizing” for ρs in steps of 0.025 and ρr in steps of 1 (MSEbest in Table 1). In order to assess the effect ρs has on the MSE, we determine the relative difference between the worst and the best MSE values in the interval [1, 5], relative to the worst MSE value, 1MSE, with ρr fixed at 2 (fourth column in Table 1). Similarly, we determine 1MSE as a function of ρr , also in the interval [1, 5], but now with ρs fixed at 2 (fifth column in Table 1). The smaller effect of ρr can be understood in the following, albeit qualitative, way. Since ρr ∼ = ρ for N large, we have that when p(v) does not √ vary too much within each RF region Si , σi ∝ d ρ, whereas σi scales linearly with ρs in equation 3.1 3.4 Fixed-Radius kMER. In principle it is feasible to run the same simulations but with all radii kept fixed (kMER f ixed ), say, at a value that would be appropriate for a uniform distribution within the same unit square: σi = √ 2 , ∀i. (This value is also close to the average radius obtained in Nπ
the previous case when ρr = 2, that is, 4.87 × 10−2 .) The best MSE result is obtained for ρs = 0.75 (optimized in steps of 0.25), which is inferior to that obtained when the radii are adapted (compare the MSEbest results of kMER and kMER f ixed in Table 1). The obtained estimate is shown in Figure 2C: we observe that the overall quality of the estimate is also inferior. In addition, the choice of ρs is much more critical than in the previous case since its effect on the MSE is much larger (thin, continuous line in Figure 4; see also its 1MSE value in Table 1). 3.5 Theoretically Expected Performance. Finally, in order to assess the MSE performance for increasing numbers of neurons N, we consider the density estimate pbt (v) obtained for the asymptotic case of a continuous lattice with an infinite number of neurons. We change the index of each neuron for a position vector w and assume a proper normalization of each RF kernel. The sum in equation 3.1 is replaced by the integral: Z Kw (v − w, ρs σw ) p(w) dw, (3.3) pbt (v) = w∈V
when we Rassume that p(w) is a linear function of p(v). The radii σw are ρ , with Sw the RF region for the neuron at posisuch that v∈Sw p(v)dv = N tion w. When we determine this integral numerically, for example, for 2500 positions on a uniform 50 × 50 grid within the unit square [−0.98, 1)2 and for ρr = 2, and plot the MSE result as a function of ρs , we obtain the thick dashed curve in Figure 4. In case we keep the σw fixed at the same value as in the simulation of the previous paragraph, we obtain a regular convolution integral for equation 3.3. When we determine this integral numerically, also for 2500 positions, we obtain the thin dashed curve in Figure 4. Since R in the limiting case, limρs →∞ v (p(v) − pbt (v))2 dv = (p(v) − p(v))2 , that is,
Kernel-Based Equiprobabilistic Topographic Map Formation
1859
the variance of the theoretical curve, the two dashed curves will have the same asymptote (here, MSE= 0.146). When referring to the dashed curves, we observe that for a large range of ρs values, the use of adaptive kernels with kMER is beneficial also when the size N of the discrete lattice increases. 4 Nonparametric Rregression Since the dimensionality of the lattice can be different from that of the input space in which it is developed, we can use kMER for training a lattice for nonparametric regression purposes, as was done for the SOM algorithm (Ritter et al., 1992; Mulier & Cherkassky, 1995; Kohonen, 1995). For the sake of exposition, we first consider a two-dimensional, rectangular lattice sized N = 5 × 5 neurons in three-dimensional space. We use a set of M = 225 input vectors v = (v1 , v2 , v3 ) obtained by sampling the two-dimensional function shown in Figure 5A (adopted from Hwang, Lay, Maechle, Martin, & Schimert, 1994): f (v1 , v2 ) =
1.3356 (1.5(1 − v1 ) + exp(2v1 − 1) sin(3π(v1 − 0.6)2 ) + 5 (4.1) exp(3(v2 − 0.5)) sin(4π(v2 − 0.9)2 )),
in the v3 = 0 plane, using samples taken homogeneously and independently from the uniform distribution [0, 1)2 , and regard the resulting function values as the third coordinates of the input vectors, v3 = f (v1 , v2 ). We use the batch version of kMER with ρr = 1 and run the simulations for 10,000 epochs. The weights are randomly initialized by sampling the uniform distribution [0, 1)3 and the radii by sampling the uniform distribution [0, 0.1). The learning rate η = 0.001 and the initial range of the neighborhood function σ30 = 5. The converged neuron weights and the piecewise-linear regression surface defined by them are shown in Figure 5B; the converged, spherical RF regions are not shown. We observe that a topology-preserving mapping is achieved in the v3 = 0 plane (thick lattice in Figure 5B). The standard deviation with respect to an equiprobable quantization equals 1.25 × 10−2 (for normalized probabilities). This value improves when ρr increases: for ρr = 2 and 3 we obtain 3.97 × 10−3 and 1.41 × 10−3 , respectively. This is due to the fact that when ρr increases, more data points are considered in each RF region Si . The regression surface obtained for a lattice sized N = 10 × 10 neurons, σ30 = 10, and M = 1000, is shown in Figure 5C. Furthermore, for the sake of comparison, we can also rerun the same simulations but now for the SOM algorithm, using the same neighborhood function and range adaptation scheme as for kMER, but with η = 0.015. The results are displayed in Figures 5D and 5E. We observe that the regression surfaces obtained with the SOM algorithm are, qualitatively speaking, more irregularly shaped than those obtained with kMER. However, if we compare these regression sur-
1860
Marc M. Van Hulle
Figure 5: (A) Two-dimensional scalar function v3 = f (v1 , v2 ), which is sampled (M = 225) and used for regression purposes. (B,C) Regression surfaces obtained with kMER using lattices sized N = 5 × 5 neurons (B) and 10 × 10 neurons (C). (D,E) Regression surfaces obtained using the SOM algorithm for the same conditions as in (B) and (C), respectively. Each regression surface is obtained by filling the lattice quadrilaterals with piecewise-linear and opaque, triangular surfaces. The lattice is also shown in the v3 = 0 plane, for the sake of clarity (thick lines).
faces to those generated by the standard projection pursuit learning (PPL) regression technique, supplemented with Supersmoother, in order to obtain the regression surfaces, then the kMER results are inferior (see Figures 6c,d in Hwang et al., 1994). This is partly due to the piecewise linear interpolation used in the kMER-case for obtaining the regression surfaces but, more
Kernel-Based Equiprobabilistic Topographic Map Formation
1861
important, due to the fact that in PPL, an optimization technique is used in which the projection directions, onto which the data points are projected, are optimized sequentially in order to minimize the residual regression error. Hence, unlike the kMER case, PPL regression proceeds by using feedback from the current regression error. The kMER case could be cast into the PPLSupersmoother format, to achieve a better regression performance, but this goes beyond the scope of this article. Finally, we could also perform nonparametric regression fitting of a scalar function f (v) by considering all RF kernels P developed to be of equal height, or even by using normalized kernels i K(v − wi , .) = 1, and by taking the weighted sum of the RF outputs: f (v) ≈
X
Wi K(v − wi , σi ),
(4.2)
i∈A
with Wi scalar weights, which, in turn, can be learned by a simple supervised learning algorithm. Hence, in this way, the connection with RBF networks (Moody & Darken, 1988) or, more generally, with HyberBF networks using limited numbers of kernels (Poggio & Girosi, 1990) becomes clear.5 The individually adapted kernels then perform local smoothing of the interpolation function, depending on the local sample density, as done for kMER, instead of depending on the overall (MSE) regression performance (Poggio & Girosi, 1990) or the likelihood of the input samples (Xu & Jordan, 1993; Nowlan, 1990). In any case, an adaptive approach is expected to yield favorable results when used in a minimal network configuration. Similarly, vector functions can be regressed by considering the foregoing for each vector component separately. This application will be explored elsewhere. 5 Application to Blind Source Separation Since topographic maps perform a mapping from input to lattice space, and since with kMER the lattices will be equiprobabilistic, several candidate signal processing applications can be listed that rely on knowledge of the density distribution underlying the input signals. One particularly intriguing application, which has enjoyed a great deal of interest in the unsupervised learning community, is blind source separation (BSS) (for references, see Amari & Cardoso, 1997). The goal of BSS is to separate statistically independent source signals s(t) = [si (t)], i = 1, . . . , m, with t a time index, from the observed mixtures v(t) = F(s(t)), with F the unknown mixing function. The latter corresponds to an invertible linear transformation in the linear case: v(t) = Am s(t), with Am the m × m mixing matrix. The majority of the unsupervised learning algorithms that have been developed for 5 There is also a connection between (fixed-width) RBF networks and (fixed-width) kernel density estimation (Lowe, 1991).
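The following sketch illustrates how the Wi of equation 4.2 might be fit on top of a converged map. The gaussian kernel shape, the least-squares solution of the supervised step, and all function names are our assumptions; the text leaves the "simple supervised learning algorithm" unspecified.

import numpy as np

def gaussian_kernel(v, w, sigma):
    # one RF kernel K(v - w_i, sigma_i) per neuron; gaussian shape is an assumption
    d2 = np.sum((v[:, None, :] - w[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / (2.0 * sigma[None, :] ** 2))

def rbf_regression(v_train, f_train, w, sigma):
    # fit the scalar weights W_i of equation 4.2 by linear least squares,
    # given converged kMER centers w_i and radii sigma_i (assumed available)
    K = gaussian_kernel(v_train, w, sigma)      # M x N design matrix
    K /= K.sum(axis=1, keepdims=True)           # normalized kernels, sum_i K = 1
    W, *_ = np.linalg.lstsq(K, f_train, rcond=None)
    return W

def rbf_predict(v, w, sigma, W):
    K = gaussian_kernel(v, w, sigma)
    K /= K.sum(axis=1, keepdims=True)
    return K @ W                                 # f(v) ~ sum_i W_i K(v - w_i, sigma_i)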
5 Application to Blind Source Separation

Since topographic maps perform a mapping from input to lattice space, and since with kMER the lattices will be equiprobabilistic, several candidate signal processing applications can be listed that rely on knowledge of the density distribution underlying the input signals. One particularly intriguing application, which has enjoyed a great deal of interest in the unsupervised learning community, is blind source separation (BSS) (for references, see Amari & Cardoso, 1997). The goal of BSS is to separate statistically independent source signals s(t) = [si(t)], i = 1, . . . , m, with t a time index, from the observed mixtures v(t) = F(s(t)), with F the unknown mixing function. The latter corresponds to an invertible linear transformation in the linear case: v(t) = Am s(t), with Am the m × m mixing matrix. The majority of the unsupervised learning algorithms that have been developed for
BSS are aimed at obtaining the linear transformation that renders the estimated sources statistically independent. Since the underlying assumptions are essentially the same, there is an intimate link with linear independent component analysis (Comon, 1994). Equiprobabilistic topographic maps can be beneficial when used in BSS. Since each neuron is equally likely to be the "winner," by definition, we have the important property that the map's output density is uniformly distributed. Furthermore, since we consider rectangular lattices, the output density is factorizable along the lattice coordinates into statistically independent components (Kohonen, Raivio, Simula, Ventä, & Henriksson, 1996; Pajunen, Hyvärinen, & Karhunen, 1996; Lin, Grier, & Cowan, 1997). Hence, for each observed mixture v(t), we determine the winning neuron and consider its lattice coordinates to be the quantized estimates of the source signals, ŝi(t). As mentioned by Pajunen and coworkers, it would be interesting to see how a learning rule for equiprobabilistic topographic map formation performs in this case, since they applied only the SOM algorithm. As an example, we reconsider the one used by Pajunen and coworkers: two source signals, a sine wave and a uniformly distributed white noise signal within the range [−1, 1). The source signals are mixed by using the mixing matrix

Am = ( 0.7  0.3
       0.3  0.7 ),   (5.1)
after which the first component of the resulting mixture vector, termed z1, is nonlinearly distorted, v1 = z1³ + z1, so as to yield a nonlinear BSS problem; the second component, z2, is left unchanged, v2 = z2. We do not perform any whitening on v. We apply batch learning on a lattice sized 15 × 15 neurons, until 1000 epochs have elapsed, using essentially the same simulation setup for kMER and the SOM algorithm as before, with ρr = 2 for kMER. The converged lattices are shown in Figures 6A and 6B, for kMER and the SOM algorithm, respectively. In Figures 6C and 6D we show 50-sample excerpts of the original sine wave and noise signal, and their estimates obtained in the case of kMER, respectively. For the source estimates, the lattice coordinates obtained were converted to source signal values, for the sake of comparison. Moreover, we also consider the case where the mixtures are noisy: v1 = z1³ + z1 + n1 and v2 = z2 + n2, with n1 and n2 zero-mean gaussian white noise with standard deviation 0.1. For each case, we quantify the MSE performance of both kMER and the SOM algorithm, using the same test signal tracks of 500 samples. The results, summarized in Table 3, clearly show the advantage of kMER. The MSE performance could be improved further by interpolating between lattice coordinates when defining the winner.
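A minimal sketch of this nonlinear BSS setup is given below. The sine-wave period, the random seed, and the mapping of lattice rows and columns to the two sources are assumptions the article does not fix, and the winner is taken here simply as the minimum-Euclidean-distance neuron of a converged lattice.

import numpy as np

rng = np.random.default_rng(0)
t = np.arange(500)
s = np.stack([np.sin(2.0 * np.pi * t / 50.0),        # sine-wave source (period assumed)
              rng.uniform(-1.0, 1.0, size=t.size)])  # uniform white-noise source

A = np.array([[0.7, 0.3],
              [0.3, 0.7]])              # mixing matrix of equation 5.1
z = A @ s
v = np.stack([z[0] ** 3 + z[0],         # nonlinear distortion of the first mixture
              z[1]])                    # second mixture left unchanged

def estimate_sources(v, w, n_rows=15, n_cols=15):
    # quantized source estimates: lattice coordinates of the winner, for a
    # converged lattice with weights w of shape (n_rows * n_cols, 2)
    d2 = np.sum((v.T[:, None, :] - w[None, :, :]) ** 2, axis=2)
    winners = np.argmin(d2, axis=1)
    rows, cols = np.divmod(winners, n_cols)
    # rescale lattice coordinates to the source range [-1, 1] for comparison;
    # which axis corresponds to which source is an assumption
    return np.stack([2.0 * rows / (n_rows - 1) - 1.0,
                     2.0 * cols / (n_cols - 1) - 1.0])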
Figure 6: Blind source separation. (A,B) Lattices sized 15 × 15 neurons obtained with kMER (A) and the SOM algorithm (B), after 10,000 epochs have elapsed. (C,D) Excerpts of the original sine wave (thick line) and its estimate (thin line) (C), and of the original white noise signal (thick line) and its estimate (thin line) (D).
Finally, bear in mind that we have devised a heuristic BSS technique, which is expected to work only for mixture densities whose shapes allow for a natural mapping of rectangular lattices: if the source signals are uniformly distributed, the lattice coordinates will approximately obey the same distribution, and the equiprobabilistic map as a whole will be able to invert the true mixing function. Clearly, an implicit assumption is made here about the nature of the source densities, but in the linear case it seems that assuming any subgaussian source density will work (Moreau & Macchi, 1996), even if the original and estimated sources have different but still resembling densities (Cardoso, 1997), or, in the nonlinear case, if the mixing function is mildly nonlinear (Pajunen et al., 1996).
Table 3: MSE Performance of the SOM and kMER Rules in Nonlinear Blind Source Separation from Noiseless and Noisy Mixtures, Given the Training Set Size M.

                    Noiseless Mixtures             Noisy Mixtures
Rule    M      MSE(s1, ŝ1)   MSE(s2, ŝ2)     MSE(s1, ŝ1)   MSE(s2, ŝ2)
kMER    50     1.77 × 10^-2  7.81 × 10^-2    2.05 × 10^-2  8.17 × 10^-2
kMER    500    9.11 × 10^-3  3.50 × 10^-2    1.19 × 10^-2  3.94 × 10^-2
kMER    5000   8.74 × 10^-3  2.94 × 10^-2    9.13 × 10^-3  3.44 × 10^-2
SOM     50     1.96 × 10^-2  1.19 × 10^-1    2.73 × 10^-2  1.35 × 10^-1
SOM     500    1.66 × 10^-2  7.30 × 10^-2    1.76 × 10^-2  7.92 × 10^-2
SOM     5000   1.27 × 10^-2  6.42 × 10^-2    1.34 × 10^-2  7.06 × 10^-2
In any case, all bounded independent source signals have a rectangular joint density support that will be only mildly distorted when the mixing function is mildly nonlinear.

6 Discussion

We have introduced a new unsupervised competitive learning rule, the kernel-based maximum entropy learning rule, for equiprobabilistic topographic map formation in lattices with fixed and regular topologies. The original SOM algorithm (Kohonen, 1982, 1995) differs from kMER in the criterion optimized (distortion minimization versus entropy maximization) and in the type of receptive fields used (nonoverlapping, Voronoi-based RFs versus overlapping, kernel-based RFs). Furthermore, kMER can be used for nonparametric regression purposes, just like the SOM algorithm, but the former will lead to a more equitable weight distribution and, hence, a better density estimation performance. In contrast with the original SOM algorithm, which was intended to capture the main dimensions in the input distribution, its constructive variants, such as the ones using a dynamically defined topology (Kohonen, 1995) and the growing SOMs (Martinetz & Schulten, 1991; Fritzke, 1992, 1994), are believed to capture the fine structure of the input density distribution better. However, we hasten to add that this point is usually judged in terms of the absence of dead units for disconnected or nonconvex but mainly uniform density distributions. Fritzke's (1992) growing cell structures algorithm uses a fixed, simplex-based neighborhood topology and a Voronoi tessellation of the input space (using the minimum Euclidean distance metric). New units are inserted close to units that have too high an activation probability; existing units are deleted if their activation probabilities are too low. However, we know that aiming for an equiprobable quantization in this way can be only a heuristic, since it is in general not compatible with the minimum MSE
approach, which is tacitly assumed when a minimum Euclidean distance metric is used for defining the Voronoi partitionings (Van Hulle & Martinez, 1993). Furthermore, unlike kMER, the algorithm needs to keep track of the activation probabilities. The neural gas network (Martinetz & Schulten, 1991) also generates topology-preserving mappings, by using a shrinking neighborhood function that depends on the relative distances between the neuron weights, but it modifies the topology locally by inserting or deleting edges depending on their "age." The incremental RBF approach of Fritzke (1994) is different still, since it starts from a small, "maximally" ordered map and gradually inserts new units according to the same "age"-related metric as for the neural gas network. The main difference is the use of RBFs as receptive fields instead of Voronoi partitionings. Apart from the fact that in Fritzke's approach both the lattice topology and the lattice size are adapted, there is a connection with kMER, since the kernels in our approach can be replaced by RBFs for regression purposes, without loss of generality. However, contrary to Fritzke's approach, and also that of Moody and Darken (1988), we do not have to choose the RBF radii in an ad hoc fashion (such as the mean distance to the few nearest-neighbor neurons). Instead, the radii are treated as additional weight components (thresholds, in fact), and they are adapted together with the weight vectors that define the RBF centers. Finally, with respect to density estimation, since the radii of our RF kernels are adapted depending on the local input density, there is a connection with the variable kernel density estimation technique of Breiman et al. (1977), a widely used nonparametric density estimation technique. The basic idea of the latter is to place a unit-volume kernel at each input sample, taken from a finite set of M samples, but to allow the width of each kernel to vary with the local sample density. In our case, only a restricted number of kernels are allocated, which saves memory space; in addition, since we dispose of an online version of kMER, we can build the density estimate gradually, as the input samples become available. Furthermore, unlike kMER, the variable kernel technique starts from a pilot density estimate, which has to be provided by another technique, such as the kth nearest-neighbor method. A density estimate of this kind can be read off directly from a converged kMER map, as sketched below.
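The sketch below illustrates such a kMER-based density estimate. Gaussian kernels with equal mixture weights are an assumption; the article does not commit to a kernel shape for this purpose.

import numpy as np

def kmer_density_estimate(v, w, sigma):
    # density estimate from a converged kMER map: one gaussian kernel per
    # neuron, with radius sigma_i adapted to the local sample density;
    # equal mixture weights 1/N are an assumption
    d = w.shape[1]
    d2 = np.sum((v[:, None, :] - w[None, :, :]) ** 2, axis=2)
    norm = (2.0 * np.pi) ** (d / 2.0) * sigma ** d
    kernels = np.exp(-d2 / (2.0 * sigma[None, :] ** 2)) / norm[None, :]
    return kernels.mean(axis=1)   # average over the N neuron kernels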
Appendix A: Convergence

Formally, kMER equations 2.3 and 2.4 can be viewed as a Markov process in which the transition probabilities depend on the present input and on the weights and radii obtained at the previous time step. For a statistically stationary input density, we will now prove that the average of these learning rules converges, since it performs stochastic gradient descent on a positive definite cost function E. We first need a number of definitions. Since the code membership function 1i(v) is not differentiable at its border,
we replace it by the following "smooth" function:

Si(v, wi, σi) = 1 / ( 1 + exp( −β (σi² − ‖v − wi‖²) / 2 ) ),   ∀i ∈ A,   (A.1)
that is, the usual sigmoid function, with β the gain. The arguments of this function are chosen in such a way that it approaches 1i(v) in the limit β → ∞. The fuzzy code membership function Ξi of equation 2.2 is redefined as

Ξi^β(v) = Si(v) / Σ_{k∈A} Sk(v),   ∀i ∈ A,   (A.2)
and the following "local error" is ascribed to each neuron i:

ei^β = e_{wi}^β + e_{σi}^β,   ∀i ∈ A,   (A.3)

with

e_{wi}^β = Σ_{j=1}^{N} Λ(i, j, σΛ) Ξj^β(v) |v − wi|,

e_{σi}^β = ∫₀^{σi} ( −ρr/N + (1 + ρr/N) Si(v, wi, σi*) ) dσi*.
The average learning rules are then obtained in the following way:

⟨Δwi⟩_V = −η lim_{β→∞} ⟨ ∂e_{wi}^β / ∂wi ⟩_V = η ⟨ Σ_{j=1}^{N} Λ(i, j, σΛ) Ξj(v) Sgn(v − wi) ⟩_V,

⟨Δσi⟩_V = −η lim_{β→∞} ⟨ ∂e_{σi}^β / ∂σi ⟩_V = η ⟨ ρr/N − (1 + ρr/N) 1i(v) ⟩_V,   ∀i ∈ A,   (A.4)
after some algebraic manipulations.

Proposition 1. For a continuous, statistically stationary probability density p(v), v ∈ V ⊆ ℝ^d, a Lyapunov function exists on which the average kMER learning rules, equations A.4, perform gradient descent.
Proof. In order to investigate whether a Lyapunov function exists, we will calculate the derivatives of both average learning rules. We use the following shorthand notation:

⟨Δwi⟩_V = η ⟨Fi(wi, σi, v)⟩_V ≐ η F̄i(wi, σi),   (A.5)

⟨Δσi⟩_V = η ⟨Gi(wi, σi, v)⟩_V ≐ η Ḡi(wi, σi),   ∀i ∈ A,   (A.6)
for which the derivatives of F̄i and Ḡi form a matrix comprising four submatrices:

H = [ H1  H2
      H3  H4 ],   (A.7)

with H1 = [H1_ji ≐ ∂F̄i/∂wj], H2 = [H2_ji ≐ ∂Ḡi/∂wj], H3 = [H3_ji ≐ ∂F̄i/∂σj], and H4 = [H4_ji ≐ ∂Ḡi/∂σj],
since we also have to consider the cross-dependencies between the weight and radius updates.

Submatrix H1. Taking the derivative of F̄i with respect to wj, i ≠ j, we obtain

∂F̄i/∂wj = −(∂/∂wj) lim_{β→∞} ⟨ ∂e_{wi}^β / ∂wi ⟩_V = lim_{β→∞} ⟨ (∂/∂wj) Sgn(v − wi) Σ_{k=1}^{N} Λ(i, k, σΛ) Ξk^β(v) ⟩_V,   (A.8)

since it is allowed to interchange the gradient and the limit. The critical terms are the derivatives of Ξk^β:

∂Ξj^β/∂wj = β Ξj^β (1 − Ξj^β)(1 − Sj)(v − wj),

∂Ξk^β/∂wj = −β Ξk^β Ξj^β (1 − Sj)(v − wj),   k ≠ j,   (A.9)

after some algebraic manipulations. The term Ξj^β(1 − Sj) equals zero both for v ∈ Sj and for v ∉ Sj when β → ∞. This means that, in the limiting case, ∂Ξk^β/∂wj has the form ∞ · 0, which can be rearranged as ∞/∞. By using de l'Hôpital's rule, we obtain that the limit of this derivative equals zero and, hence, that ∂F̄i/∂wj = 0, i ≠ j. From the latter we conclude that the off-diagonal elements of the matrix H1 = [H1_ji], i ≠ j, are all equal to zero and, thus, that H1 is symmetric.
Submatrix H4. Taking the derivative of Ḡi with respect to σi, using Leibniz's differentiation theorem, we obtain

∂Ḡi/∂σi = −(∂/∂σi) lim_{β→∞} ⟨ −ρr/N + (1 + ρr/N) Si ⟩_V = (∂/∂σi) ( ρr/N − (1 + ρr/N) Pi ),   (A.10)
with Pi the probability that neuron i is active, for radius σi of its RF region; the last right-hand term is clearly nonzero. The derivatives of Ḡi with respect to σj, i ≠ j, are all zero (no dependence). Hence, also for this case, the off-diagonal elements of the matrix H4 are all zero.

Submatrix H2. Taking the derivative of Ḡi with respect to wi, we obtain

∂Ḡi/∂wi = −(∂/∂wi) lim_{β→∞} ⟨ −ρr/N + (1 + ρr/N) Si ⟩_V = lim_{β→∞} ⟨ (1 + ρr/N) Si (1 − Si) β (v − wi) ⟩_V,   (A.11)
for which the limit equals zero (by de l'Hôpital's rule). The derivatives of Ḡi with respect to wj, i ≠ j, are equal to zero as well (no dependence). Hence, H2 is the null matrix.

Submatrix H3. Taking the derivative of F̄i with respect to σj yields

∂F̄i/∂σj = −(∂/∂σj) lim_{β→∞} ⟨ ∂e_{wi}^β / ∂wi ⟩_V = lim_{β→∞} ⟨ Sgn(v − wi) (∂/∂σj) Σ_{k=1}^{N} Λ(i, k, σΛ) Ξk^β(v) ⟩_V,   (A.12)

for which the critical terms in the sum are the derivatives of Ξk^β:

∂Ξj^β/∂σj = β Ξj^β (1 − Ξj^β)(1 − Sj) σj,

∂Ξk^β/∂σj = −β Ξk^β Ξj^β (1 − Sj) σj,   k ≠ j,   (A.13)

both of which become equal to zero in the limit β → ∞ after de l'Hôpital's rule is applied. Hence, H3 is the null matrix.

In summary, since the global matrix H (see equation A.7) is symmetric (it is thus a Hessian), a Lyapunov function exists on which the average learning rules perform gradient descent.
Appendix B: Equiprobable Quantization

Proposition 2. Given a statistically stationary input probability density p(v), the kernel radii developed with average kMER are guaranteed to converge to a lattice in which the neurons have an equal probability of being active.

Proof.
We have the following equilibrium point:

⟨Δσi⟩_V = 0 = ⟨ (ρr/N)(1 − 1i(v)) − 1i(v) ⟩_V = (ρr/N)(1 − Pi) − Pi = ρr/N − (1 + ρr/N) Pi,   ∀i,   (B.1)
with Pi the probability that neuron i is active, given p(v), from which it follows that

Pi = ρr / (N + ρr) = ρ/N,   ∀i,   (B.2)
since ρr ≐ ρN/(N − ρ). Now, since ρ/N is a constant, independent of the lattice index i, and since Pi is a monotonically increasing function of the kernel radius of neuron i, equation B.2 is the sole solution resulting from ⟨Δσi⟩_V = 0. Furthermore, since Pi is a monotonically increasing function of σi, the solution is a stable one: for Pi = ρr/(N + ρr) + ε, with ε a disturbance from the equilibrium condition ⟨Δσi⟩_V = 0, we have that ⟨Δσi⟩_V = −(1 + ρr/N) ε, which opposes the disturbance for both signs of ε. Hence, we are guaranteed to obtain a set of equiprobable neurons at convergence.
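Proposition 2 can also be checked numerically. The sketch below simulates the online form of the radius rule implied by equation B.1 for a single neuron in one dimension with uniform input; the learning rate and iteration count are arbitrary choices of ours.

import numpy as np

rng = np.random.default_rng(0)
N, rho_r, eta = 25, 1.0, 0.0005
w, sigma = 0.5, 0.05            # one neuron, scalar input on [0, 1)

for _ in range(200000):
    v = rng.uniform(0.0, 1.0)
    active = float(abs(v - w) <= sigma)     # code membership 1_i(v)
    # online radius rule whose average is equation B.1
    sigma += eta * ((rho_r / N) * (1.0 - active) - active)

# P_i is the probability mass inside the RF, here 2 * sigma for the
# uniform density; the prediction is rho_r / (N + rho_r) ~ 0.038
print(2.0 * sigma, rho_r / (N + rho_r))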
Acknowledgments

I thank P. Pajunen, Helsinki University of Technology, Finland, for sharing his insights on blind source separation. As a research associate of the Fund for Scientific Research-Flanders (Belgium), I am supported by research grants received from the Fund for Scientific Research (G.0185.96), the National Lottery (9.0185.96), the Research Fund of the K.U.Leuven (F/95/138), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06), and the European Commission (ECVnet EP8212).

References

Amari, S.-I., & Cardoso, J.-F. (1997). Blind source separation: Semiparametric statistical approach. IEEE Trans. on Signal Processing, 45(11), 2692–2700.
Bauer, H.-U., Der, R., & Herrmann, M. (1996). Controlling the magnification factor of self-organizing feature maps. Neural Computation, 8, 757–771.
Breiman, L., Meisel, W., & Purcell, E. (1977). Variable kernel estimates of multivariate densities. Technometrics, 19, 135–144.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind signal separation. IEEE Signal Processing Letters, 4(4), 112–114.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36(3), 287–314.
DeSieno, D. (1988). Adding a conscience to competitive learning. In Proc. IEEE International Conference on Neural Networks (Vol. I, pp. 117–124). San Diego.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Fritzke, B. (1992). Growing cell structures: A self-organizing network in k dimensions. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks 2 (pp. 1051–1056). Amsterdam: Elsevier.
Fritzke, B. (1994). Fast learning with incremental RBF networks. Neural Processing Letters, 1(1), 2–5.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Review E, 56(4), 3876–3890.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybern., 23, 121–134.
Hwang, J.-N., Lay, S.-R., Maechler, M., Martin, R. D., & Schimert, J. (1994). Regression modeling in back-propagation and projection pursuit learning. IEEE Trans. on Neural Networks, 5(3), 342–353.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Kohonen, T., Raivio, K., Simula, O., Ventä, O., & Henriksson, J. (1996). Combining linear equalization and self-organizing adaptation in dynamic discrete-signal detection. In Proc. IJCNN (Vol. I, pp. 223–228). San Diego.
Lin, J. K., Grier, D. G., & Cowan, J. D. (1997). Faithful representations of separable distributions. Neural Computation, 9, 1305–1320.
Linde, Y., Buzo, A., & Gray, R. M. (1980). An algorithm for vector quantizer design. IEEE Trans. on Communications, COM-28, 84–95.
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. Automat. Contr., AC-22, 551–575.
Lowe, D. (1991). What have neural networks to offer statistical pattern processing? In Proc. SPIE Conference on Adaptive Signal Processing (pp. 460–471). San Diego.
Luttrell, S. P. (1991). Code vector density in topographic mappings: Scalar case. IEEE Trans. on Neural Networks, 2, 427–436.
Martinetz, T., & Schulten, K. (1991). A "neural-gas" network learns topologies. In T. Kohonen, K. Mäkisara, O. Simula, & J. Kangas (Eds.), Artificial neural networks (pp. 397–402). Amsterdam: North-Holland.
Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proc. 1988 Connectionist Models Summer School (pp. 133–143). San Mateo, CA: Morgan Kaufmann.
Moreau, E., & Macchi, O. (1996). High-order contrasts for self-adaptive source separation. Intl. J. Adaptive Control Signal Processing, 10, 19–46.
Mulier, F., & Cherkassky, V. (1995). Self-organization as an iterative kernel smoothing process. Neural Computation, 7, 1165–1177.
Nowlan, S. J. (1990). Maximum likelihood competitive learning. In D. S. Touretzky (Ed.), Advances in neural information processing systems, II (pp. 574–582). San Mateo, CA: Morgan Kaufmann.
Pajunen, P., Hyvärinen, A., & Karhunen, J. (1996). Nonlinear blind source separation by self-organizing maps. In S.-I. Amari, L. Xu, L.-W. Chan, I. King, & K.-S. Leung (Eds.), Progress in neural information processing (Vol. 2, pp. 1207–1210). New York: Springer-Verlag.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Ritter, H. (1991). Asymptotic level density for a class of vector quantization processes. IEEE Transactions on Neural Networks, 2(1), 173–175.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
Ritter, H., & Schulten, K. (1986). On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern., 54, 99–106.
Rose, K., Gurewitz, E., & Fox, G. C. (1990). Statistical mechanics and phase transitions in clustering. Physical Review Letters, 65(8), 945–948.
Silverman, B. W. (1992). Density estimation for statistics and data analysis. London: Chapman & Hall.
Van Hulle, M. M. (1995). Globally-ordered topology-preserving maps achieved with a learning rule performing local weight updates only. In Proc. IEEE NNSP95 (pp. 95–104). Cambridge, MA.
Van Hulle, M. M. (1997a). The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Computation, 9(3), 595–606.
Van Hulle, M. M. (1997b). Topology-preserving map formation achieved with a purely local unsupervised competitive learning rule. Neural Networks, 10, 431–446.
Van Hulle, M. M. (1997c). Nonparametric density estimation and regression achieved with topographic maps maximizing the information-theoretic entropy of their outputs. Biol. Cybern., 77, 49–61.
Van Hulle, M. M., & Martinez, D. (1993). On an unsupervised learning rule for scalar quantization following the maximum entropy principle. Neural Computation, 5, 939–953.
Xu, L., & Jordan, M. I. (1993). Unsupervised learning by EM algorithm based on finite mixture of gaussians. In World Congress on Neural Networks (Vol. 2, pp. 431–434). Portland, OR.
Zadeh, L. A. (1965). Fuzzy sets. Information and Control, 8, 338–353.

Received July 21, 1997; accepted January 9, 1998.
LETTER
Communicated by Allan Yuille
An Energy Function and Continuous Edit Process for Graph Matching
Andrew M. Finch, Richard C. Wilson, and Edwin R. Hancock
Department of Computer Science, University of York, York, YO1 5DD, U.K.
The contributions of this article are twofold. First, we develop a new nonquadratic energy function for graph matching. The starting point is a recently reported mixture model that gauges relational consistency using a series of exponential functions of the Hamming distances between graph neighborhoods. We compute the effective neighborhood potentials associated with the mixture model by identifying the single probability function of zero Kullback divergence. This new energy function is simply a weighted sum of graph Hamming distances. The second contribution is to locate matches by graduated assignment. Rather than solving the mean-field saddle-point equations, which are intractable for our nonquadratic energy function, we apply the soft-assign ansatz to the derivatives of our energy function. Here we introduce a novel departure from the standard graduated assignment formulation of graph matching by allowing the connection strengths of the data graph to update themselves. The aim is to provide a means by which the structure of the data graph can be updated so as to rectify structural errors. The method is evaluated experimentally and is shown to outperform its quadratic counterpart.

1 Introduction

Graph matching is a topic of pivotal importance (Mjolsness, Gindi, & Anandan, 1989; Gold & Rangarajan, 1996; Gold, Rangarajan, & Mjolsness, 1996; Sanfeliu & Fu, 1983; Cross & Hancock, 1995, 1997; Wilson & Hancock, 1995; Wilson, Evans, & Hancock, 1995) in pattern interpretation. The problem invariably arises whenever a pattern recognition task is given a relational abstraction. It was the classical work on structural pattern recognition of the 1980s that first identified the main technical difficulties associated with relational graph matching (Sanfeliu & Fu, 1983; Shapiro & Haralick, 1981). Although providing a conceptually attractive structural abstraction, relational graphs are notoriously fragile to noise and segmentation error. For this reason, practical graph matching must be realized by inexact means. Effective inexact graph matching is critically dependent on the availability of two computational ingredients. The first of these is an accurate
means of computing the distance between structurally corrupted relational descriptions (Shapiro & Haralick, 1985; Sanfeliu & Fu, 1983). Here several measures have been investigated, including the idea of defining an edit distance between graphs (Shapiro & Haralick, 1985) and the use of inter-graph entropy (Wong & You, 1985). The second ingredient is a means of controlling structural errors introduced by noise or segmentation error (Shapiro & Haralick, 1981). Classical solutions to this problem have included constraint filtering applied to the association graph (Barrow & Popplestone, 1971), labeling clutter with a null label (Boyer & Kak, 1988), and graph editing (Sanfeliu & Fu, 1983). The first of these two issues has recently stimulated considerable interest in the connectionist literature (Simic, 1991; Mjolsness et al., 1989; Gold & Rangarajan, 1996; Gold et al., 1996; Suganathan, Teoh, & Mital, 1995). Here the aim has been to deploy continuous optimization methods, such as the relatively heuristic graduated assignment (Gold & Rangarajan, 1996; Blake & Zisserman, 1987) or the more principled mean-field theory (Hoffmann & Buhmann, 1997; Yuille & Kosowsky, 1994; Rangarajan, Gold, & Mjolsness, 1996; Yuille, Stolorz, & Utans, 1994; Peterson & Soderberg, 1989), to update a set of assignment variables representing the matching process. Specifically, several authors (Simic, 1991; Mjolsness et al., 1989; Suganathan et al., 1995; Gold & Rangarajan, 1996; Gold et al., 1996) have addressed the issue of how to capture the relational matching problem using an energy function. However, the basic formulation invariably revolves around an energy function that is quadratic in the assignment variables. Performance has been demonstrated to be enhanced if additional, sometimes nonquadratic, terms are added. Examples include a node self-amplification term that encourages binary solutions (Gold & Rangarajan, 1996) and the addition of a logarithmic barrier entropy that convexifies the energy (Yuille & Kosowsky, 1994). Despite this effort expended in designing energy functionals with desirable global features, the second issue, of how to handle structural errors, has received little attention. The favored method is to introduce a null attractor, requiring additional energy terms that must be carefully controlled (Suganathan et al., 1995). A more elegant alternative is provided by the slack variables of Gold and Rangarajan (1996), which endow the null process with a degree of robustness. However, in both cases there is no mechanism for correcting structural errors due to spurious edge or node insertions. A recent addition to the literature that falls closer to the mark is the continuous embedding of Meila and Jordan (1997), which allows triangulated graphs to be moralized so as to produce a tree structure. In a recent series of papers we have developed a Bayesian framework for relational graph matching (Cross & Hancock, 1995, 1997; Wilson et al., 1995; Wilson & Hancock, 1995, 1997). The novelty resides in the fact that relational consistency is gauged by a probability distribution that uses Hamming distance to measure structural differences between the graphs under match. The probability distribution has a mixture structure that sums a series of
exponential functions of the Hamming distances between neighborhoods of the model and data graphs. From a practical perspective, this new framework has been used to match complex infrared (Cross & Hancock, 1995) and radar imagery (Wilson et al., 1995, 1997). Of particular relevance to the study reported here, it has also been used successfully to overcome structural errors using a graph-edit process (Wilson & Hancock, 1995, 1997) of the sort originally proposed by Sanfeliu and Fu (1983). The optimization of this relational consistency measure and the control of the associated graph-edit process have hitherto been confined to the use of discrete update procedures (Hancock & Kittler, 1993; Wilson et al., 1995; Cross & Hancock, 1995, 1997). Examples include discrete relaxation (Sanfeliu & Fu, 1983; Wilson et al., 1995), simulated annealing (Aarts & Korst, 1989; Geman & Geman, 1984; Kirkpatrick, Gelatt, & Vecchi, 1983; Cross & Hancock, 1995), and genetic search (Cross & Hancock, 1997). Our aim in this article is to consider how the process of updating assignment variables can be realized by continuous means (Gold & Rangarajan, 1996; Peterson & Soderberg, 1989; Yuille, 1994; Suganathan et al., 1995). Specifically, we consider how the matching process can be effected using the graduated assignment algorithm of Gold and Rangarajan (1996), which is based on the soft-max idea of Bridle (1990). We begin from the probability distribution for the assignment variables recently developed by Wilson and Hancock (1997). The resulting distribution is defined over a set of discrete assignment labels. The development of a graduation algorithm poses dual demands. The first of these revolves around identifying the energy function that encapsulates the mixture distribution for the discrete matching errors. The second task is to develop continuous update equations that can be used to soften the discrete assignment representation while allowing structural errors in the data graph to be identified. There are two novel outcomes of our study. First, we arrive at a new graph-matching energy. Here we replace the mixture distribution by a single probability function that corresponds to zero Kullback divergence. According to our analysis, the effective matching energy is in fact a weighted sum of Hamming distances between the model-graph and data-graph neighborhoods. The quantities of interest in iteratively updating the state of match are the derivatives of the global energy function with respect to the softened assignment variables. In the case of our new matching energy, these derivatives take on a particularly interesting form, which provides an intuitive insight into the minimization of the energy. The second novel contribution is to introduce a new update process that allows the edge set of the data graph to be updated with the aim of controlling structural corruption due to spurious edges or nodes. An experimental evaluation of the technique reveals not only that it is successful in matching noise-corrupted graphs, but that it significantly outperforms the optimization of the standard quadratic energy function. The optimization process underpinning our matching process is the
graduated assignment algorithm of Gold and Rangarajan (1996). The basic idea is to update assignment variables by exponentiating the derivatives of the energy function. Although this is a relatively heuristic optimization method, it has recently been formally demonstrated to exhibit desirable convergence properties (Rangarajan, Yuille, Gold, & Mjolsness, 1997). It is the nonquadratic nature of our energy function that draws us to graduated assignment. The more rigorous framework that underpins mean-field theory proves to be intractable for our energy function. The reason is that mean-field equations must be located by solving the set of saddle-point equations that minimize the Kullback divergence. This approach is most tractable when the underlying cost function is quadratic in the assignment variables, although Hoffmann and Buhmann (1997) have solved the mean-field equations in a nonquadratic case.

In section 2 we review how relational consistency can be measured by appealing to the concept of a label error process. Section 3 shows how the Bayesian consistency measure can be used to construct a global graph-matching energy. Issues of energy minimization are the subject of section 4. Here we describe how matches are iteratively updated using the soft-assign ansatz. Section 5 details the experimental evaluation of our matching process, focusing on the issue of noise sensitivity by investigating the systematic effects of parameter choice, structural error, and edge density. Finally, section 6 offers some conclusions.

2 Relational Consistency

Our overall goal here is to formulate a nonlinear optimization technique for matching relational graphs. We use the notation G = (V, E) to denote the graphs under match, where V is the set of nodes and E is the set of edges. Our aim in matching is to associate nodes in a graph GD = (VD, ED) representing data to be matched against those in a graph GM = (VM, EM) representing an available relational model. Formally, the matching is represented by a function f : VD → VM from the nodes in the data graph GD to those in the model graph GM. We capture the structure of the two graphs using a pair of connection matrices. The connection matrix for the data graph consists of the binary array

Dab = { 1 if (a, b) ∈ ED or a = b
        0 otherwise,   (2.1)

while that for the model graph is

Mαβ = { 1 if (α, β) ∈ EM or α = β
        0 otherwise.   (2.2)
Since the current state of match between the two graphs is represented by the function f : VD → VM , the statement f (a) = α means that the node a ∈ VD
is matched to the node α ∈ VM. The binary representation of the current state of match is captured by a set of assignment variables that convey the following meaning:

saα = { 1 if f(a) = α
        0 otherwise.   (2.3)

The basic goal of the matching process is to optimize a consistency measure that gauges the structural similarity of the matched data graph and the model graph. In a recent series of papers, we have shown how consistency of match can be modeled using a Bayesian framework (Wilson et al., 1995; Wilson & Hancock, 1995, 1997). The basic idea is to construct a probability distribution that models the effect of memoryless matching errors in generating departures from consistency between the data and model graphs. Suppose that Sα = α ∪ {β | (α, β) ∈ EM} represents the set of nodes that form the immediate contextual neighborhood of the node α in the model graph. The aim in gauging relational consistency is to measure the structural differences between the set of model-graph neighborhoods and their matched counterparts in the data graph. Further, suppose that Γa = f(a) ∪ {f(b) | (a, b) ∈ ED} represents the set of matches assigned to the contextual neighborhood of the node a ∈ VD of the data graph. Basic to Wilson and Hancock's modeling of relational consistency is to regard the complete set of model-graph relations as states over which to construct a mixture model for the corrupt matches observed on the data graph. In other words,

P(Γa) = Σ_{α∈VM} P(Γa | Sα) P(Sα).   (2.4)
The component densities appearing in this mixture distribution are modeled under the assumption that discrete matching errors are uniform and memoryless. Accordingly, the confusion probability P(Γa | Sα) can be factorized over the component matches,

P(Γa | Sα) = Π_{b∈Ca} P(f(b) | β).   (2.5)

The confusion probabilities are modeled under the assumption that the individual misassignment errors occur with a uniform probability Pe. As a result, the single-node match probabilities are distributed as follows:

P(f(b) | β) = { 1 − Pe  if f(b) = β
                Pe      if f(b) ≠ β.   (2.6)

Under this distribution rule, the mixture components acquire the following exponential form:

P(Γa | Sα) = Ka exp[−µ H(a, α)].   (2.7)
The exponential constant is related to the uniform error probability as follows:

µ = ln( (1 − Pe) / Pe ).   (2.8)
The distribution of misassignment errors is in fact measured by the Hamming distance H(a, α) between the current matching configuration Γa residing on the data graph and the configuration Sα from the model graph. In terms of our discrete representation of the matching process, the Hamming distance is computed using a series of Kronecker delta functions defined over the data-graph node matches and the assignments demanded by the model-graph configurations:

H(a, α) = Σ_{b∈Ca} (1 − δ_{f(b),β}).   (2.9)
According to our binary representation of the matching process, the distance measure is computed using the connectivity matrices and the assignment variables in the following manner:

H(a, α) = Σ_{b∈VD} Σ_{β∈VM} Mαβ Dab (1 − sbβ).   (2.10)
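Equation 2.10 is a pair of matrix contractions and can be transcribed directly; the following NumPy sketch (the array shapes and names are ours) computes H(a, α) for all node pairs at once.

import numpy as np

def hamming_distances(D, M, s):
    # H(a, alpha) of equation 2.10 for data-graph connection matrix D
    # (|V_D| x |V_D|), model-graph matrix M (|V_M| x |V_M|), and
    # assignment variables s (|V_D| x |V_M|)
    # H[a, alpha] = sum_b sum_beta D[a, b] * M[alpha, beta] * (1 - s[b, beta])
    return np.einsum('ab,xy,by->ax', D, M, 1.0 - s)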
With these ingredients, and under the assumption of a uniform prior for the model-graph constraints, that is, P(Sα) = 1/|VM|, the probability of the matched neighborhood Γa reduces to

P(Γa) = (Ka / |VM|) Σ_{α∈VM} exp[−µ H(a, α)].   (2.11)
The probability distribution P(Γa) may be regarded as providing a natural way of modeling departures from consistency at the neighborhood level. Matching consistency is graded by Hamming distance, and controlled hardening may be induced by reducing the label error probability Pe toward zero.

3 The Effective Potential for Discrete Relaxation

We would like to construct an energy function that can be used to establish continuous update equations for the assignment variables of our matching process. The route to computing the energy function is to identify the probability distribution q(Γa) that replaces the compound exponential mixture distribution appearing in equation 2.11 with a single exponential function. We locate this distribution by computing the Kullback-Leibler divergence
between the components of the mixture distribution, P(Γa | Sα), and the equivalent distribution function q(Γa) over the space of model-graph label configurations. Although the idea of finding mean-field equations that minimize the divergence between a target probability distribution and a parameterized version is central to some of the classical work on mean-field theory (Saul & Jordan, 1995; Yuille et al., 1994), our approach is somewhat different in philosophy. The classical approach is to find saddle-point equations that minimize the divergence. However, in the case of our compound exponential mixture distribution appearing in equation 2.11, this is not tractable in closed form. For this reason we commence by locating an equivalent energy that results in zero divergence. We then soften the assignment representation to obtain a continuously defined energy function. The assignment variables are updated by applying the soft-assign ansatz to the gradients of the resulting global energy function. We commence our development by considering the divergence between the mixture components P(Γa | Sα) of P(Γa) and the single effective distribution function q(Γa),

J(Γa) = Σ_{α∈VM} P(Γa | Sα) ln [ P(Γa | Sα) / q(Γa) ].   (3.1)
When the divergence vanishes, the distribution q(Γa) models the mixture distribution P(Γa) in the maximum entropy sense. This condition is satisfied when

ln q(Γa) = [ Σ_{α∈VM} P(Γa | Sα) ln P(Γa | Sα) ] / [ Σ_{α∈VM} P(Γa | Sα) ].   (3.2)
Substituting for P(Γa | Sα) from equation 2.7,

ln q(Γa) = ln Ka − µ [ Σ_{α∈VM} H(a, α) exp[−µ H(a, α)] ] / [ Σ_{α∈VM} exp[−µ H(a, α)] ].   (3.3)
The internal energy associated with the matching configuration Γa is given by

U(Γa) = −ln q(Γa).   (3.4)
Since the quantity ln Ka is an additive constant that does not depend on the assignment variables and µ is a global scaling constant, we omit them from
the definition of energy. We therefore focus on the quantity

U(Γa) = [ Σ_{α∈VM} H(a, α) exp[−µ H(a, α)] ] / [ Σ_{α∈VM} exp[−µ H(a, α)] ].   (3.5)
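In code, equation 3.5 is a softmax-weighted average of the Hamming distances. The sketch below is ours; the stabilizing shift of the exponents is an implementation choice that leaves U unchanged.

import numpy as np

def neighborhood_potential(H, mu):
    # effective potential U(Gamma_a) of equation 3.5: a weighted sum of
    # Hamming distances, with weights proportional to exp(-mu * H)
    w = np.exp(-mu * (H - H.min(axis=1, keepdims=True)))
    return (H * w).sum(axis=1) / w.sum(axis=1)

def total_energy(H, mu):
    return neighborhood_potential(H, mu).sum()   # equation 3.6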
In other words, the neighborhood potentials are simply weighted sums of Hamming distances between the data and model graphs. In fact, the potentials display an interesting barrier property. The potential is concentrated at Hamming distance H ≃ 1/µ. Both very large and very small Hamming distances contribute insignificantly to the energy function, that is, lim_{H→0} H exp[−µH] = 0 and lim_{H→∞} H exp[−µH] = 0. Finally, with the neighborhood matching potentials at hand, we construct a global internal matching energy E by summing the contributions over the nodes of the data graph, that is,
E = Σ_{a∈VD} U(Γa).   (3.6)
In passing, it is interesting to consider the relationship between our internal energy and the free energy of statistical physics. In statistical physics, it is the free energy that is the focus of attention. There are two physical versions of the free energy. The Helmholtz form is appropriate when there is a fixed number of states. When the volume of the system changes, the Gibbs form is appropriate. For our problem, the number of discrete states is fixed, so it is to the global Helmholtz free energy that we turn:

F = Σ_{a∈VD} U(Γa) + T Σ_{a∈VD} P(Γa) ln P(Γa).   (3.7)
From this expression it is clear that minimizing the free energy F is equivalent to minimizing the total internal energy E only when the computational temperature T is zero. However, at nonzero temperatures, minimizing free energy can also be viewed as minimizing internal energy subject to the maximum entropy constraint, where the computational temperature plays the role of a Lagrange multiplier. Quite informally, since our internal energies have been derived to minimize Kullback divergence, this additional constraint can be viewed as superfluous. At a more formal level, Hoffmann and Buhmann (1997) have invoked a result due to Peierls, which shows that the upper bound on the free energy is equal to the zero-point free energy plus the change in internal energy. Since the zero-point internal energy is zero in our case (recall that lim_{µ→∞} H exp[−µH] = 0), minimizing internal
energy corresponds to minimizing the upper bound on the free energy. The same conclusion can be reached by noting that our internal energy corresponds to zero divergence and appealing to the argument of Dayan, Hinton, Neal, and Zemel (1995). Here it is shown that the difference between the free energy and the internal energy is proportional to the Kullback divergence between the nonequilibrium and equilibrium probability distributions, that is, P(Γa) and q(Γa). In section 4, we therefore develop continuous update equations for the assignment variables by applying graduated assignment to the gradients of the total internal energy E. In the next section we will describe how optima of this essentially discrete global energy function may be located using continuous optimization (Gold & Rangarajan, 1996; Suganathan et al., 1995; Peterson & Soderberg, 1989; Yuille, 1994). However, before we develop our update equations, we pause to consider the relationship with the energy function proposed by Gold and Rangarajan (1996). This comparison is quite informal, but under conditions in which µ → ∞, that is, Pe → 0, we can identify the confusion probability P(Sα | Γa) with the assignment variable saα. Substituting for the Hamming distance H(a, α), we find
Eapprox = Σ_{a∈VD} Σ_{α∈VM} Σ_{b∈VD} Σ_{β∈VM} Dab Mαβ (1 − sbβ) saα.   (3.8)
We can rewrite this energy to make the relationship with Gold and Rangarajan’s (1996) energy function more transparent:
Eapprox = −Σ_{a∈VD} Σ_{α∈VM} Σ_{b∈VD} Σ_{β∈VM} Dab Mαβ sbβ saα + Σ_{a∈VD} Σ_{α∈VM} Σ_{b∈VD} Σ_{β∈VM} Dab Mαβ (saα − 1).   (3.9)
The first term is the familiar quadratic assignment energy. The second term is also common to Gold and Rangarajan. Moreover, in the case of our approximate energy function, there are no Lagrange multipliers that need to be determined. However, Gold and Rangarajan’s full energy contains two additional terms. The first of these is a node self-amplification term that was introduced as an empirical means for encouraging convergence to a binary solution. The second addition is the logarithmic entropy or barrier term originally introduced in Yuille and Kosowsky (1994). This term convexifies the energy function. There has recently been a formal proof of convergence for both the full soft-assign algorithm and a number of simplified variants. This proof is based on the positivity of the Kullback divergence (Rangarajan et al., 1997). In other words, although we offer no formal proof of binary convergence for our algorithm, this literature points to the fact that algorithms with the same limiting form of the energy function do possess this property (Yuille & Kosowsky, 1994; Rangarajan et al., 1997).
1882
Andrew M. Finch, Richard C. Wilson, and Edwin R. Hancock
4 Optimising the Global Cost Function

We are now in a position to develop a continuous update algorithm by softening the discrete ingredients of our graph-matching potential. The idea is to compute the derivatives of the global internal energy given in equation 3.6 and to effect the softening process using the soft-max idea (Bridle, 1990). Although this update process bears many of the hallmarks of mean-field annealing, it is couched in less rigorous terms. Mean-field equations are invariably derived from a quadratic cost function and are obtained by solving the set of saddle-point equations that minimize the Kullback-Leibler divergence with respect to the assignment variables. However, in the case of our nonquadratic cost function, the closed-form solution of the saddle-point equations is less tractable. For reasons of computational expediency, we therefore adopt the soft-assign idea to constrain the assignment variables to lie within the unit simplex. In practice we realize the updating of the matched version of the data graph using a two-stage update process. The first update mode involves softening the assignment variables, the saα that indicate the current state of match. The second update process aims to modify the connection strengths associated with the edges in the data graph. Data-graph nodes that are unmatchable become disjoint by virtue of having weak connection weights and cease to play any significant role in the update process. It is this second update process that provides the basic mechanism for controlling relational inexactness in our technique. From the implementational standpoint, it obviates the need for an explicit null label and the requisite energy terms in the cost function. In fact, this process can be viewed as an implicit way of performing the explicit graph-edit operations of the type originally suggested in Sanfeliu and Fu (1983) and recently used to match cluttered scenes by Wilson and Hancock (1995). However, we provide no proof of convergence. The dual update steps are heuristically inspired. Suffice it to say that experiments reveal that the algorithm appears to be both numerically stable and capable of controlling a significant population of clutter.

4.1 Updating Assignment Variables. The energy function represented by equations 3.5 and 3.6 is defined over the discrete matching variables saα. The basic idea underpinning this article is to realize a continuous process for updating the assignment variables. The step size is determined by computing the partial derivatives of the global matching energy with respect to the assignment variables. We commence by computing the derivatives of the contributing neighborhood energy,

∂U(Γa)/∂sbβ = Σ_{α∈VM} [ 1 − µ ( H(a, α) − U(Γa) ) ] ξaα ∂H(a, α)/∂sbβ,   (4.1)
where

ξaα = exp[−µ H(a, α)] / Σ_{α′∈VM} exp[−µ H(a, α′)].   (4.2)
To develop this result further, we must compute the derivatives of the Hamming distances. From the definition given in equation 2.10, it follows that

∂H(a, α)/∂sbβ = −Mαβ Dab.   (4.3)
It is now a straightforward matter to show that the derivative of the global matching energy is equal to

∂E/∂sbβ = −Σ_{a∈VD} Σ_{α∈VM} Dab Mαβ [ 1 − µ ( H(a, α) − U(Γa) ) ] ξaα.   (4.4)
We would like our continuous matching variables to remain constrained to lie within the range [0, 1]. For this reason, rather than using a linear update rule, we exploit the soft-max ansatz (Bridle, 1990):

saα ← exp[ −(1/T) ∂E/∂saα ] / Σ_{α′∈VM} exp[ −(1/T) ∂E/∂saα′ ].   (4.5)
The value of the temperature T in the update process has been controlled using a slow exponential annealing schedule of the form suggested by Gold and Rangarajan (1996). We initialize the matching algorithm with assignment variables computed using a gaussian error model. Details are application specific and are deferred until we outline our experiments in section 5. The mathematical structure of the update process is important and deserves further comment. The quantity ξaα defined in equation 4.2 naturally plays the role of a matching probability. The first term appearing under the square bracket in equation 4.4 can therefore be thought of as analogous to the optimal update direction for the standard quadratic cost function (Suganathan et al., 1995; Gold & Rangarajan, 1996); we will discuss this relationship in more detail in section 4.3. The second term modifies this principal update direction by taking into account the weighted fluctuations of the Hamming distance about the effective potential, or average Hamming distance. If the average fluctuation is zero, there is no net modification to the update direction. When the net fluctuation is nonzero, the direction of update is modified so as to compensate for the movement of the mean value of the effective potential. Moreover, since the fluctuation term is itself proportional to µ, it has an insignificant effect for Pe ≃ 1/2 but dominates the update process when Pe → 0.
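Putting equations 2.10, 4.2, 4.4, and 4.5 together gives the following sketch of one assignment-variable update. The numerically stabilized exponentials are our implementation choice; they do not alter the normalized results.

import numpy as np

def energy_gradient(D, M, s, mu):
    # dE/ds_{b,beta} of equation 4.4
    H = np.einsum('ab,xy,by->ax', D, M, 1.0 - s)          # equation 2.10
    w = np.exp(-mu * (H - H.min(axis=1, keepdims=True)))
    xi = w / w.sum(axis=1, keepdims=True)                 # equation 4.2
    U = (H * xi).sum(axis=1, keepdims=True)               # equation 3.5
    G = xi * (1.0 - mu * (H - U))                         # bracketed term
    # grad[b, beta] = -sum_a sum_alpha D[a, b] M[alpha, beta] G[a, alpha]
    return -np.einsum('ab,xy,ax->by', D, M, G)

def softassign_update(s, grad, T):
    # soft-max update of equation 4.5, normalized over the model nodes
    e = np.exp(-(grad - grad.min(axis=1, keepdims=True)) / T)
    return e / e.sum(axis=1, keepdims=True)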
4.2 Updating the Connection Weights. The updating of assignment variables described in the previous subsection can be used to rectify initial matching errors. However, in most realistic graph-matching problems, there are also structural errors due to the presence of contaminating noise or clutter. There are several ways in which clutter can be controlled. For instance, in a recent comparative study of discrete relaxation, we compared the relative efficacy of three alternative strategies for identifying structural errors (Wilson & Hancock, 1996b, 1997). These include the labeling of clutter using a null category (Wilson & Hancock, 1996a; Finch, Wilson, & Hancock, 1997), the explicit removal of clutter using iterative graph-edit operations (Sanfeliu & Fu, 1983), and, finally, the application of a constraint-filtering operation to remove clutter nodes from the final match (Barrow & Popplestone, 1971). The main conclusion of this study was that iterative graph editing outperforms null labeling and constraint filtering as a means of controlling structural corruption. When a continuous update process is used rather than a discrete one, the scope for clutter control is more limited. In fact, the process is usually effected by augmenting the model-graph nodes with a null attractor (Wilson & Hancock, 1996a; Finch et al., 1997). This is exactly the strategy adopted by Gold and Rangarajan (1996), who introduce slack variables into their soft-assign algorithm to accommodate unmatchable nodes. Here, our approach is different in motivation. Based on our experiences in the discrete graph-matching domain, we opt to develop a continuous counterpart of the graph-edit process. The basic idea is to update the elements of the data-graph connection matrix and hence provide a continuous mechanism for controlling clutter and relational inexactness. By weakening connections in the data graph, the matches of extraneous nodes become disjoint. To develop our continuous graph-edit process, we note that the soft-assign ansatz outlined in the previous section can be applied to update the elements of the connectivity matrix. In order to proceed, we will require the derivative of the Hamming distance with respect to the relevant element of the connection matrix, that is,

∂H(a, α)/∂Dab = Σ_{β∈VM} Mαβ (1 − sbβ).   (4.6)
As a result, the gradient of the global matching energy with respect to the elements of the data-graph connection matrix is equal to

∂E/∂Dab = Σ_{α∈VM} Σ_{β∈VM} Mαβ (1 − sbβ) [ 1 − µ ( H(a, α) − U(Γa) ) ] ξaα.   (4.7)
The soft-assign update equation for the elements of the data graph
connection matrix is

Dab ← exp[ −(1/T) ∂E/∂Dab ] / Σ_{(a′,b′)∈ED} exp[ −(1/T) ∂E/∂Da′b′ ].   (4.8)
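The corresponding sketch for the continuous graph-edit step follows equations 4.6 to 4.8; representing the edge set ED as a boolean mask is our bookkeeping assumption.

import numpy as np

def connection_gradient(D, M, s, mu):
    # dE/dD_{a,b} of equation 4.7, built from equation 4.6
    H = np.einsum('ab,xy,by->ax', D, M, 1.0 - s)
    w = np.exp(-mu * (H - H.min(axis=1, keepdims=True)))
    xi = w / w.sum(axis=1, keepdims=True)
    U = (H * xi).sum(axis=1, keepdims=True)
    G = xi * (1.0 - mu * (H - U))
    # grad[a, b] = sum_alpha sum_beta M[alpha, beta] (1 - s[b, beta]) G[a, alpha]
    return np.einsum('xy,by,ax->ab', M, 1.0 - s, G)

def connection_update(D, grad, T, edges):
    # soft-assign update of equation 4.8, normalized over the edge set;
    # edges is a boolean mask over the data-graph edges
    e = np.exp(-(grad - grad[edges].min()) / T)
    D_new = np.zeros_like(D)
    D_new[edges] = e[edges] / e[edges].sum()
    return D_new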
From equation 4.8 it is interesting to note that the derivatives of the matching energy are zero when the entire neighborhood of the node a is consistently matched onto the model graph.

4.3 Quadratic Assignment Problem. Before we proceed to experiment with the new graph-matching process, we briefly review the standard quadratic formulation of the matching problem investigated by Simic (1991), Suganathan et al. (1995), and Gold and Rangarajan (1996). The common feature of these algorithms is to commence from the quadratic cost function,
EH = −(1/2) Σ_{a∈VD} Σ_{α∈VM} Σ_{b∈VD} Σ_{β∈VM} Dab Mαβ saα sbβ.   (4.9)
In this case the derivative of the global cost function is linear in the assignment variables, that is,

∂EH/∂sbβ = −(1/2) Σ_{a∈VD} Σ_{α∈VM} Dab Mαβ saα.   (4.10)
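For comparison, the quadratic step size of equation 4.10 is a single contraction; the minus sign and the factor 1/2 carry over directly in this sketch of ours.

import numpy as np

def quadratic_gradient(D, M, s):
    # dE_H/ds_{b,beta} of equation 4.10, the linear step size that the
    # weighted-Hamming-distance gradient (equation 4.4) reduces to when mu = 0
    return -0.5 * np.einsum('ab,xy,ax->by', D, M, s)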
This step size is equivalent to that appearing in equation 4.4 provided that µ = 0, that is, Pe → 1/2. The update is realized by applying the soft-assign ansatz of equation 4.5. When the quantity ξaα defined in equation 4.2 is replaced by the assignment variable saα and we take the limit µ = 0, that is, Pe = 1/2, in equation 4.4, the soft-assign update reduces to this standard quadratic form (Peterson & Soderberg, 1989). In the next section we will provide some experimental comparison with the resulting matching process. However, it is important to stress that the update process adopted here is very simplistic and leaves considerable scope for further refinement. For instance, in Gold and Rangarajan (1996), Sinkhorn matrices have been used to impose a permutation structure on the final solution. We could clearly apply the same apparatus to our matching process. It is also important to stress that null matches are handled differently. Rather than adjusting the data-graph connection weights, Gold and Rangarajan control noise by introducing slack variables into the rows and columns of their match matrix. Finally, as we have already pointed out in section 3, the basic quadratic assignment energy is augmented with various additional terms. Examples include barrier potentials and self-amplification terms (Gold & Rangarajan, 1996).
1886
Andrew M. Finch, Richard C. Wilson, and Edwin R. Hancock
5 Experiments

Our aims in this section are twofold. First, we investigate the sensitivity of our matching process. We focus on three aspects of the matching process: sensitivity to noise and structural error, the effect of edge density, and the best choice of the operating parameter µ. The second goal is to offer some comparative evaluation of the matching scheme. We investigate the relative noise sensitivity when we apply the soft-assign algorithm to both the weighted Hamming distance energy and the quadratic assignment energy. Our evaluation of the graph-matching process is based on synthetic data. Specifically, we study randomly connected graphs of various edge densities.

5.1 Randomly Connected Graphs. Our first set of experiments is concerned with matching randomly connected graphs. For a node set of predetermined size, we generate a random set of connections. The number of connections is controlled to give a predefined value of the average edge density per node. The resulting structures are used as the model graphs in our experiments. In order to generate data graphs, we introduce structural errors. These errors are introduced by randomly deleting a predetermined fraction of nodes from the model graphs. If a node is deleted, so are its connecting edges. We generated initialization errors by randomly misassigning a predetermined fraction of the initial winner-take-all matches between the data and model graphs. These initialization errors are simulated as follows. Associated with each node in the graph is a uniformly distributed random unary attribute; for the node a ∈ V_D, we denote the variable by x̄_a. This unary attribute is used to compute an initial assignment variable. Uncertainties in the measurements are simulated by randomly perturbing the unary attributes of the data graph with a gaussian error distribution so as to produce a specified fraction of initial matching errors. The initial assignment variables are computed from exponential distributions of the distance between attribute pairs using an estimate of the measurement noise variance σ, that is,

    s_aα = exp[−(x̄_a − x̄_α)²/2σ²] / Σ_{α′ ∈ V_M} exp[−(x̄_a − x̄_α′)²/2σ²].   (5.1)
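In code, the initialization of equation 5.1 amounts to a row-normalized gaussian kernel on the unary attributes. The sketch below is a minimal illustration; the variable names, the perturbation scale, and the choice of σ are our own assumptions and are not fixed by the text.

    import numpy as np

    def initial_assignments(x_data, x_model, sigma):
        """Equation 5.1: initial assignment variables from a gaussian
        kernel on attribute differences, normalized over model nodes."""
        d2 = (x_data[:, None] - x_model[None, :]) ** 2
        w = np.exp(-d2 / (2.0 * sigma ** 2))
        return w / w.sum(axis=1, keepdims=True)  # each row sums to one

    # Hypothetical usage: 50-node graphs with perturbed unary attributes.
    rng = np.random.default_rng(0)
    x_model = rng.uniform(size=50)
    x_data = x_model + rng.normal(scale=0.1, size=50)
    s = initial_assignments(x_data, x_model, sigma=0.1)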
Our default algorithm settings are as follows. For the majority of our experiments, the fraction of initial winner-take-all errors is 50%. The graphs used in our study have random connections with an average of six edges per node. The graphs are nonplanar. We operate with the parameter µ set to 2.0; the motivation for this choice will be provided in section 5.2. The temperature of the soft-assign update is annealed from an initial setting of 10. The schedule for annealing the temperature between the epochs n and
n + 1 of the algorithm is of the form T(n+1) = 0.95 T(n). This is somewhat slower than that employed by Gold and Rangarajan (1996), who reduce the temperature by a factor of 0.925 with each iteration. There are four updates at each temperature epoch. Typically, we reach a stable winner-take-all matching configuration after 50 iterations. In other words, the final temperature is 0.77.

5.2 Sensitivity Study. In this subsection we study the sensitivity systematics of our matching process. To commence, Figure 1 illustrates the effect of edge density on the matching process. The figure shows the fraction of correctly matched nodes as a function of the fraction of nodes deleted from the model graph. The different curves are for average edge densities of two, four, six, and eight edges per node. In each case, the graphs contain 50 nodes, and the fraction of initialization errors is 50%. Each data point is averaged over 100 random experiments. The main point to be drawn from these curves is that the accuracy of match increases with edge density. When the fraction of deleted nodes is less than 0.3, there is little systematic difference between the matching fractions. It is only when the fraction of structural errors exceeds 0.5 that the effect of edge density becomes marked. This is in accord with Gold and Rangarajan's (1996) experience with graduated assignment, where it was found that performance improves with increasing edge density. The curves shown in Figure 1 were obtained under conditions in which the fraction of initialization errors is 50%. Our next sequence of experiments is aimed at illustrating the effects of initialization error on the matching process. Here we have used random graphs in which there are six edges per node. Figure 2 shows the final fraction of correct matches as a function of the fraction of initialization errors. The straight line is the break-even case in which there is no net improvement in the matching process. The curves show the net improvements obtained when the fractions of structural errors are 5%, 10%, and 20%. Clearly, as the fraction of structural errors increases, so the ability to recover from initialization error degrades. For the highest level of structural corruption, the iterative improvements are marginal. However, there are only a few cases in which the data points dip below the break-even line. In other words, the updating process rarely disrupts the configuration of correctly assigned labels in the initial match. The parameter µ controls the balance between the two terms in the step size for assignment variable update. As pointed out in section 4, the first term moves the assignment variables in the direction of maximum neighborhood support. This term can be thought of as ensuring maximum structural congruency between the matched graphs. The second term modifies the direction of update to minimize the Hamming distance fluctuations. Figure 3 shows a plot of the fraction of correct matches as a function of µ. Here there are no structural errors. The only source of error is initialization error, which is set at 50%. The plot shows an interesting threshold structure in
Figure 1: Effect of edge density on matching performance. The graph shows the final fraction of correct matches as a function of the fractional structural corruption in the graphs for different edge densities. The solid, dotted, dot-dashed, and dashed curves respectively correspond to two, four, six, and eight edges per node.
the matching errors. When µ < 1, the matching performance is very poor. When µ > 1, the matching process is essentially insensitive to parameter choice. It is also important to note that the performance falls off rapidly when µ = 0. This corresponds to the case when our matching algorithm implements the standard Potts glass realization of the quadratic assignment problem. To investigate the role of the control parameter in more detail, we studied the noise sensitivity of the matching algorithm for various choices of µ. The curves in Figure 4 show the noise sensitivity of the matching process when µ = 0.5 and µ = 2.0. These two values span the onset threshold exhibited in Figure 3. Here the noise is structural, and we commence with 50% initialization error. The main point to note from these curves is that the higher value of µ gives superior performance at low levels of structural corruption. The lower value of µ gives relatively poor performance at low levels of structural corruption, but performs better when there is significant corruption. The main empirical conclusion to be drawn from these observations is as follows. The Hamming-distance fluctuation term appears to play an important role in controlling the matching process when initializa-
Figure 2: Effect of initialization error. The plot shows the fraction of correct matches in the final match as a function of the fraction of initially correct matches. The different curves are for varying degrees of structural corruption.
tion error dominates. The neighborhood support term, on the other hand, is more important when structural errors dominate. 5.3 Algorithm Comparison. The main conclusion of the sensitivity study reported in the previous subsection was that the matching process offered optimal performance when the graphs have an average density of six edges per node. Now we offer a comparative evaluation of the matching process using synthetic Delaunay graphs. We selected this structure since it is typical of the neighborhood graphs widely exploited in computer vision problems. Importantly, it also has a density of approximately six edges per node. However, it differs from a randomly connected graph in one critical respect. In a randomly connected graph, unless the edge density is very high, there are few first-order cycles. By contrast, the Delaunay graph is composed entirely of first-order triangular faces. In other words, every node in a Delaunay graph belongs to at least one first-order cycle. Since the graph triangulates the nodes, the average node is shared by six first-order faces. We commence our construction of the synthetic Delaunay graphs by generating random dot patterns. Each random dot is used to seed a Voronoi cell. The Delaunay triangulation is the region adjacency graph for the Voronoi cells. Structural corruption is simulated by deleting and adding controlled
Figure 3: Sensitivity to the parameter µ. The plot shows the fraction of correct matches as a function of the value of µ.
numbers of random dots. For a series of different corruption levels, we have generated a sample of 100 random graphs, each containing 50 nodes. The corruption process is arranged so as to maintain the overall size of the graphs. For each graph, we measure the quality of match by computing the fraction of the surviving nodes for which the assignment variables indicate the correct match. We have compared the results obtained by applying the soft-assign process to the weighted Hamming distance energy and the standard quadratic assignment energy. Figure 5 shows the final fraction of correct matches for each of the algorithms. The data curves show the correct matching fraction averaged over the graph samples as a function of the corruption fraction. The main conclusion that can be drawn from these plots is that the new matching technique described in this article significantly outperforms its conventional quadratic counterpart (described in section 4.3). The main difference between the two techniques resides in the fact that our new method relies on updating with derivatives of the energy function that are nonlinear in the assignment variables. Finally, we compare the results obtained using our matching method with those reported by Gold and Rangarajan (1996). First, it is important to note that Gold and Rangarajan's analysis proceeds only to noise levels of 20%. Several algorithm variants are reported. However, optimal performance is
Figure 4: Effect of varying the parameter µ. The plot shows the effect of the parameter µ on the noise sensitivity curves. The solid curve is for µ = 1.0, while the dotted curve is for µ = 2.0.
Figure 5: Experimental comparison: softened discrete relaxation (dotted curve); matching using the quadratic cost function (solid curve).
achieved with attributed relational graph matching. This method augments the quadratic assignment energy with a binary attribute difference term. When three binary attributes per edge are used, the performance at the 20% noise level is comparable to that obtained with our algorithm. However, this version of Gold and Rangarajan's algorithm contrasts with our own in that it draws very heavily on attribute structure. After the initialization step, our algorithm relies purely on graph structure.

6 Conclusions

Our main contribution here has been to develop a new graduation algorithm for relational graph matching. There are two novel contributions. The first is the development of a new energy function for graph matching. The second is to update the connection structure of the data graph with the aim of controlling structural corruption. The development of the energy function commences from the Bayesian consistency measure developed in Wilson et al. (1995). We show how the discrete assignment representation underpinning the consistency can be softened. Specifically, we use the apparatus of statistical physics to compute a graph-matching energy. Formally, the energy is a weighted sum of Hamming distances between the graphs being matched. The gradient of the energy function has an interesting structure. In particular, the step size for assignment variable update has a two-component structure. The first component is in the direction of maximum local gradient. The second term retains memory of the structure of the energy landscape. In particular, it modifies the direction of update so as to ensure that local fluctuations in Hamming distance do not become excessively large. In the second novel contribution, we address the problem of rectifying structural errors by developing a graduation step for modifying the connection structure of the data graph. The idea is to update the elements of the data graph connection matrix so that clutter nodes can become disjoint if they persist in failing to find a consistent match. The operating limits of the matching algorithm have been evaluated on simulation data. Here we show the algorithm to be effective under severe structural error. The net effect of increasing structural error is to limit the extent to which the algorithm can recover from initialization errors. Finally, we have shown how the method relates to the standard quadratic assignment algorithm extensively studied in the connectionist literature (Gold & Rangarajan, 1996; Simic, 1991; Suganathan et al., 1995). Moreover, an experimental comparison reveals that the method offers superior performance in terms of noise control. There are clearly a number of shortcomings in the reported work. In the first instance, we fall well short of developing rigorous mean-field equations (Hoffmann & Buhmann, 1997). There is also no formal proof of convergence under the dual update operations. Nonetheless, the experimentation
demonstrates that the conceptual framework offers encouraging potential in the graph-matching domain. Suffice it to say that the ideas presented here represent the first steps. We are considering the application of statistical physics to the energy function developed in this article, and the conclusions will be reported in due course.

References

Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. New York: Wiley.
Barrow, H. G., & Popplestone, R. J. (1971). Relational descriptions in picture processing. Machine Intelligence, 6.
Blake, A., & Zisserman, A. (1987). Visual reconstruction. Cambridge, MA: MIT Press.
Boyer, K., & Kak, A. (1988). Structural stereopsis for three-dimensional vision. IEEE Pattern Analysis and Machine Intelligence, 10, 144–166.
Bridle, J. S. (1990). Training stochastic model recognition algorithms can lead to maximum mutual information estimation of parameters. In D. S. Touretzky (Ed.), Advances in neural information processing systems (pp. 211–217). San Mateo, CA: Morgan Kaufmann.
Cross, A. D. J., & Hancock, E. R. (1995). Relational matching with stochastic optimisation. IEEE Computer Society International Symposium on Computer Vision (pp. 365–370).
Cross, A. D. J., & Hancock, E. R. (1997). Inexact graph matching using genetic search. Pattern Recognition, 30, 953–970.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Finch, A. M., Wilson, R. C., & Hancock, E. R. (1997). Matching Delaunay graphs. Pattern Recognition, 30, 123–140.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and Bayesian restoration of images. IEEE Pattern Analysis and Machine Intelligence, 6, 721–741.
Gold, S., & Rangarajan, A. (1996). A graduated assignment algorithm for graph matching. IEEE Pattern Analysis and Machine Intelligence, 18, 377–388.
Gold, S., Rangarajan, A., & Mjolsness, E. (1996). Learning with pre-knowledge: Clustering with point and graph-matching distance measures. Neural Computation, 8, 787–804.
Hancock, E. R., & Kittler, J. (1993). Discrete relaxation. Pattern Recognition, 23, 711–733.
Hoffmann, T., & Buhmann, J. M. (1997). Pairwise data clustering with deterministic annealing. IEEE Pattern Analysis and Machine Intelligence, 19, 1–14.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimisation by simulated annealing. Science, 220, 671–680.
Meila, M., & Jordan, M. (1997). Optimal triangulation with continuous cost functions. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 557–563). Cambridge, MA: MIT Press.
Mjolsness, E., Gindi, G., & Anandan, P. (1989). Optimisation in model matching and perceptual organisation. Neural Computation, 1, 218–229.
Peterson, C., & Soderberg, B. (1989). A new method for mapping optimisation problems. International Journal of Neural Systems, 1, 3–22.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041–1060.
Rangarajan, A., Yuille, A., Gold, S., & Mjolsness, E. (1997). A convergence proof for the soft-assign quadratic assignment algorithm. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 620–626). Cambridge, MA: MIT Press.
Sanfeliu, A., & Fu, K. S. (1983). A distance measure between attributed relational graphs. IEEE Systems, Man, and Cybernetics, 13, 353–362.
Saul, L. K., & Jordan, M. I. (1995). Exploiting tractable substructures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 476–492). Cambridge, MA: MIT Press.
Shapiro, L., & Haralick, R. M. (1981). Structural description and inexact matching. IEEE Pattern Analysis and Machine Intelligence, 3, 504–519.
Shapiro, L., & Haralick, R. M. (1985). A metric for comparing relational descriptions. IEEE Pattern Analysis and Machine Intelligence, 7, 90–94.
Simic, P. (1991). Constrained nets for graph matching and other quadratic assignment problems. Neural Computation, 3, 268–281.
Suganathan, P. N., Teoh, E. K., & Mital, D. P. (1995). Pattern recognition by graph matching using Potts MFT networks. Pattern Recognition, 28, 997–1009.
Wilson, R. C., Evans, A. N., & Hancock, E. R. (1995). Relational matching by discrete relaxation. Image and Vision Computing, 13, 411–421.
Wilson, R. C., & Hancock, E. R. (1995). Relational matching by dynamic graph structures. Proceedings of the Fifth International Conference on Computer Vision (pp. 450–456).
Wilson, R. C., & Hancock, E. R. (1996a). A Bayesian compatibility model for graph matching. Pattern Recognition Letters, 17, 263–276.
Wilson, R. C., & Hancock, E. R. (1996b). Gauging relational consistency and rectifying structural errors. IEEE Computer Society Computer Vision and Pattern Recognition Conference (pp. 47–54).
Wilson, R. C., & Hancock, E. R. (1997). Structural matching by discrete relaxation. IEEE Pattern Analysis and Machine Intelligence, 19, 634–648.
Wong, A. K. C., & You, M. (1985). Entropy and distance of random graphs with application to structural pattern recognition. IEEE Pattern Analysis and Machine Intelligence, 7, 599–609.
Yuille, A. L. (1994). Generalised deformable models, statistical physics and matching problems. Neural Computation, 2, 1–24.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.
Yuille, A. L., Stolorz, P., & Utans, J. (1994). Statistical physics, mixtures of distributions, and the EM algorithm. Neural Computation, 6, 334–340.

Received March 31, 1997; accepted January 6, 1998.
LETTER
Communicated by Leo Breiman
Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms Thomas G. Dietterich Department of Computer Science, Oregon State University, Corvallis, OR 97331, U.S.A.
This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 × 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 × 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.

1 Introduction

In the research, development, and application of machine learning algorithms for classification tasks, many questions arise for which statistical methods are needed. The purpose of this article is to investigate one of these questions, demonstrate that existing statistical methods are inadequate for this question, and propose a new statistical test that shows acceptable performance in initial experiments. To understand the question raised in this article, it is helpful to consider a taxonomy of the different kinds of statistical questions that arise in machine learning. Figure 1 gives a taxonomy of nine statistical questions.

© 1998 Massachusetts Institute of Technology, Neural Computation 10, 1895–1923 (1998)
Let us begin at the root of the tree. The first issue to consider is whether we are studying only a single application domain or multiple domains. In most applied research, there is a single domain of interest, and the goal is to find the best classifier or the best learning algorithm to apply in that domain. However, a fundamental goal of research in machine learning is to find learning algorithms that work well over a wide range of application domains. We will return to this issue; for the moment, let us consider the single-domain case. Within a single domain, there are two different sets of questions, depending on whether we are analyzing classifiers or algorithms. A classifier is a function that, given an input example, assigns that example to one of K classes. A learning algorithm is a function that, given a set of examples and their classes, constructs a classifier. In a particular application setting, our primary goal is usually to find the best classifier and estimate its accuracy with future examples. Suppose we are working for a medical instrumentation company and wish to manufacture and sell an instrument for classifying blood cells. At the time we are designing the instrument, we could gather a large collection of blood cells and have a human expert classify each cell. We could then apply a learning algorithm to produce a classifier from this set of classified cells. The classifier would be implemented in the instrument and sold. We want our instrument to contain the most accurate classifier we can find. There are some applications, however, where we must select the best learning algorithm rather than find the best classifier. For example, suppose we want to sell an e-mail system that learns to recognize and filter junk mail. Whenever the user receives an e-mail message that he considers junk mail, he will flag that message. Periodically, a learning algorithm included in the program will analyze the accumulated examples of junk and nonjunk e-mail and update its filtering rules. Our job is to determine which learning algorithm to include in the program. The next level of the taxonomy distinguishes between two fundamental tasks: estimating accuracy and choosing between classifiers (or algorithms). When we market our blood cell diagnosis system, we would like to make a claim about its accuracy. How can we measure this accuracy? And, of course, when we design the system, we want to choose the best classifier from some set of available classifiers. The lowest level of the taxonomy concerns the amount of data available. If we have a large amount of data, then we can set some of them aside to serve as a test set for evaluating classifiers. Much simpler statistical methods can be applied in this case. However, in most situations, the amount of data is limited, and we need to use all we have as input to our learning algorithms. This means that we must use some form of resampling (i.e., cross-validation or the bootstrap) to perform the statistical analysis. Now that we have reviewed the general structure of the taxonomy, let’s consider the nine statistical questions. We assume that all data points (exam-
Figure 1: A taxonomy of statistical questions in machine learning. The boxed node (Question 8) is the subject of this article.
ples) are drawn independently from a fixed probability distribution defined by the particular application problem. Question 1: Suppose we are given a large sample of data and a classifier C. The classifier C may have been constructed using part of the data, but there are enough data remaining for a separate test set. Hence, we can measure the accuracy of C on the test set and construct a binomial confidence interval (Snedecor & Cochran, 1989; Efron & Tibshirani, 1993; Kohavi, 1995). Note that in Question 1, the classifier could have been produced by any method (e.g., interviewing an expert); it need not have been produced by a learning algorithm. Question 2: Given a small data set, S, suppose we apply learning algorithm A to S to construct classifier CA . How accurately will CA classify new examples? Because we have no separate test set, there is no direct way to answer this question. A frequently applied strategy is to convert this question into Question 6: Can we predict the accuracy of algorithm A when it is trained on randomly selected data sets of (approximately) the same size as S? If so, then we can predict the accuracy of CA , which was obtained from training on S. Question 3: Given two classifiers CA and CB and enough data for a sep-
arate test set, determine which classifier will be more accurate on new test examples. This question can be answered by measuring the accuracy of each classifier on the separate test set and applying McNemar's test, which will be described below. Question 4: Given two classifiers, CA and CB, produced by feeding a small data set S to two learning algorithms, A and B, which classifier will be more accurate in classifying new examples? Again, because we have no separate set of test data, we cannot answer this question directly. Some researchers have taken the approach of converting this problem into a question about learning algorithms (Question 8). If we can determine which algorithm usually produces more accurate classifiers (when trained on data sets of approximately the same size), then we can select the classifier (CA or CB) created by that algorithm. Question 5: Given a learning algorithm A and a large data set S, what is the accuracy of the classifiers produced by A when trained on new training sets of a specified size m? This question has not received much attention in the literature. One approach, advocated by the DELVE project (Hinton, Neal, Tibshirani, & DELVE team members, 1995; Rasmussen, 1996), is to subdivide S into a test set and several disjoint training sets of size m. Then A is trained on each of the training sets, and the resulting classifiers are tested on the test set. The average performance on the test set estimates the accuracy of new runs. Question 6: Given a learning algorithm A and a small data set S, what is the accuracy of the classifiers produced by A when A is trained on new training sets of the same size as S? Kohavi (1995) shows that stratified 10-fold cross-validation produces fairly good estimates in this case. Note that in any resampling approach, we cannot train A on training sets of exactly the same size as S. Instead, we train on data sets that have slightly fewer examples (e.g., 90% of the size of S in 10-fold cross-validation) and rely on the assumption that the performance of learning algorithms changes smoothly with changes in the size of the training data. This assumption can be checked experimentally (by performing additional cross-validation studies) with even smaller training sets, but it cannot be checked directly for training sets of the size of S. Results on the shape of learning curves show that in some cases, this smoothness assumption will be violated (Haussler, Kearns, Seung, & Tishby, 1994). Nonetheless, it is observed to hold experimentally in most applications. Question 7: Given two learning algorithms A and B and a large data set S, which algorithm will produce more accurate classifiers when trained on data sets of a specified size m? This question has not received much attention, although the DELVE team has studied this question for regression problems. They divide S into several disjoint training sets and a single test set. Each algorithm is trained on each training set, and all resulting classifiers are tested on the test set. An analysis of variance can then be performed that includes terms for the choice of learning algorithm, the choice of the training
set, and each individual test example. The Quasi-F test (Lindman, 1992) is applied to determine whether the effect due to the choice of learning algorithms is significantly nonzero. Question 8: Given two learning algorithms A and B and a small data set S, which algorithm will produce more accurate classifiers when trained on data sets of the same size as S? The purpose of this article is to describe and compare several statistical tests for answering this question. Because S is small, it will be necessary to use holdout and resampling methods. As mentioned regarding Question 6, this means that we cannot answer this question exactly without making the assumption that the performance of the two learning algorithms changes smoothly with changes in the size of the training set. Specifically, we will need to assume that the relative difference in performance of the two algorithms changes slowly with changes in the size of the training set. Question 9: Given two learning algorithms A and B and data sets from several domains, which algorithm will produce more accurate classifiers when trained on examples from new domains? This is perhaps the most fundamental and difficult question in machine learning. Some researchers have applied a simple sign test (or the Wilcoxon signed-ranks test) to try to answer this question, based on single runs or cross-validation-based estimates, but these tests do not take into account the uncertainty of the individual comparisons. Effectively, we want to combine the results from several answers to Question 8, where each answer has an associated uncertainty. This is an important question for future research. Questions 7, 8, and 9 are the most important for experimental research on learning algorithms. When someone develops a new learning algorithm (or a modification to an existing algorithm), answers to these questions can determine whether the new algorithm is better than existing algorithms. Unfortunately, many data sets used in experimental research are too small to allow posing Question 7. Hence, this article focuses on developing good statistical tests for Question 8. We define and compare five statistical tests for this question. Before proceeding with the derivation of these statistical tests, it is worth noting that each of the questions posed can be extended beyond classification algorithms and misclassification rates. For example, in many decision-making settings, it is important to estimate the conditional probability that a new example belongs to each of the K classes. One measure of the accuracy of probability estimates is the log loss; Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected log loss of a classifier or an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which classifier or algorithm has the smaller log loss. We are unaware of any statistical research specifically addressing these questions in the case of log loss, however. In many neural network applications, the task is to predict a continuous response variable. In these problems, the squared error is usually the nat-
ural loss function, and Questions 1, 2, 5, and 6 can be rephrased in terms of determining the expected mean squared error of a predictor or of an algorithm. Similarly, Questions 3, 4, 7, and 8 can be rephrased in terms of determining which predictor or algorithm has the smaller mean squared error. Question 1 can be addressed by constructing a confidence interval based on the normal or t distribution (depending on the size of the test set). Question 3 can be addressed by constructing a confidence interval for the expected difference. The DELVE project has developed analysis-of-variance techniques for Questions 5 and 7. Appropriate statistical tests for the small-sample questions (2, 4, 6, and 8) are still not well established. The statistical tests for regression methods may suggest ways of designing statistical tests for the log loss case, an important area for future research. To design and evaluate statistical tests, the first step is to identify the sources of variation that must be controlled by each test. For the case we are considering, there are four important sources of variation. First is the random variation in the selection of the test data used to evaluate the learning algorithms. On any particular randomly drawn test data set, one classifier may outperform another even though on the whole population, the two classifiers would perform identically. This is a particularly pressing problem for small test data sets. The second source of random variation results from the selection of the training data. On any particular randomly drawn training set, one algorithm may outperform another even though, on the average, the two algorithms have the same accuracy. Even small changes to the training set (such as adding or deleting a few data points) may cause large changes in the classifier produced by a learning algorithm. Breiman (1994, 1996) has called this behavior "instability," and he has shown that this is a serious problem for the decision tree algorithms, such as CART (Breiman, Friedman, Olshen, & Stone, 1984). A third source of variance can be internal randomness in the learning algorithm. Consider, for example, the widely used backpropagation algorithm for training feedforward neural networks. This algorithm is usually initialized with a set of random weights, which it then improves. The resulting learned network depends critically on the random starting state (Kolen & Pollack, 1991). In this case, even if the training data are not changed, the algorithm is likely to produce a different hypothesis if it is executed again from a different random starting state. The last source of random variation that must be handled by statistical tests is random classification error. If a fixed fraction η of the test data points is randomly mislabeled, then no learning algorithm can achieve an error rate of less than η. A good statistical test should not be fooled by these sources of variation. The test should conclude that the two algorithms are different if and only if their percentage of correct classifications would be different, on the average,
when trained on a training set of a given fixed size and tested on all data points in the population. To accomplish this, a statistical testing procedure must account for these sources of variation. To account for test data variation and the possibility of random classification error, the statistical procedure must consider the size of the test set and the consequences of changes in it. To account for training data variation and internal randomness, the statistical procedure must execute the learning algorithm multiple times and measure the variation in accuracy of the resulting classifiers. This article begins by describing five statistical tests bearing on Question 8: McNemar's test, a test for the difference of two proportions, the resampled t test, the cross-validated t test, and a new test called the 5 × 2 cv test. The article then describes a simulation study that seeks to measure the probability that each test will incorrectly detect a difference when no difference exists (type I error). The results of the simulation study show that only McNemar's test, the cross-validated t test, and the 5 × 2 cv test have acceptable type I error. The type I error of the resampled t test is very bad and the test is very expensive computationally, so we do not consider it further. The type I error of the difference-of-proportions test is unacceptable in some cases, but it is very cheap to evaluate, so we retained it for further study. The simulation study is somewhat idealized and does not address all aspects of training data variation. To obtain a more realistic evaluation of the four remaining tests, we conducted a set of experiments using real learning algorithms on realistic data sets. We measured both the type I error and the power of the tests. The results show that the cross-validated t test has consistently elevated type I error. The difference-of-proportions test has acceptable type I error, but low power. Both of the remaining two tests have good type I error and reasonable power. The 5 × 2 cv test is slightly more powerful than McNemar's test, but also 10 times more expensive to perform. Hence, we conclude that the 5 × 2 cv test is the test of choice for inexpensive learning algorithms but that McNemar's test is better for more expensive algorithms.

2 Formal Preliminaries

We will assume that there exists a set X of possible data points, called the population. There also exists some target function, f, that classifies each x ∈ X into one of K classes. Without loss of generality, we will assume that K = 2, although none of the results in this article depend on this assumption, since our only concern will be whether an example is classified correctly or incorrectly. In an application setting, a sample S is drawn randomly from X according to a fixed probability distribution D. A collection of training examples is constructed by labeling each x ∈ S according to f(x). Each training example therefore has the form ⟨x, f(x)⟩. In some applications, there may be a source
of classification noise that randomly sets the label to an incorrect value. A learning algorithm A takes as input a set of training examples R and outputs a classifier f̂. The true error rate of that classifier is the probability that f̂ will misclassify an example drawn randomly from X according to D. In practice, this error rate is estimated by taking our available sample S and subdividing it into a training set R and a test set T. The error rate of f̂ on T provides an estimate of the true error rate of f̂ on the population X. The null hypothesis to be tested is that for a randomly drawn training set R of fixed size, the two learning algorithms will have the same error rate on a test example randomly drawn from X, where all random draws are made according to distribution D. Let f̂_A be the classifier output by algorithm A trained on training set R, and let f̂_B be the classifier output by algorithm B trained on R. Then the null hypothesis can be written as

    Pr_{R,x}[f̂_A(x) = f(x)] = Pr_{R,x}[f̂_B(x) = f(x)],

where the notation Pr_{R,x} indicates the probability taken with respect to the random draws of the training set R and the test example x.

3 Five Statistical Tests

We now describe the statistical tests that are the main subject of this paper. We begin with simple holdout tests and then consider tests based on resampling from the available data.

3.1 McNemar's Test. To apply McNemar's test (Everitt, 1977), we divide our available sample of data S into a training set R and a test set T. We train both algorithms A and B on the training set, yielding classifiers f̂_A and f̂_B. We then test these classifiers on the test set. For each example x ∈ T, we record how it was classified and construct the following contingency table:

    number misclassified by both f̂_A and f̂_B      number misclassified by f̂_A but not by f̂_B
    number misclassified by f̂_B but not by f̂_A    number misclassified by neither f̂_A nor f̂_B

We will use the notation

    n00   n01
    n10   n11

where n = n00 + n01 + n10 + n11 is the total number of examples in the test set T. Under the null hypothesis, the two algorithms should have the same error rate, which means that n01 = n10. McNemar's test is based on a χ² test
for goodness of fit that compares the distribution of counts expected under the null hypothesis to the observed counts. The expected counts under the null hypothesis are

    n00             (n01 + n10)/2
    (n01 + n10)/2   n11

The following statistic is distributed (approximately) as χ² with 1 degree of freedom; it incorporates a "continuity correction" term (of −1 in the numerator) to account for the fact that the statistic is discrete while the χ² distribution is continuous:

    (|n01 − n10| − 1)² / (n01 + n10).

If the null hypothesis is correct, then the probability that this quantity is greater than χ²_{1,0.95} = 3.841459 is less than 0.05. So we may reject the null hypothesis in favor of the hypothesis that the two algorithms have different performance when trained on the particular training set R. Note, however, that this test has two shortcomings with regard to Question 8. First, it does not directly measure variability due to the choice of the training set or the internal randomness of the learning algorithm. A single training set R is chosen, and the algorithms are compared using that training set only. Hence, McNemar's test should be applied only if we believe these sources of variability are small. Second, it does not directly compare the performance of the algorithms on training sets of size |S|, but only on sets of size |R|, which must be substantially smaller than |S| to ensure a sufficiently large test set. Hence, we must assume that the relative difference observed on training sets of size |R| will still hold for training sets of size |S|.
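As a concrete illustration of the decision rule just described, McNemar's statistic can be computed in a few lines of Python from the two off-diagonal counts; the counts in the usage comment are hypothetical.

    def mcnemar_statistic(n01, n10):
        """McNemar's chi-squared statistic with the continuity correction.
        n01: examples misclassified by A but not B;
        n10: examples misclassified by B but not A."""
        return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

    # Hypothetical counts: reject the null at the 0.05 level if the
    # statistic exceeds chi^2_{1,0.95} = 3.841459.
    stat = mcnemar_statistic(n01=25, n10=10)   # = 196/35 = 5.6 -> reject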
3.2 A Test for the Difference of Two Proportions. A second simple statistical test is based on measuring the difference between the error rate of algorithm A and the error rate of algorithm B (Snedecor & Cochran, 1989). Specifically, let pA = (n00 + n01)/n be the proportion of test examples incorrectly classified by algorithm A, and let pB = (n00 + n10)/n be the proportion of test examples incorrectly classified by algorithm B. The assumption underlying this statistical test is that when algorithm A classifies an example x from the test set T, the probability of misclassification is pA. Hence, the number of misclassifications of n test examples is a binomial random variable with mean n pA and variance pA(1 − pA)n. The binomial distribution can be well approximated by a normal distribution for reasonable values of n. Furthermore, the difference between two independent normally distributed random variables is itself normally distributed. Hence, the quantity pA − pB can be viewed as normally distributed if we assume that the measured error rates pA and pB are independent. Under the null hypothesis, this will have a mean of zero and a standard error of

    se = √(2p(1 − p)/n),

where p = (pA + pB)/2 is the average of the two error probabilities. From this analysis, we obtain the statistic

    z = (pA − pB) / √(2p(1 − p)/n),

which has (approximately) a standard normal distribution. We can reject the null hypothesis if |z| > Z_{0.975} = 1.96 (for a two-sided test with probability of incorrectly rejecting the null hypothesis of 0.05). This test has been used by many researchers, including the author (Dietterich, Hild, & Bakiri, 1995). However, there are several problems with this test. First, because pA and pB are each measured on the same test set T, they are not independent. Second, the test shares the drawbacks of McNemar's test: it does not measure variation due to the choice of training set or internal variation of the learning algorithm, and it does not directly measure the performance of the algorithms on training sets of size |S|, but rather on the smaller training set of size |R|. The lack of independence of pA and pB can be corrected by changing the estimate of the standard error to be

    se′ = √((n01 + n10)/n²).

This estimate focuses on the probability of disagreement of the two algorithms (Snedecor & Cochran, 1989). The resulting z statistic can be written as

    z′ = (|n01 − n10| − 1) / √(n01 + n10),
which we can recognize as the square root of the χ² statistic in McNemar's test. In this article, we have experimentally analyzed the uncorrected z statistic, since this statistic is in current use and we wanted to determine how badly the (incorrect) independence assumption affects the accuracy of the test. For small sample sizes, there are exact versions of both McNemar's test and the test for the difference of two proportions that avoid the χ² and normal approximations.
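For concreteness, the uncorrected z statistic analyzed in this article can be computed from the contingency counts as follows; this is a minimal sketch, and the threshold comment simply restates the decision rule above.

    import math

    def diff_of_proportions_z(n00, n01, n10, n11):
        """Uncorrected z statistic for the difference of two error
        proportions measured on a common test set of n examples.
        Reject the null at the 0.05 level if |z| > 1.96."""
        n = n00 + n01 + n10 + n11
        p_a = (n00 + n01) / n   # error rate of algorithm A
        p_b = (n00 + n10) / n   # error rate of algorithm B
        p = (p_a + p_b) / 2.0
        return (p_a - p_b) / math.sqrt(2.0 * p * (1.0 - p) / n)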
3.3 The Resampled Paired t Test. The next statistical test we consider is currently the most popular in the machine learning literature. A series of (usually) 30 trials is conducted. In each trial, the available sample S is randomly divided into a training set R of a specified size (e.g., typically two-thirds of the data) and a test set T. Learning algorithms A and B are trained on R, and the resulting classifiers are tested on T. Let p_A^(i) (respectively, p_B^(i)) be the observed proportion of test examples misclassified by algorithm A (respectively B) during trial i. If we assume that the 30 differences p^(i) = p_A^(i) − p_B^(i) were drawn independently from a normal distribution, then we can apply Student's t test, by computing the statistic

    t = p̄ √n / √( Σ_{i=1}^{n} (p^(i) − p̄)² / (n − 1) ),
where p̄ = (1/n) Σ_{i=1}^{n} p^(i). Under the null hypothesis, this statistic has a t distribution with n − 1 degrees of freedom. For 30 trials, the null hypothesis can be rejected if |t| > t_{29,0.975} = 2.04523. There are many potential drawbacks of this approach. First, the individual differences p^(i) will not have a normal distribution, because p_A^(i) and p_B^(i) are not independent. Second, the p^(i)'s are not independent, because the test sets in the trials overlap (and the training sets in the trials overlap as well). We will see below that these violations of the assumptions underlying the t test cause severe problems that make this test unsafe to use.

3.4 The k-Fold Cross-Validated Paired t Test. This test is identical to the previous one except that instead of constructing each pair of training and test sets by randomly dividing S, we instead randomly divide S into k disjoint sets of equal size, T_1, ..., T_k. We then conduct k trials. In each trial, the test set is T_i, and the training set is the union of all of the other T_j, j ≠ i. The same t statistic is computed. The advantage of this approach is that each test set is independent of the others. However, this test still suffers from the problem that the training sets overlap. In a 10-fold cross-validation, each pair of training sets shares 80% of the examples. This overlap may prevent this statistical test from obtaining a good estimate of the amount of variation that would be observed if each training set were completely independent of previous training sets. To illustrate this point, consider the nearest-neighbor algorithm. Suppose our training set contains two clusters of points: a large cluster belonging to one class and a small cluster belonging to the other class. If we perform a twofold cross-validation, we must subdivide the training data into two disjoint sets. If all of the points in the smaller cluster go into one of those two sets, then both runs of the nearest-neighbor algorithm will have elevated error rates, because when the small cluster is in the test set, every point in it will be misclassified. When the small cluster is in the training set, some of its points may (incorrectly) be treated as nearest neighbors of test set points, which also increases the error rate. Conversely, if the small cluster is evenly divided between the two sets, then the error rates will improve,
because for each test point, there will be a corresponding nearby training point that will provide the correct classification. Either way, we can see that the performance of the two folds of the cross-validation will be correlated rather than independent. We verified this experimentally for 10-fold cross-validation on the letter recognition task (300 total training examples) in an experiment where the null hypothesis was true (described below). We measured the correlation coefficient between the differences in error rates on two folds within a cross-validation, p^(i) and p^(j). The observed value was 0.03778, which according to a t test is significantly different from 0 with p < 10^{−10}. On the other hand, if the error rates p^(i) and p^(j) are drawn from independent 10-fold cross-validations (i.e., on independent data sets), the correlation coefficient is −0.00014, which according to a t test is not significantly different from zero.

3.5 The 5 × 2 cv Paired t Test. In some initial experiments with the k-fold cross-validated paired t test, we attempted to determine why the t statistic was too large in some cases. The numerator of the t statistic estimates the mean difference in the performance of the two algorithms (over the k folds), while the denominator estimates the variance of these differences. With synthetic data, we constructed k nonoverlapping training sets and measured the mean and variance on those training sets. We found that while the variance was slightly underestimated when the training sets overlapped, the means were occasionally very poorly estimated, and this was the cause of the large t values. The problem can be traced to the correlations between the different folds, as described above. We found that if we replaced the numerator of the t statistic with the observed difference from a single fold of the k-fold cross-validation, the statistic became well behaved. This led us to the 5 × 2 cv paired t test. In this test, we perform five replications of twofold cross-validation. In each replication, the available data are randomly partitioned into two equal-sized sets, S_1 and S_2. Each learning algorithm (A or B) is trained on each set and tested on the other set. This produces four error estimates: p_A^(1) and p_B^(1) (trained on S_1 and tested on S_2) and p_A^(2) and p_B^(2) (trained on S_2 and tested on S_1). Subtracting corresponding error estimates gives us two estimated differences: p^(1) = p_A^(1) − p_B^(1) and p^(2) = p_A^(2) − p_B^(2). From these two differences, the estimated variance is s² = (p^(1) − p̄)² + (p^(2) − p̄)², where p̄ = (p^(1) + p^(2))/2. Let s_i² be the variance computed from the ith replication, and let p_1^(1) be the p^(1) from the very first of the five replications. Then define the following statistic,

    t̃ = p_1^(1) / √( (1/5) Σ_{i=1}^{5} s_i² ),
which we will call the 5 × 2 cv t̃ statistic. We claim that under the null hypothesis, t̃ has approximately a t distribution with 5 degrees of freedom. The argument goes as follows. Let A be a standard normal random variable and B be a χ² random variable with n − 1 degrees of freedom. Then by definition, the quantity

    A / √(B/(n − 1))   (3.1)

has a t distribution with n − 1 degrees of freedom if A and B are independent. The usual t statistic is derived by starting with a set of random variables X_1, ..., X_n having a normal distribution with mean µ and variance σ². Let X̄ = (1/n) Σ_i X_i be the sample mean and S² = Σ_i (X_i − X̄)² be the sum of squared deviations from the mean. Then define

    A = √n (X̄ − µ)/σ,
    B = S²/σ².

Well-known results from probability theory state that A has a standard normal distribution and B has a χ² distribution with n − 1 degrees of freedom. A more remarkable result from probability theory is that A and B are also independent, provided the original X_i's were drawn from a normal distribution. Hence, we can plug them into equation 3.1 as follows:

    t = A / √(B/(n − 1))
      = [√n (X̄ − µ)/σ] / √(S²/(σ²(n − 1)))
      = √n (X̄ − µ) / √(S²/(n − 1)).

This gives the usual definition of the t statistic when µ = 0. We can construct t̃ by analogy as follows. Under the null hypothesis, the numerator of t̃, p_1^(1), is the difference of two identically distributed proportions, so we can safely treat it as an approximately normal random variable with zero mean and unknown standard deviation σ if the underlying test set contained at least 30 points. Hence, let A = p_1^(1)/σ. Also under the null hypothesis, s_i²/σ² has a χ² distribution with 1 degree of freedom if we make the additional assumption that p_i^(1) and p_i^(2) are independent. This assumption is false, as we have seen, because these two differences of proportions are measured on the opposite folds of a twofold cross-validation. Still, the assumption of independence is probably more appropriate for twofold cross-validation than for 10-fold cross-validation,
because in the twofold case, the training sets are completely nonoverlapping (and, as always in cross-validation, the test sets are nonoverlapping). (We chose twofold cross-validation because it gives large test sets and disjoint training sets. The large test set is needed because we are using only one paired difference p_1^(1) in t̃. The disjoint training sets help make p_i^(1) and p_i^(2) more independent. A drawback, of course, is that the learning algorithms are trained on training sets half of the size of the training sets for which, under Question 8, we seek their relative performance.) We could set B = s_1²/σ², but when we tested this experimentally, we found that the resulting estimate of the variance was very noisy, and often zero. In similar situations, others have found that combining the results of multiple cross-validations can help stabilize an estimate, so we perform five twofold cross-validations and define

    B = ( Σ_{i=1}^{5} s_i² ) / σ².

If we assume that the s_i² from each twofold cross-validation are independent of each other, then B is the sum of five independent random variables, each having a χ² distribution with 1 degree of freedom. By the summation property of the χ² distribution, this means B has a χ² distribution with 5 degrees of freedom. This last independence assumption is also false, because each twofold cross-validation is computed from the same training data. However, experimental tests showed that this is the least problematic of the various independence assumptions underlying the 5 × 2 cv test. Finally, to use equation 3.1, we must make the assumption that the variance estimates s_i are independent of p_1^(1). This must be assumed (rather than proved as in the usual t distribution derivation), because we are using only one of the observed differences of proportions rather than the mean of all of the observed differences. The mean difference tends to overestimate the true difference, because of the lack of independence between the different folds of the cross-validation. With all of these assumptions, we can plug in to equation 3.1 to obtain t̃. Let us summarize the assumptions and approximations involved in this derivation. First, we employ the normal approximation to the binomial distribution. Second, we assume pairwise independence of p_i^(1) and p_i^(2) for all i. Third, we assume independence between the s_i's. Finally, we assume independence between the numerator and denominator of the t̃ statistic.

One way to evaluate the 5 × 2 cv statistic experimentally is to make a quantile-quantile plot (QQ plot), as shown in Figure 2. The QQ plot shows 1000 computed values of the 5 × 2 cv statistic for a case where the null hypothesis is known to apply (the EXP6 task, as described below). To generate a QQ plot, the 1000 values are sorted and assigned quantiles (their rank in the sorted list divided by 1000). Then the inverse cumulative t dis-
Figure 2: QQ plot comparing the distribution of 1000 values of t̃ to the values they should have under a t distribution with 5 degrees of freedom. All points would fall on the line y = x if the distributions matched.
tribution (with 5 degrees of freedom) is used to compute for each quantile the value that a t-distributed random variable would have taken if it had had that rank. This value becomes the x-coordinate, and the original value t̃ becomes the y-coordinate. In other words, for each observed point, based on its ordinal position within the 1000 points, we can compute what value it should have had if the 1000 points had been truly drawn from a t distribution. If the 1000 points have a t distribution with 5 degrees of freedom, then they should lie on the line y = x. The figure shows a fairly good fit to the line. However, at the tails of the distribution, t̃ is somewhat more conservative than it should be. Our choice of five replications of cross-validation is not arbitrary. Exploratory studies showed that using fewer or more than five replications increased the risk of type I error. A possible explanation is that there are two competing problems. With fewer replications, the noise in the measurement of the s_i's becomes troublesome. With more replications, the lack of independence among the s_i's becomes troublesome. Whether five is the best value for the number of replications is an open question.
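The complete 5 × 2 cv procedure is compact enough to sketch in Python. The training-and-evaluation interface below (the error_rate function and the 'A'/'B' labels) is a hypothetical assumption of ours; only the statistic itself follows the definition given in the text.

    import numpy as np

    def five_by_two_cv_t(n_examples, error_rate, rng):
        """5x2 cv paired t statistic. `error_rate(algo, train_idx, test_idx)`
        is an assumed user-supplied function that trains one of the two
        algorithms ('A' or 'B') on the training indices and returns its
        error rate on the test indices."""
        variances, p1_first = [], None
        for i in range(5):                             # five replications
            perm = rng.permutation(n_examples)
            half = n_examples // 2
            s1, s2 = perm[:half], perm[half:]          # one twofold split
            diffs = [error_rate('A', tr, te) - error_rate('B', tr, te)
                     for tr, te in ((s1, s2), (s2, s1))]
            p_bar = 0.5 * (diffs[0] + diffs[1])
            variances.append((diffs[0] - p_bar) ** 2 + (diffs[1] - p_bar) ** 2)
            if i == 0:
                p1_first = diffs[0]                    # p^(1) of replication 1
        # Reject the null at the 0.05 level if |t| exceeds t_{5,0.975} = 2.571.
        return p1_first / np.sqrt(np.mean(variances))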
4 Simulation Experiment Design

We now turn to an experimental evaluation of these five methods. The purpose of the simulation was to measure the probability of type I error of these tests. A type I error occurs when the null hypothesis is true (there is no difference between the two learning algorithms) and the statistical test rejects the null hypothesis. To measure the probability of type I error, we constructed some simulated learning problems. To understand these problems, it is useful to think abstractly about the behavior of learning algorithms.
4.1 Simulating the Behavior of Learning Algorithms. Consider a population of N data points and suppose that the training set size is fixed. Then for a given learning algorithm A, define $\epsilon_A(x)$ to be the probability that the classifier produced by A when trained on a randomly drawn training set (of the fixed size) will misclassify x. If $\epsilon_A(x) = 0$, then x is always correctly classified by classifiers produced by A. If $\epsilon_A(x) = 1$, then x is always misclassified. Figure 3 shows the measured values of $\epsilon(x)$ for a population of 7670 points with respect to the C4.5 decision tree algorithm (Quinlan, 1993) trained on randomly drawn training sets of 100 examples. The points were sorted by their $\epsilon$ values. Given these $\epsilon$ values, we could simulate the behavior of C4.5 on a randomly drawn test set of points by taking each point x and misclassifying it with probability $\epsilon(x)$. This would not exactly reproduce the behavior of C4.5, because it assumes that the misclassification errors made by C4.5 are independent for each test example, whereas in fact, the classifications of data points that are close together will tend to be highly correlated. However, this simulated C4.5 procedure would have the same average error rate as the real C4.5 algorithm, and it will exhibit a similar degree of variation from one random trial to the next. We can simulate learning algorithms with various properties by defining a population of points X and assigning a value $\epsilon(x)$ to each point. If we want a learning algorithm to have high variance, we can assign values of $\epsilon$ near 0.5, which is the value giving the maximum variance for a binomial random variable. If we want two learning algorithms to have the same error rate on the population, we can ensure that the average value of $\epsilon$ over the population X is the same for both algorithms. In our studies, we wanted to construct simulated learning problems that would provide a worst case for our statistical tests. To accomplish this, we sought to maximize the two main sources of random variation: variation resulting from the choice of test data sets and variation resulting from the choice of training sets. We ignored the issue of classification noise. Because it affects training and test data equally, it can be incorporated into the overall error rate. We also ignored internal randomness in the learning algorithms,
Figure 3: Measured values of $\epsilon_{C4.5}(x)$ for a population of 7670 data points.
since this will manifest itself in the same way as training set variance: by causing the same learning algorithm to produce different classifiers. For tests of type I error, we designed two sets of $\epsilon$ values: $\epsilon_A(x)$ for algorithm A and $\epsilon_B(x)$ for algorithm B. We established a target error rate $\epsilon$ and chose only two distinct values to use for $\epsilon_A(x)$ and $\epsilon_B(x)$: $\frac{1}{2}\epsilon$ and $\frac{3}{2}\epsilon$. We generated a population of points. For the first half of the population, we assigned $\epsilon_A(x) = \frac{1}{2}\epsilon$ and $\epsilon_B(x) = \frac{3}{2}\epsilon$. For the remaining half of the population, we reversed this and assigned $\epsilon_A(x) = \frac{3}{2}\epsilon$ and $\epsilon_B(x) = \frac{1}{2}\epsilon$. Figure 4 shows this configuration of $\epsilon$ values. The size of the population is irrelevant, because we are sampling with replacement, and there are only two kinds of points. The important property is that the population is evenly divided between these two kinds of training points. The effect is that each algorithm has an overall error rate of $\epsilon$, and each algorithm has the same total variance. However, for any given test example, the algorithms have very different error rates. This makes the effect of the random choice of test data sets very apparent. Indeed, unless the test data set is exactly equally divided between the first half of the population and the second half of the population, there will be an apparent advantage for one algorithm over the other. Our statistical tests will need to
Figure 4: Designed values of $\epsilon_A(x)$ and $\epsilon_B(x)$ for an overall error rate of $\epsilon = 0.10$.
avoid being fooled by this apparent difference in error rates. Experimental measurements confirmed that these choices of $\epsilon_A(x)$ and $\epsilon_B(x)$ did the best job of simultaneously maximizing within-algorithm variance and between-algorithm variation while achieving the desired overall error rate of $\epsilon$. In most of our experiments, we used a value of $\epsilon = 0.10$, although we also investigated $\epsilon = 0.20$, $\epsilon = 0.30$, and $\epsilon = 0.40$.

4.2 Details of the Simulation of Each Statistical Test. Each simulation is divided into a series of 1000 trials. In each trial, a data set S of size 300 is randomly drawn with replacement and then analyzed using each of the five statistical tests described above. The goal is to measure the proportion of trials in which the null hypothesis is rejected. For the first two tests, McNemar's test and the normal test for the difference of two proportions, the data set S is randomly divided into a training set R containing two-thirds of S and a test set T containing the remaining one-third of S. The training set is ignored, because we do not actually execute the learning algorithms. Rather, the performance of each algorithm on the test set is simulated by classifying each test example x randomly according to its values of $\epsilon_A(x)$ and $\epsilon_B(x)$. A random number in the range [0,1) is drawn. If it is less than $\epsilon_A(x)$, then x is considered to be misclassified
by algorithm A. A second random number is drawn, and x is considered misclassified by algorithm B if that number is less than $\epsilon_B(x)$. The results of these classifications are then processed by the appropriate statistical test. All tests were performed using two-sided tests with confidence level 0.05. For the resampled paired t test, this process of randomly splitting the data set S two-thirds/one-third was repeated 30 times. Each test set was classified as described in the previous paragraph. The 30 differences in the error rates of the two algorithms were collected and employed in the t test. For the k-fold cross-validated paired t test, the data set S was divided into k = 10 random subsets of equal size. Each of the 10 test sets was then classified by both algorithms using the random procedure described. However, during the classification process, to simulate random variation in the quality of each training set, we generated a random value $\beta$ in the range [−0.02, +0.02] and added this value $\beta$ to every $\epsilon_A(x)$ and $\epsilon_B(x)$ before generating the classifications. The results of the classifications were then collected and subjected to the t test. For the 5 × 2 cv test, five replications of twofold cross-validation were performed, and the t statistic was constructed as described above. It is important to note that this experiment does not simulate training set variance. In particular, it does not model the effect of overlapping training sets on the behavior of the cross-validated t test or the 5 × 2 cv test. We will correct this shortcoming below in our experiments with real learning algorithms.

5 Results

Figure 5 shows the probability of making a type I error for each of the five procedures when the data set S contains 300 examples and the overall error rate $\epsilon$ was varied from 0.10 to 0.40. Two of the five tests have a probability of type I error that exceeds the target value of 0.05: the difference-of-proportions test and the resampled t test. The remaining tests have acceptable probability of making a type I error according to this simulation. The resampled t test has a much higher probability of type I error than the other tests. This results from the fact that the randomly drawn data set S is likely to contain an imbalance of points from the first half of the population compared to the second half of the population. The resampled t test can detect and magnify this difference until it is "statistically significant." Indeed, the probability of making a type I error with this test can be increased by increasing the number of resampled training-test splits. Figure 6 shows the effect of various numbers of resampled splits for both the resampled t test and the cross-validated t test. Notice that the cross-validated t test does not exhibit this problem. The difference-of-proportions test suffers from essentially the same problem. When the sample S is unrepresentative, the measured difference in the
Figure 5: Probability of type I error for each statistical test. The four adjacent bars for each test represent the probability of type I error for $\epsilon = 0.10$, 0.20, 0.30, and 0.40. Error bars show 95% confidence intervals for these probabilities. The horizontal dotted line shows the target probability of 0.05.
two proportions will be large, especially when $\epsilon$ is near 0.5. It is interesting that McNemar's test does not share this problem. The key difference is that the difference-of-proportions test looks only at the difference between two proportions and not at their absolute values. Consider the following two 2 × 2 contingency tables:

    Left:          Right:
     0   60         40   20
    40    0          0   40
Both tables have the same difference in error rates of 0.20, so the difference-of-proportions test treats them identically (and rejects the null hypothesis, p < 0.005). However, McNemar's test finds no significant difference in the left table, but finds an extremely significant difference (p < 0.001) in the right table. This is because in the left table, McNemar's test asks the question, "What is the probability in 100 tosses of a fair coin that we will receive 40 heads and 60 tails?" In the right table, it asks the question, "What
Figure 6: Probability of type I error for the resampled t test and the k-fold cross-validated t test as the number of resampling replications is varied. The error rate $\epsilon = 0.10$.
is the probability in 20 tosses of a fair coin that we will receive 20 heads and 0 tails?" The way we have constructed our simulated learning problems, we are more likely to produce tables like the one on the left, especially when $\epsilon$ is near 0.5. Note that if we had used the corrected version of the difference-of-proportions test, it would not have suffered from this problem. Because of the poor behavior (and high cost) of the resampled t test, we excluded it from further experiments. The biggest drawback of the simulation is that it does not capture or measure training set variance or variance resulting from the internal behavior of the learning algorithm. To address these problems, we conducted a second set of experiments with real learning algorithms and real data sets.

6 Experiments on Realistic Data

6.1 Methods. To evaluate the type I error rates of our four statistical tests with real learning algorithms, we needed to find two learning algorithms that had identical performance when trained on training sets of a given size. We also needed the learning algorithms to be very efficient, so
that the experiments could be replicated many times. To achieve this, we chose C4.5 Release 1 (Quinlan, 1993) and the first nearest-neighbor (NN) algorithm (Dasarathy, 1991). We then selected three difficult problems: the EXP6 problem developed by Kong and Dietterich (1995), the letter recognition data set (Frey & Slate, 1991), and the Pima Indians diabetes task (Merz & Murphy, 1996). Of course, C4.5 and NN do not have the same performance on these data sets. In EXP6 and letter recognition, NN performs much better than C4.5; the reverse is true in the Pima data set. Our next step was to “damage” the learning algorithms so that their performance was identical. In the EXP6 and letter recognition tasks, we modified the distance metric employed by NN to be a weighted Euclidean distance with bad weights. This allowed us to reduce the performance of NN until it matched C4.5 on those data sets. To equalize the performance of the algorithms on the Pima data set, we modified C4.5 so that when classifying new instances, it would make random classification errors at a specified rate. More precisely, each data set was processed as follows. For EXP6, we generated a calibration set of 22,801 examples (spaced on a uniform grid of resolution 0.1). We then generated 1000 data sets, each of size 300, to simulate 1000 separate trials. From each of these 1000 data sets, we randomly drew subsets of size 270, 200, and 150. These sizes were chosen because they are the sizes of training sets used in the 10-fold cross-validated t test, the McNemar and difference-of-proportions tests, and the 5 × 2 cv t test, respectively, when those tests are given an initial data set of 300 examples. For each size of training set (270, 200, and 150), we adjusted the distance metric for NN so that the average performance (over all 1000 data sets, measured on the 22,801 calibration examples) matched the average performance of C4.5 to within 0.1%. For letter recognition, we randomly subdivided the 20,000 examples into a calibration set of 10,000 and an experimental set of 10,000. We then drew 1000 data sets, each of size 300, randomly from the experimental set of 10,000 examples. Again, from each of these data sets, we drew random subsets of size 270, 200, and 150. For each size of training set, we adjusted the distance metric for NN so that the average performance (over all 1000 data sets, measured on the 10,000 calibration examples) matched the average performance of C4.5 to within 0.1%. For the Pima Indians diabetes data set, we drew 1000 data sets of size 300 from the 768 available examples. For each of these data sets, the remaining 468 examples were retained for calibration. Each of the 1000 data sets of size 300 was further subsampled to produce random subsets of size 270, 200, and 150. For each size of training set, we measured the average error rate of C4.5 and NN (over 1000 data sets, when tested on the 468 calibration examples corresponding to each data set). We then adjusted the random noise rate of C4.5 so that the average error rates would be identical.
Figure 7: Type I error rates for four statistical tests. The three bars within each test correspond to the EXP6, letter recognition, and Pima data sets. Error bars are 95% confidence intervals on the true type I error rate.
6.2 Type I Error Results. Figure 7 shows the measured type I error rates of the four statistical tests. The 10-fold cross-validated t test is the only test whose type I error exceeds 0.05. All of the other tests, even the difference-of-proportions test ("Prop Test"), show acceptable type I error rates.

6.3 Power Measurements. Type I error is not the only important consideration in choosing a statistical test. It is the most important criterion if one's goal is to be confident that an observed performance difference is real. But if one's goal is to detect whether there is a difference between two learning algorithms, then the power of the statistical test is important. The power of a test is the probability that it will reject the null hypothesis when the null hypothesis is false. To measure the power of the tests, we recalibrated the distance metric for nearest neighbor (for the EXP6 and letter recognition tasks) and the random classification error rate (for Pima) to achieve various differences in the performance of C4.5 and NN. Specifically, we did the following. First, we measured the performance for C4.5 and NN when trained on 300 examples and tested on the appropriate calibration examples as before (denote these
Figure 8: Learning curves interpolated between C4.5 and NN for the purpose of measuring power.
error rates $\epsilon_{C4.5}$ and $\epsilon_{NN}$). Then we chose various target error rates between these extremes. For each target error rate $\epsilon_{\text{target}}$, we computed the fraction $\lambda$ such that

$$\epsilon_{\text{target}} = \epsilon_{C4.5} - \lambda(\epsilon_{C4.5} - \epsilon_{NN}).$$

We then calibrated the error rates of C4.5 and NN for training sets of size 150, 200, and 270 so that the same value of $\lambda$ applied. In other words, we adjusted the learning curve for the nearest-neighbor algorithm (with damaged distance metric) so that it was positioned at a fixed fraction of the way between the learning curves for C4.5 and for NN (with an undamaged distance metric). Figure 8 shows the calibrated learning curves for the letter recognition task for various values of $\lambda$. Figures 9, 10, and 11 plot power curves for the four statistical tests. These curves show that the cross-validated t test is much more powerful than the other three tests. Hence, if the goal is to be confident that there is no difference between two algorithms, then the cross-validated t test is the test of choice, even though its type I error is unacceptable. Of the tests with acceptable type I error, the 5 × 2 cv t test is the most powerful. However, it is sobering to note that even when the performance of the learning algorithms differs by 10 percentage points (as in the letter
Figure 9: Power of four statistical tests on the EXP6 task. The horizontal axis plots the number of percentage points by which the two algorithms (C4.5 and NN) differ when trained on training sets of size 300.
recognition task), these statistical tests are able to detect this only about one-third of the time.

7 Discussion

The experiments suggest that the 5 × 2 cv test is the most powerful among those statistical tests that have acceptable type I error. This test is also the most satisfying, because it assesses the effect of both the choice of training set (by running the learning algorithms on several different training sets) and the choice of test set (by measuring the performance on several test sets). Despite the fact that McNemar's test does not assess the effect of varying the training sets, it still performs very well. Indeed, in all of our various experiments, we never once saw the type I error rate of McNemar's test exceed the target level (0.05). In contrast, we did observe cases where both the 5 × 2 cv and difference-of-proportions tests were fooled.
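For reference, a sketch of McNemar's test in its continuity-corrected chi-square form, applied to the two contingency tables examined in the results section; the helper name is ours.

```python
from scipy import stats

def mcnemar(n01, n10):
    """McNemar's test on the off-diagonal counts: the numbers of test
    examples misclassified by exactly one of the two algorithms."""
    chi2 = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return chi2, stats.chi2.sf(chi2, df=1)

mcnemar(40, 60)  # left table:  chi2 = 3.61, p = 0.057 -- not significant
mcnemar(20, 0)   # right table: chi2 = 18.05, p < 0.001
```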
Figure 10: Power of four statistical tests on the letter recognition task. The horizontal axis plots the number of percentage points by which the two algorithms (C4.5 and NN) differ when trained on training sets of size 300.
The 5 × 2 cv test will fail in cases where the error rates measured in the various twofold cross-validation replications vary wildly (even when the difference in error rates is unchanged). We were able to observe this in some simulated data experiments where the error rates fluctuated between 0.1 and 0.9. We did not observe it during any of our experiments on realistic data. Wild variations cause bad estimates of the variance. It is therefore advisable to check the measured error rates when applying this test. The difference-of-proportions test will fail in cases where the two learning algorithms have very different regions of poor performance and where the error rates are close to 0.5. We did not encounter this problem in our experiments on realistic data, although C4.5 and NN are very different algorithms. We suspect that this is because most errors committed by learning algorithms are near the decision boundaries. Hence, most learning algorithms with comparable error rates have very similar regions of poor performance, so the pathology that we observed in our simulated data experiments does not arise in practice. Nonetheless, given the superior performance of McNemar’s test and the incorrect assumptions underlying our version of the
Figure 11: Power of four statistical tests on the Pima task. The horizontal axis plots the number of percentage points by which the two algorithms (C4.5 and NN) differ when trained on training sets of size 300.
difference-of-proportions test, there can be no justification for ever employing the uncorrected difference-of-proportions test. The 10-fold cross-validated t test has high type I error. However, it also has high power, and hence, it can be recommended in those cases where type II error (the failure to detect a real difference between algorithms) is more important.

8 Conclusions

The starting point for this article was Question 8: Given two learning algorithms A and B and a small data set S, which algorithm will produce more accurate classifiers when trained on data sets of the same size as S drawn from the same population? Unfortunately, none of the statistical tests we have described and evaluated can answer this question. All of the statistical tests require using holdout or resampling methods, with the consequence that they can tell us only about the relative performance of the learning algorithms on training sets of size |R| < |S|, where |R| is the size of the training set employed in the statistical test.
In addition to this fundamental problem, each of the statistical tests has other shortcomings. The derivation of the 5 × 2 cv test requires a large number of independence assumptions that are known to be violated. McNemar's test and the difference-of-proportions test do not measure the variation resulting from the choice of training sets or internal randomness in the algorithm and therefore do not measure all of the important sources of variation. The cross-validated t test violates the assumptions underlying the t test, because the training sets overlap. As a consequence of these problems, all of the statistical tests described here must be viewed as approximate, heuristic tests rather than as rigorously correct statistical methods. This article has therefore relied on experimental evaluations of these methods, and the following conclusions must be regarded as tentative, because the experiments are based on only two learning algorithms and three data sets. Our experiments lead us to recommend either the 5 × 2 cv t test, for situations in which the learning algorithms are efficient enough to run ten times, or McNemar's test, for situations where the learning algorithms can be run only once. Both tests have similar power. Our experiments have also revealed the shortcomings of the other statistical tests, so we can confidently conclude that the resampled t test should never be employed. This test has very high probability of type I error, and results obtained using this test cannot be trusted. The experiments also suggest caution in interpreting the results of the 10-fold cross-validated t test. This test has an elevated probability of type I error (as much as twice the target level), although it is not nearly as severe as the problem with the resampled t test. We hope that the results in this article will be useful to scientists in the machine learning and neural network communities as they develop, understand, and improve machine learning algorithms.

Acknowledgments

I first learned of the k-fold cross-validated t test from Ronny Kohavi (personal communication). I thank Jim Kolsky of the Oregon State University Statistics Department for statistical consulting assistance and Radford Neal, Ronny Kohavi, and Tom Mitchell for helpful suggestions. I am very grateful to the referees for their careful reading, which identified several errors in an earlier version of this article. This research was supported by grants IRI-9204129 and IRI-9626584 from the National Science Foundation and grant N00014-95-1-0557 from the Office of Naval Research.

References

Breiman, L. (1994). Heuristics of instability and stabilization in model selection (Tech. Rep. No. 416). Berkeley, CA: Department of Statistics, University of California, Berkeley.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group.
Dasarathy, B. V. (Ed.). (1991). Nearest neighbor (NN) norms: NN pattern classification techniques. Los Alamitos, CA: IEEE Computer Society Press.
Dietterich, T. G., Hild, H., & Bakiri, G. (1995). A comparison of ID3 and backpropagation for English text-to-speech mapping. Machine Learning, 18, 51–80.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Everitt, B. S. (1977). The analysis of contingency tables. London: Chapman and Hall.
Frey, P. W., & Slate, D. J. (1991). Letter recognition using Holland-style adaptive classifiers. Machine Learning, 6, 161–182.
Haussler, D., Kearns, M., Seung, H. S., & Tishby, N. (1994). Rigorous learning curve bounds from statistical mechanics. In Proc. 7th Annu. ACM Workshop on Comput. Learning Theory (pp. 76–87). New York: ACM Press.
Hinton, G. E., Neal, R. M., Tibshirani, R., & DELVE team members. (1995). Assessing learning procedures using DELVE (Tech. Rep.). Toronto: University of Toronto, Department of Computer Science. Available from: http://www.cs.utoronto.ca/neuron/delve/delve.html.
Kohavi, R. (1995). Wrappers for performance enhancement and oblivious decision graphs. Unpublished doctoral dissertation, Stanford University.
Kolen, J. F., & Pollack, J. B. (1991). Back propagation is sensitive to initial conditions. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 860–867). San Mateo, CA: Morgan Kaufmann.
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In A. Prieditis & S. Russell (Eds.), The Twelfth International Conference on Machine Learning (pp. 313–321). San Mateo, CA: Morgan Kaufmann.
Lindman, H. R. (1992). Analysis of variance in experimental design. New York: Springer-Verlag.
Merz, C. J., & Murphy, P. M. (1996). UCI repository of machine learning databases. Available from: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Rasmussen, C. E. (1996). Evaluation of gaussian processes and other methods for nonlinear regression. Unpublished doctoral dissertation, University of Toronto, Toronto, Canada.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames, IA: Iowa State University Press.

Received October 21, 1996; accepted January 9, 1998.
LETTER
Communicated by Barak Pearlmutter
Probability Density Estimation Using Entropy Maximization Gad Miller David Horn School of Physics and Astronomy, Tel Aviv University, Tel Aviv 69978, Israel
We propose a method for estimating probability density functions and conditional density functions by training on data produced by such distributions. The algorithm employs new stochastic variables that amount to coding of the input, using a principle of entropy maximization. It is shown to be closely related to the maximum likelihood approach. The encoding step of the algorithm provides an estimate of the probability distribution. The decoding step serves as a generative mode, producing an ensemble of data with the desired distribution. The algorithm is readily implemented by neural networks, using stochastic gradient ascent to achieve entropy maximization.

1 Introduction

The problem of constructing a probability density function p(x) from a given finite number of data points is well known. There exist many approaches in the statistical literature (see Bishop, 1995, for a short review), which are usually divided into parametric and nonparametric methods, as well as mixtures of the two. The parametric approach starts from an assumed family of distributions and tries to find the best fit. The simplest nonparametric approach is binning the data and smoothing the resulting histogram. Examples of mixture methods are maximum likelihood (ML) and expectation-maximization (EM) algorithms. Neural networks can be naturally included in many of these approaches. The new approach that we propose here is to employ entropy maximization. In this approach we make use of ideas that have recently been applied successfully to blind separation and blind deconvolution problems (Bell & Sejnowski, 1995). While the latter are usually stated within the framework of information maximization, they naturally reduce to entropy maximization (Nadal & Parga, 1994), as pointed out by Roth and Baram (1996), who have developed and employed this method for probability estimation. As will be explained in the next section, this involves, at some stage, the construction of a new, uniformly distributed variable. In our algorithm, this construction will correspond to an encoding step, in which a neural network is trained with the data so as to produce a uniform output. In this process, it creates a representation of the probability distribution that we look for. A second step
of the algorithm can be viewed as a decoding step, in which a uniformly distributed input is fed to another network, which is also trained by the original examples, and serves as a generative model; it produces an output distributed according to the probability density function of the data. The general structure of our algorithm is explained in the next section using the problem of conditional density function estimation. Here the data are presented in pairs (x, y), and the problem is to find p(y|x). Probability density estimation can be viewed as a special simplified case of conditional density function estimation. Numerical examples of probability distributions serve to demonstrate our method. Section 3 discusses the relation of the maximum entropy (ME) approach to the well-known ML one. This is followed in the next two sections by further applications of the ME method. In section 4 we apply it to the case of stochastic variables with deterministic functional relations, and in section 5 we discuss the possible use of ME as a tool for Monte Carlo generation of artificial data.

2 Entropy Maximization

In this section we demonstrate our method on the problem of estimation of conditional density functions. Data are given as samples of pairs of variables (x, y) that are generated by some distribution function p(x, y), and we wish to determine the functional form of p(y|x). We propose to solve this problem by constructing a network with inputs (x, y) and an output v that is bounded in the domain $0 \le v(x, y) \le 1$ and is assumed to be one-to-one with respect to y. Moreover, our aim will be to have this new stochastic variable uniformly distributed, $V = U(0, 1)$, and independent of X. As we will show, this can be achieved by requiring entropy maximization. We refer to this stage of our algorithm as the encoding step, replacing the variable y by v. Let us define the joint entropy of the output variables:

$$H(X, V) = -\int p(x, v) \ln p(x, v)\, dx\, dv. \tag{2.1}$$

This can be rewritten in terms of the input variables as

$$H(X, V) = -\int p(x, y) \ln \frac{p(x, y)}{J}\, dx\, dy = -\int p(x, y) \ln p(x, y)\, dx\, dy + \int p(x, y) \ln J\, dx\, dy = H(X, Y) + E[\ln J], \tag{2.2}$$

where J is the absolute value of the Jacobian,

$$J = \left|\frac{\partial(x, v)}{\partial(x, y)}\right| = \left|\frac{\partial v}{\partial y}\right|. \tag{2.3}$$
The principle of our algorithm is entropy maximization; we look for a choice of v(x, y) such that H(X, V) is maximized. The entropy can be decomposed into

$$H(X, V) = H(X) + H(V) - I(X; V), \tag{2.4}$$

where I(X; V) is the mutual information between X and V. H(V) reaches its upper bound, H(V) = 0, for a uniformly distributed $v \sim U(0, 1)$ (Cover & Thomas, 1991). I(X; V) is a nonnegative quantity, lower bounded by 0 for the case where X and V are independent. It remains to be shown that there exists a uniformly distributed V that is independent of X. To derive that function, we note that p(x, v) is related to p(y|x) through

$$p(x, v) = \frac{p(x, y)}{J} = \frac{p(y|x)\, p(x)}{J}. \tag{2.5}$$

If V is to satisfy the uniformity and independence conditions,

$$p(x, v) = p(x)p(v) = p(x) \tag{2.6}$$

holds. Comparing equations 2.5 and 2.6, we obtain J = p(y|x), or,

$$\frac{\partial v}{\partial y}(x, y) = \pm p(y|x). \tag{2.7}$$
Explicit solutions of v are

$$v(x, y) = \int_{-\infty}^{y} p(y'|x)\, dy' \quad \text{or} \quad v(x, y) = \int_{y}^{\infty} p(y'|x)\, dy'. \tag{2.8}$$
In our applications we try to realize the solution with a network with parameters $\alpha$ that maximize the entropy H(X, V). We learn from equation 2.2 that this is equivalent to maximizing $E[\ln J]$. Using the stochastic gradient ascent method, this can be done through a learning algorithm

$$\Delta\alpha \propto \frac{\partial \ln J}{\partial \alpha} = J^{-1}\frac{\partial J}{\partial \alpha} = \left(\frac{\partial v}{\partial y}\right)^{-1} \frac{\partial^2 v}{\partial y\, \partial\alpha}. \tag{2.9}$$
When training is completed, we can use its result to estimate the desired conditional distribution function by evaluating

$$\left|\frac{\partial v}{\partial y}(x, y)\right| = p(y|x), \tag{2.10}$$
which is the first step of our task. In other words, by performing the encoding step, which is the construction of the uniformly distributed variable V, we have solved the problem of conditional distribution function estimation. A potential pitfall that deserves attention is the one-to-one assumption of the function v, that is, $J \neq 0$ over the domain of y. This, however, seems quite natural given that two conditions hold: the domain of nonzero probability is connected, and $J \neq 0$ over this domain for the initial estimate of v(t = 0). If gradient ascent is performed sufficiently slowly, in a "quasi-static" limit, such that equation 2.2 always holds, it follows from the last integral in equation 2.2 that v remains one-to-one. If it does not, the integral diverges to $-\infty$, contradicting the fact that it should monotonically increase when climbing along the gradient. Of course, the above discussion holds only in the limit where the number of samples approaches infinity. For any finite sample number, the discrete analog of equation 2.2, developed in the next section in the form of equation 3.4, may increase even as v ceases to be one-to-one, regardless of the step size. In addition, the optimization space may be very complex so as to result in the finding of a local maximum, potentially entailing a poor estimation of the density function. We have not yet answered these questions, which concern the practicality of the ME method. Next we turn to the decoding step, or the generative step of our algorithm. We wish to construct a network that can generate events distributed according to p(y|x) for a given x. Using the output of the encoding step, $(x, y) \to (x, v)$, we train a second network to reproduce the original inputs $(x, v) \to (x, y)$. The training of this secondary net follows standard supervised learning. Once the second network is trained on the data set, it can be used in a generative mode: given a value of x, one draws randomly a value of v that is uniformly distributed $v \sim U(0, 1)$, and uses it as an input to this network. The output y will be appropriately distributed according to p(y|x). That this can be done follows from the fact that the solution v(x, y) to the encoding phase is a monotonic function of y. To be specific, we may stick to the monotonically increasing solution. It is then invertible; a solution y = g(x, v) is well defined. Figure 1 presents the general schematic structure of the two steps of our method. Using real variables (x, y), one can invoke naturally the gradient ascent algorithm for learning $0 \le v(x, y) \le 1$ on a feedforward neural network. The decoding step may require first performing an unsquashing transformation on v, before employing it as an input, together with x, in a standard backpropagation or any other supervised learning procedure. The same kind of formalism can be applied to the more complicated problem of higher-dimensional probability distributions, when x and y are replaced by vector variables of arbitrary dimensions. The general solution to the estimation problem of a conditional distribution function is given by
Figure 1: A schematic representation of the encoding and decoding steps.
the relation

$$p(y|x) = \left|\frac{\partial v}{\partial y}(x, y)\right|, \tag{2.11}$$
where the auxiliary vector variable v is of the same dimension as y, which may be different from that of x. The proof is outlined in appendix A. v is uniformly distributed in a unit cube and is independent of x. It is straightforward to apply this algorithm to the simpler case of probability density function estimation for a single variable y. In this case, the variable x drops out from the formalism, but the logic of the encoding and decoding steps continues to be the same. Our derivation coincides in this case with the work of Roth and Baram (1996), but our implementation is different, allowing for a general neural network to define the functional dependence of v on y. We use this problem for an illustration of our method. Figures 2 and 3 depict two probability distributions used to generate samples for training the encoding step of our algorithm. The estimated distributions are shown, as well as histograms that were generated from the same samples as those used for the encoding. Figure 2 is a bimodal distribution tested with a (small) set of 50 samples only. The samples turn out to be asymmetrically generated, due to the small statistics. This is evident in the histogram and reflected in the ME estimation. The network employed in this case used two hidden units only; hence its structure is

$$v = \sigma\left(\sum_i W_i\, \sigma(w_i y - \theta_i) - \theta_0\right), \tag{2.12}$$
with i = 1, 2. $\sigma$ is the sigmoid function $\sigma(z) = (1 + e^{-\beta z})^{-1}$. In Figure 3 we look at a triangular distribution, this time tested with 200 samples. Correspondingly we use a more complex network, with four hidden units.
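The encoding step is easy to prototype. Below is a minimal sketch for the one-dimensional case, assuming the two-hidden-unit network of equation 2.12 with $\beta = 1$; a finite-difference gradient stands in for the analytic learning rule of equation 2.9, and all names, hyperparameters, and the toy data are our illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def v(p, y):
    """Encoder output: equation 2.12 with two hidden units (beta = 1)."""
    hidden = sigmoid(p[0:2] * y - p[2:4])           # sigma(w_i * y - theta_i)
    return sigmoid(np.dot(p[4:6], hidden) - p[6])   # sigma(sum_i W_i h_i - theta_0)

def log_jacobian(p, y, eps=1e-4):
    """ln|dv/dy| by central difference; equation 2.9 gives the analytic form."""
    return np.log(abs(v(p, y + eps) - v(p, y - eps)) + 1e-300) - np.log(2 * eps)

def train(samples, steps=20000, lr=0.05, seed=0):
    """Stochastic gradient ascent on E[ln J], one randomly drawn sample per step."""
    rng = np.random.default_rng(seed)
    p = rng.normal(scale=0.5, size=7)
    for _ in range(steps):
        y = rng.choice(samples)
        grad = np.zeros_like(p)
        for k in range(p.size):                     # numerical gradient of ln J
            dp = np.zeros_like(p)
            dp[k] = 1e-5
            grad[k] = (log_jacobian(p + dp, y) - log_jacobian(p - dp, y)) / 2e-5
        p += lr * grad
    return p

# Bimodal toy data in the spirit of Figure 2; the density estimate is |dv/dy|.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-1, 0.3, 25), rng.normal(1, 0.3, 25)])
p = train(data)
estimate = [np.exp(log_jacobian(p, y)) for y in np.linspace(-2.5, 2.5, 101)]
```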
Figure 2: Comparison between the ME estimate (solid line) and the histogram of 50 samples drawn from a bimodal distribution (dotted line). The network subserving the ME encoding step contains two hidden units.
In general, we need a network whose complexity (number of parameters) is less than linear in the number of samples, for example, a square root of the latter. Clearly this is independent of the complexity of the distribution and its dimensionality.

3 Maximum Entropy versus Maximum Likelihood

It is instructive to draw a comparison between the ML method and the proposed ME one. A connection may be expected in view of previous results (Pearlmutter & Parra, 1996; MacKay, 1996; Cardoso, 1997) casting independent component analysis into the ML formalism. Suppose we have a set $\chi = \{x_1, x_2, \ldots, x_M\}$ of M statistically independent samples of some random variable X, and that the distribution of X is assumed to belong to some known class of distributions with probability density functions $f(x; \alpha)$. Let us denote by $\alpha_0$ the true parameters of X, that is, $f(x; \alpha_0) = p(x)$. According to the ML approach, we maximize the
Figure 3: Comparison of the ME estimate and a histogram of 200 samples drawn from a triangular distribution. The larger sample size allows the use of a more complex network with four hidden units.
expression:

$$E = p(\chi|\alpha) = \prod_{i=1}^{M} p(x_i|\alpha) = \prod_{i=1}^{M} f(x_i; \alpha). \tag{3.1}$$

This is equivalent to maximizing

$$\frac{\ln E}{M} = \frac{\sum_{i=1}^{M} \ln f(x_i; \alpha)}{M} = \hat{E}[\ln f(x; \alpha)], \tag{3.2}$$

where $\hat{E}$ denotes the estimated expectancy on an ensemble of size M. In the limit $M \to \infty$, this tends to

$$E[\ln f(x; \alpha)] = \int f(x; \alpha_0) \ln f(x; \alpha)\, dx. \tag{3.3}$$

Since the last integral is maximal for $\alpha = \alpha_0$, ML is a consistent criterion, that is, when the number of samples approaches infinity, finding the ML
solution yields the correct distribution $f(x; \alpha_0)$. The ME algorithm can be cast into a similar form. Starting with the auxiliary variable v, which is of the same dimension as the input variable x that it is to encode, we redefine our problem as searching for a one-to-one function in the family $\{v(x) \mid 0 \le v_i(x) \le 1 \;\forall x \in \mathbb{R}^n, 1 \le i \le n\}$ that maximizes a cost function

$$H = \prod_{i=1}^{M} \left|\frac{\partial v}{\partial x}(x_i; \alpha)\right|, \tag{3.4}$$

which will ensure maximization of the entropy. We then seek an $\alpha_0$ for which H is maximal. As before, this is equivalent to maximizing

$$\frac{\ln H}{M} = \frac{\sum_{i=1}^{M} \ln\left|\frac{\partial v}{\partial x}(x_i; \alpha)\right|}{M} = \hat{E}\left[\ln\left|\frac{\partial v}{\partial x}(x; \alpha)\right|\right], \tag{3.5}$$

which, in the limit $M \to \infty$, leads to maximization of

$$E\left[\ln\left|\frac{\partial v}{\partial x}(x; \alpha)\right|\right] = \int p(x) \ln\left|\frac{\partial v}{\partial x}(x; \alpha)\right| dx. \tag{3.6}$$

We show in appendix B that H maximization is obtained when

$$\left|\frac{\partial v}{\partial x}\right| = p(x). \tag{3.7}$$
Thus we find that the v that maximizes H also maximizes the entropy of the auxiliary variable,

$$H(V) = H(X) + E\left[\ln\left|\frac{\partial v}{\partial x}\right|\right]. \tag{3.8}$$

This derivation shows the relation between the ME and ML approaches: the space of possible $\left|\frac{\partial v}{\partial x}\right|$ serves as the space of candidate functions over which the ME algorithm is operative. These functions are positive definite but are not necessarily properly normalized. The example of equation 2.12 can serve as an illustration. There are regimes in parameter space where v will not reach its upper limit of 1. In other words, our parameter space is not restricted to properly normalized functions. However, as training proceeds, $\left|\frac{\partial v}{\partial x}\right|$ turns into an appropriate probability distribution function, coinciding with an ML solution of the network that we employ. Hence ME is a realization of ML.

4 Functional Dependence Among Random Variables

We expect the entropy maximization algorithm to be a useful tool in many applications. In this section we wish to demonstrate how it can come in
handy in solving problems where there is a known, or expected, dependence among some of the stochastic variables. We restrict ourselves to estimation in the sense of reproducing the desired random variable rather than finding its density function. We will start with the single variable case: a relation of the type Y = g(X) between two stochastic variables. We can think of three interesting cases:

1. g is known, and samples of X are given. An estimate of Y is to be obtained. This case is the simplest and is solved by the estimation of X, to which one applies g(X).

2. g is known and samples of Y are given. An estimate of X is sought. In this case we employ the encoding stage on Y as usual, thus estimating v(y). We then apply a second encoding step, maximizing the entropy of v(g(w(u))), where u is drawn from the usual uniform distribution. The entropy of v is maximal when it is uniformly distributed. By the way v was constructed, this means that $g(w) \sim Y$, from which $w(u) \sim X$.

3. Samples of both X and Y are given in the form $\{x_i\}$ and $\{y_i\}$ (as opposed to $\{x_i, y_i\}$). An estimate of g is required. Again we first use the encoding step on Y, thus determining v(y). In the second encoding step, we maximize the entropy of v(w(x)) using the samples of X. When a uniform distribution of v(w) is attained, $w \equiv g$ (a sketch of this case follows below).
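As an illustration of case 3, the following sketch builds on the encoder sketch of section 2, reusing its function v and trained parameters p; the parameterization of w and all hyperparameters are our own illustrative choices, not the authors' construction. Maximizing $\ln|d\, v(w(x))/dx|$ over the parameters of w drives v(w(X)) toward uniformity, whence w approximates g.

```python
def w_map(q, x):
    """An illustrative flexible map R -> R to play the role of w (and hence g)."""
    return q[0] * x + q[1] + q[2] * np.tanh(q[3] * x - q[4])

def train_w(x_samples, p, steps=20000, lr=0.02, seed=2):
    """Second encoding step of case 3: ascend ln|d v(w(x))/dx| so that
    v(w(X)) becomes uniform, using the already-trained encoder v(p, .)."""
    rng = np.random.default_rng(seed)
    q = rng.normal(scale=0.5, size=5)

    def obj(qq, x, eps=1e-4):
        dvdx = abs(v(p, w_map(qq, x + eps)) - v(p, w_map(qq, x - eps))) / (2 * eps)
        return np.log(dvdx + 1e-300)

    for _ in range(steps):
        x = rng.choice(x_samples)
        grad = np.zeros_like(q)
        for k in range(q.size):                 # numerical gradient, as before
            dq = np.zeros_like(q)
            dq[k] = 1e-5
            grad[k] = (obj(q + dq, x) - obj(q - dq, x)) / 2e-5
        q += lr * grad
    return q   # w_map(q, .) is the estimate of g
```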
Things get more interesting when the functional dependence of the random variables becomes more complicated. As an example, suppose that the relation $Y = g(X_1, X_2)$ holds between the three random variables with g known, and also $X_1$ and $X_2$ are known to be independent. This can be a model of a transmitted signal, where $X_1$ is the signal, $X_2$ is the noise, and Y is the corrupted received signal. Samples of $X_1$ and of Y are given, and the purpose is to estimate $X_2$. We begin by encoding Y, thus obtaining v(y). We then construct a network that has two inputs, x and u, and two outputs, a and b. The x input is taken from the $X_1$ samples and the u input is drawn uniformly. The outputs have the functional form a = v(g(x, h(u))) and b = w(x, u), with h and w functions realized by the network. When the entropy of (a, b) is maximized, it is uniformly distributed. In particular, a is uniformly distributed, and by the way v was constructed, we then have $g \sim Y$, implying that $h(u) \sim X_2$. Note that b was introduced here only for computational assistance; it is much easier to maximize the entropy of (a(x, u), b(x, u)) than that of a(x, u) alone, because of the mismatch between the number of inputs and outputs.
5 Generating Artificial Data of a Given Distribution

As a further example of the use of entropy maximization, we address the question of generating random deviates (or artificial data) for a given distribution. The latter can be given in a nonnormalized fashion, as a complicated (nonnegative) function f(x) over some domain. In that case we can proceed, in analogy with the analysis of section 3, with defining an integral to be maximized:

$$I = \int f(x) \ln\left|\frac{\partial v}{\partial x}(x; \alpha)\right| dx. \tag{5.1}$$

The maximization procedure is carried out by gradient ascent on a neural network that implements the mapping $x \to v$, with v turning out to be uniformly distributed in a unit cube. In this process, the Jacobian of the transformation takes the shape of the desired distribution. Once this encoding step is completed, this time through numerical evaluation of $\partial I/\partial\alpha$ rather than through training by data, the decoding step is developed as before. The decoder becomes the desired generator of artificial data. This can be viewed as an alternative to existing methods, such as the rejection method (Press, Flannery, Teukolsky, & Vetterling, 1986). Its generality, as well as the fact that neural networks can serve as universal approximators, makes it a likely candidate for a new tool.
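A minimal one-dimensional sketch of this encode/decode logic follows; the network-based maximization is replaced here by direct numerical construction of the uniformizing map (in one dimension, equation 2.8 makes v the normalized cumulative integral of f), and the target f and all names are our illustrative choices.

```python
import numpy as np

# Encoding: the uniformizing map v(x) built by numerical integration of f.
xs = np.linspace(-4, 4, 2001)
f = np.exp(-xs**2 / 2) * (2 + np.sin(3 * xs))   # nonnormalized, nonnegative target
cdf = np.cumsum(f)
cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])       # monotonic map of x onto [0, 1]

# Decoding: uniform deviates pushed through the inverse map become samples of f.
u = np.random.default_rng(1).uniform(size=10000)
samples = np.interp(u, cdf, xs)
```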
6 Discussion

The principle of maximal entropy, which leads to a uniform distribution of the new random variables, is the cornerstone of the mathematical structure of our model. We have seen that it may be viewed as closely related to the maximum likelihood approach. In a simple problem, like the one depicted in Figure 2, it is very similar to conventional realizations of ML, because its result is close to a sum of two gaussians. This is not the case for more complex problems. Even in the result displayed in Figure 3, the output of the network is quite different from the sum of four gaussians, which might be expected from the hidden layer. The reason lies in the nonlinear character of the output node of our network, which is a squashing function producing the bounded variable v. We distinguished between two different steps of our model: the encoding and decoding ones. The encoding step produces the desired probability distribution, which can be read off its internal neural structure. The decoding phase is useful as a generative mode, such as the "sleep" phase of the Helmholtz machine (Dayan, Hinton, Neal, & Zemel, 1995). In our model, the two steps are completely separated. In particular, the encoding step, which is the analog of the "wake" phase of the Helmholtz machine, is completely independent of the decoding one. In the Helmholtz machine, the two modes are coupled. This is, of course, part of the elegance of that structure, whose motivation was the idea that the generative mode has to play an inherent role in the brain's capability of pattern recognition. Although in principle one could structure the Helmholtz machine to be closer to our algorithm, in our case the two networks performing encoding and decoding have very different architectures that are unrelated to one another. Our generative step is introduced only as a computational tool. It may be useful as a Monte Carlo generator of artificial data. The traditional approach of using neural networks in the realm of probability distributions is to endow single neurons with stochastic transfer functions, as in the Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985) and the Helmholtz machine (Dayan et al., 1995). Our model is based on a completely different point of view. Its neural network component is purely deterministic. The probabilistic character comes from interpreting some of the nodes, forming outputs of the encoding step and inputs of the decoding step, as random variables. This type of variable (Roth & Baram, 1996) emerges naturally in the blind separation algorithm of Bell and Sejnowski (1995) and can be viewed as a novel element in the formation of stochastic neural networks.

Appendix A: Entropy Maximization for Vector Variables

We prove here that the method presented in section 2 can be generalized to x and y of vector form. First, we notice that the discussion in section 2 is immediately generalizable to a vector x simply by substituting the vector x for the scalar x. Its results therefore apply for the case (x, y) with x a vector and y a scalar. In particular, this means that there exists a function $v^0(x, y)$ such that the uniformity and independence conditions (see equation 2.6) hold, whence $H(x, v^0) = H(x)$. Now suppose we have a vector y (and accordingly a vector v of the same dimension, each of its components bounded between 0 and 1). By the argument above, it follows that there exists a function $v_1^0(x, y_1)$ such that

$$H(x, v_1^0) = H(x). \tag{A.1}$$

Similarly, defining $x' \equiv (x, v_1^0(x, y_1))$, there exists a function $v_2^0(x', y_2)$ such that

$$H(x, v_1^0, v_2^0) = H(x', v_2^0) = H(x') = H(x, v_1^0) = H(x). \tag{A.2}$$

This procedure may be iterated until finally we obtain a $v^0$ such that

$$H(x, v^0) = H(x). \tag{A.3}$$
However, as in the scalar case, the relation

$$H(x, v) = H(x) + H(v) - I(x, v) \le H(x) \tag{A.4}$$
holds, with equality iff v is uniformly distributed and independent of x. We deduce that $v^0$ satisfies these conditions, and from now on refer as $v^0$ to any v for which an equality in equation A.4 holds. Indeed, the twofold degeneracy of section 2 is now inflated to at least $2^n (n!)^2$-fold degeneracy. However, it is not the uniqueness of $v^0$ which is of interest to us but rather the uniqueness of $|\partial v^0/\partial y|$, as will become clear shortly. Having the vector analog of equation 2.3, we can continue as in section 2 to obtain

$$p(y|x) = \left|\frac{\partial v^0}{\partial y}(x, y)\right|. \tag{A.5}$$

Using equation A.5, for any (x, y) such that p(x, y) > 0, we have

$$\left|\frac{\partial v^0}{\partial y}(x, y)\right| = p(y|x) > 0, \tag{A.6}$$

and $v^0(x, y)$ is therefore invertible in the domain of nonzero probability to the form $y^0(x, v)$. Finally, for a given $x_0$ choose $v \sim U^n(0, 1)$ and define $g^0(v) \equiv y^0(x_0, v)$. For the distribution of $g^0$ we have, using equation A.5,

$$p(g^0) = \frac{p(v)}{\left|\frac{\partial g^0}{\partial v}\right|} = \left|\frac{\partial v}{\partial g^0}\right| = \left|\frac{\partial v}{\partial y^0}(x_0, y^0)\right| = p(y^0|x_0), \tag{A.7}$$

which shows that $G^0 \equiv Y|X = x_0$. It is evident from the above derivation that in order to achieve a uniform independent distribution, it suffices for every output element $v_i$ to be connected to x and to $(v_1, \ldots, v_{i-1})$. This triangular formation was advocated by Roth and Baram (1996). It has the advantage of having a smaller number of parameters, but is also more constrained; its degeneracy is much reduced, making the global maximum potentially harder to attain.

Appendix B: Proof of ME Consistency

We wish to prove here that the ME criterion is consistent—that is, when the number of samples approaches infinity, the correct distribution is obtained. As explained in section 3, its consistency relies on the following theorem:

Theorem 1. Define $\mathcal{B} = \{g(x) \mid g(x) \text{ is one-to-one and } 0 \le g_i(x) \le 1 \;\forall x \in \mathbb{R}^n, 1 \le i \le n\}$ and let f(x) be a probability density function. If $g(x) \in \mathcal{B}$ maximizes $\int f(x) \ln\left|\frac{\partial g}{\partial x}(x)\right| dx$, then $\left|\frac{\partial g}{\partial x}(x)\right| \equiv f(x)$.
Proof. Anyone who has read appendix A will not be surprised at what is coming, since this is just a different way of stating the same thing. Define

$$\hat{g}_1(x_1) \equiv F_{X_1}(x_1) \equiv \int_{-\infty}^{x_1} dx_1' \int_{-\infty}^{\infty} dx_2' \cdots \int_{-\infty}^{\infty} dx_n'\, f(x_1', x_2', \ldots, x_n'), \tag{A.8}$$

$$\hat{g}_2(x_1, x_2) \equiv F_{X_2|X_1}(x_1, x_2) \equiv \int_{-\infty}^{x_2} dx_2' \int_{-\infty}^{\infty} dx_3' \cdots \int_{-\infty}^{\infty} dx_n'\, f(x_1, x_2', \ldots, x_n')\,/\,f_{X_1}(x_1), \tag{A.9}$$

$$\hat{g}_3(x_1, x_2, x_3) \equiv F_{X_3|X_1, X_2}(x_1, x_2, x_3), \tag{A.10}$$
and so forth, where $f_{X_1}(x_1)$ is the marginal probability distribution of $X_1$. Then $\partial \hat{g}_i/\partial x_i = f(x_i|x_1, \ldots, x_{i-1})$ and for every $i < j$, $\partial \hat{g}_i/\partial x_j = 0$. The matrix $\partial \hat{g}_i/\partial x_j$ is therefore triangular, so that $|\partial \hat{g}/\partial x| = \prod_{i=1}^{n} \partial \hat{g}_i/\partial x_i = \prod_{i=1}^{n} f(x_i|x_1, \ldots, x_{i-1}) = f(x)$. From the construction of $\hat{g}(x)$, it is clear that $\hat{g}(x) \in \mathcal{B}$. This proves the attainability of the limit. To complete the proof, note first that since g is one-to-one, $\partial g/\partial x$ can be assumed to be positive without loss of generality. Thus, using the inequality $\ln x \le x - 1$,

$$\int f(x) \ln \frac{\partial g}{\partial x}\, dx - \int f(x) \ln f(x)\, dx = \int f(x) \ln \frac{\partial g/\partial x}{f(x)}\, dx \le \int f(x) \left(\frac{\partial g/\partial x}{f(x)} - 1\right) dx \tag{A.11}$$

$$= \int \frac{\partial g}{\partial x}\, dx - \int f(x)\, dx = \int dg - \int f(x)\, dx \le 1 - 1 = 0, \tag{A.12}$$
with equality holding iff $\partial g/\partial x \equiv f$.
Acknowledgments

We thank H. Abramowicz, S. Ackerman, I. Meilijson, B. Pearlmutter, E. Ruppin, and S. Solla for helpful discussions.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cog. Sci., 9, 147–169.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Comp., 7, 1129–1159.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for source separation. IEEE Letters on Signal Processing, 4, 112–114.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Comp., 7, 889–904.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript.
Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. Int. Conf. on Neural Information Processing. Hong Kong.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes. Cambridge: Cambridge University Press.
Roth, Z., & Baram, Y. (1996). Multidimensional density shaping by sigmoids. IEEE Trans. on Neural Networks, 7, 1291–1298.

Received May 8, 1997; accepted April 2, 1998.
ARTICLE
Communicated by P. Read Montague
Competition for Neurotrophic Factors: Mathematical Analysis T. Elliott N. R. Shadbolt Department of Psychology, University of Nottingham, Nottingham, NG7 2RD, United Kingdom
Neurotrophic factors, particularly the neurotrophin gene family of neurotrophic factors, are implicated in activity-dependent anatomical plasticity in the visual cortex and at the neuromuscular junction. Accumulating evidence implicates neurotrophic factors as possible mediators of activity-dependent competition between afferents, leading to the segregation of afferents' arbors on the target space. We present a biologically plausible mathematical model of competition for neurotrophic factors. We show that the model leads to anatomical segregation, provided that the levels of neurotrophic factors released in an activity-independent manner, or the levels available by exogenous infusion, are below a critical value, which we derive. Above this critical value, afferent segregation breaks down. We also show that the model segregates afferents even in the presence of very highly correlated patterns of afferent activity. The model is therefore ideally suited for application to the development of ocular dominance columns in the kitten visual cortex.

1 Introduction

Activity-dependent, competitive interactions between afferents innervating the same target structure are a common feature of the development of the vertebrate nervous system. The molecular bases of these interactions during target innervation are broadly understood, but much remains to be discovered concerning the mechanisms that lead to the segregation of afferents' arbors on the target structure during later stages of development. One attractive model is that just as retrograde neurotrophic factors (NTFs) regulate neuronal differentiation and survival during the early stages of development, they might also regulate synaptic growth and survival during later stages (Purves, 1988, reviewed in Snider & Lichtman, 1996). If endogenous NTFs are in limited supply, then competition between afferents will occur not only during target innervation but perhaps also during synaptic remodeling. Furthermore, exogenous infusion of NTFs might prevent or temper competitive interactions between afferents, and thus either prevent or slow down segregation. Evidence is now accumulating that implicates NTFs in competitive plas-
1940
T. Elliott and N. R. Shadbolt
ticity phenomena. Ventricular infusion of nerve growth factor (NGF) abolishes the effects of monocular deprivation in the rat lateral geniculate nucleus (LGN) (Domenici, Cellerino, & Maffei, 1993) and visual cortex (Maffei, Berardi, Domenici, Parisi, & Pizzorusso, 1992; Berardi et al., 1993; Yan, Mazow, & Dafny, 1996) and tempers them in the cat visual cortex (Carmignoto, Canella, Candeo, Comelli, & Maffei, 1993). Cortical infusion of the neurotrophin NT-4/5, but no other neurotrophin, prevents the atrophy of LGN cell bodies in the ferret following monocular deprivation (Riddle, Lo, & Katz, 1995), and cortical infusion of brain-derived neurotrophic factor (BDNF) or NT-4/5 prevents the anatomical segregation of geniculocortical afferents into ocular dominance columns in the cat (Cabelli, Hohn, & Shatz, 1995). In addition, blockade of the endogenous ligands of the trkB receptor (BDNF and NT-4/5) also inhibits the formation of ocular dominance columns (Cabelli, Shelton, Segal, & Shatz, 1997). At the vertebrate neuromuscular junction, exogenous application of basic fibroblast growth factor and ciliary neurotrophic factor prevents the elimination of polyneuronal innervation (English & Schwartz, 1995), and application of BDNF slows the anatomical but not the physiological segregation of motor neurons (Kwon & Gurney, 1996). Since afferent segregation in the visual cortex (Reiter, Waitzman, & Stryker, 1986; Stryker & Harris, 1986) and at the neuromuscular junction (Srihari & Vrbova, 1978; Thompson, Kuffler, & Jansen, 1979) is known to depend on both pre– and postsynaptic activity, it is important that the production or release, or both, of NTFs should depend on electrical activity. In the rat visual cortex, dark rearing reduces the level of BDNF mRNA (Castren, Zafra, Thoenen, & Lindholm, 1992; Schoups, Elliott, Friedman, & Black, 1995), but the level of NGF mRNA remains unchanged except for a transient increase during a narrow time window (Schoups et al., 1995), and monocular deprivation also decreases the expression of BDNF mRNA (Bozzi et al., 1995). At the developing neuromuscular junction, activity blockade increases the expression of both BDNF mRNA (Koliatsos, Clatterbuck, Winslow, Cayoutte, & Price, 1993) and insulin-like growth factor-1 mRNA (Caroni & Schneider, 1994), while at the adult neuromuscular junction, activity blockade decreases the expression of NT-4/5 mRNA and electrical stimulation increases it (Funakoshi et al., 1995). Nothing is known about the release of neurotrophic factors from either muscle cells or neurons in the visual cortex. However, in the hippocampus, NGF and BDNF are released by a constitutive mechanism, mainly associated with cell bodies, and also by an activity-dependent mechanism that depends on Na+ influx and intra– ¨ & Thoenen, 1995, 1996; Griesbeck, Blochl, ¨ but not extracellular Ca2+ (Blochl Carnahan, Nawa, & Thoenen, 1995; Goodman et al., 1996). In this article, we develop and analyze a mathematical model of retrograde neurotrophic interactions. Our aim is to construct a plausible model that does not impose competition between afferents by using the conventional device of synaptic normalization (von der Malsburg, 1973). Instead,
A Model of Neurotrophic Interactions
1941
we build a model in which competition occurs in a dynamical fashion (see also Bienenstock, Cooper, & Munro, 1982; Bennett & Robinson, 1989; Elliott & Shadbolt, 1996). Previously we made preliminary attempts to construct such a model (Elliott & Shadbolt, 1996), but the resulting model suffered from a number of undesirable features, to be discussed later. Our present model overcomes these difficulties by being formulated in a very different manner, and it constitutes a more general and more powerful neurotrophic model. We show that a key parameter of the model governs the capacity of the model to segregate afferents, even in the presence of very strongly correlated patterns of afferent activity, and we show that when either exogenous levels of NTF or the levels of NTF available by activity-independent release exceed a certain, critical value, afferent segregation suddenly breaks down. In the next section, we formulate and study the model, and in the last section we discuss the model’s assumptions and some general issues raised. 2 A Model of Neurotrophic Interactions In this section, we formulate and analyze our model of activity-dependent neurotrophic interactions. We first formulate it as an abstract mathematical system and then discuss the validity and biological plausibility of the assumptions behind the model. Next, we study some exact solutions of the model. This is followed by a fixed-point analysis, which demonstrates the existence of a parameter regime in which the model always leads to afferent segregation, independent of the strength of correlations in afferent activity, except for perfectly correlated activity. We then extract a random walk approximation to the model, which is used to estimate the number of time steps above which afferent segregation could occur, but below which afferent segregation cannot occur. Finally, we present results of numerical simulations of the model. For the purposes of analytical ease and tractability, we will usually restrict attention to the case of two afferents innervating a target structure. Numerical results, for the purposes of comparison to analytical results, will therefore also be restricted to two afferents. However, numerically we find little qualitative difference between models of two or more afferents, so this justifies, a posteriori, the restriction to the two-afferent case. Furthermore, two afferents are often enough to represent familiar examples of neuronal development: two eyes innervating the optic tectum or, via the LGN, the visual cortex, or the not untypical case of two motor neurons’ initially innervating a muscle fiber. We provide a summary of the key symbols used in the derivation and analysis of our model in Table 1. 2.1 Formulation of the Model. Let letters such as i, j, and k denote afferent cell positions and letters such as x and y denote target cell positions; the vector character of these positions is left implicit for notational simplicity.
1942
T. Elliott and N. R. Shadbolt
Table 1: Key Symbols Used in the Derivation and Analysis of the Model. Symbol
Meaning
i,j,k x,y sxi ai n t rx T0 T1 fx dx 1xy ρi uxi g a τ ² si c yi ,χ δi ,δxi p, q
Afferent cell positions Target cell positions Number of synapses from afferent i to target x Activity of afferent i Time-step number in discrete time Continuous time NTF released from target x Activity-independent release of NTF from target cells Maximum activity-dependent release of NTF from target cells Mean activity of a synapse averaged over all synapses on target x NTF available at target x following diffusion Function characterizing the diffusion of NTF Number of receptors per synapse on afferent i NTF taken up by afferent i from target x following diffusion Function characterizing the dependence of NTF uptake on activity Parameter characterizing activity-independent NTF uptake An exponential decay constant A “learning rate,” depending on τ Number of synapses from afferent i in one-target cell case Critical parameter influencing model’s behavior, c = T0 /(aT1 ) Variable-transformed versions of si and t, respectively Perturbations about fixed points or initial conditions Probability that ai =aj for two afferents (q = 1 − p)
Let the number of synapses from afferent i to target x be denoted by sxi . All synapses are taken to possess fixed and equal efficacies; this is justified since we consider anatomical and not physiological plasticity. Let the activity of afferent i be denoted by ai ∈ [0, 1]. We will formulate our model initially in discrete time steps. Thus, time-dependent variables such as ai and sxi will possess superscripts to denote the time step: ani and snxi for time step n. We will then make the transition to continuous time, and time-dependent variables will be written as, for example, ai (t) and sxi (t). For notational simplicity we will not normally indicate this time-dependence explicitly. Each target cell x, at time step n, is taken to release an amount of NTF given by rnx = T0 + T1 fxn ,
(2.1)
where 0 ≤ fxn ≤ 1 is a function of the afferent input at time-step n. Thus, there is an activity-independent component given by T0 and an activitydependent component, up to a maximum of T1 . The simplest, parameterfree choice for the function f is P n n s a n (2.2) fx = Pi xin i , i sxi
A Model of Neurotrophic Interactions
1943
which represents the mean activity of a synapse averaged over all synapses. This has the disadvantage that a target cell possessing only one synapse may be induced to release the maximum amount of NTF. However, in practice, we will find that target cells rarely reach such states. Other choices for f are possible, such as a logistic function. This would overcome the disadvantage of the unrealistic release properties inherent in equation 2.2, but a logistic function would introduce two parameters, and its mathematical form would considerably complicate our analysis. In the following we will use only the form given in equation 2.2. We assume that the NTF released by target cells rapidly diffuses through the target field, so the amount available at each target cell following diffusion in time-step n is given by dnx =
X
1xy rny ,
(2.3)
y
a function characterizing the diffusion process and is normalwhere 1xy is P ized so that y 1xy = 1 ∀x. This equation represents the raw release of NTF from target cells convolved with the diffusion function. Biologically, this amounts to the assumption that NTF diffuses from each target cell independent of the diffusion from all other target cells, and that the amount available, after diffusion, at each target cell is just the sum of the amounts reaching that cell from all other cells. The NTF available at any given target cell following diffusion is assumed to be rapidly taken up by the afferents synapsing on it. We take the uptake by afferent i from target x at time-step n to be proportional to the number of terminals i has on x, snxi , and to the “affinity” of each terminal for the factor at time-step n, ρin . The uptake is also assumed to be a function of afferent activity. The affinity is interpreted as the number of receptors for the NTF possessed by each terminal. We will consider two possibilities: ½ ρin =
λ P λ¯ani / x sxi
Case 1 (cell death) Case 2 (no cell death),
(2.4)
where λ is an arbitrary constant and a¯ ni denotes the recent time average of the activity of afferent i computed at time-step n. The cell death case assumes a fixed number of receptors per terminal, independent of the total number of terminals. The no-cell-death case, on the other hand, assumes that the number of receptors per terminal is inversely proportional to the total number of terminals supported by the afferent and proportional to the recent time average of the activity of the afferent. This means that the efficacy of uptake per terminal by afferents with smaller axonal arbors is greater than that of afferents with larger axonal arbors. These two possibilities are called cell death (CD) and no cell death (NCD) because we will find that the CD case permits afferents to retract all their terminals, which we interpret as
1944
T. Elliott and N. R. Shadbolt
indicating the death of afferent cells, while the NCD case does not permit complete retraction, except when the recent time average of afferent activity falls to zero. The uptake, at time-step n, by afferent i from target x is then taken to be unxi = Qnx dnx snxi g(ani )ρin ,
(2.5)
where Qnx is a constant of proportionality and g is some function describing the dependence of NTF uptake on afferent activity. We assume complete uptake by all afferents, so that the available poolPof NTF at each target cell is exhausted at each time step. This means that i unxi = dnx , which defines Qnx as Qnx
=
" X
#−1 snxi g(ani )ρin
.
(2.6)
i
A particularly simple form for the function g is g(ani ) = a + ani ,
(2.7)
where a is a constant determining the capacity of an inactive afferent to take up the NTF. Although other forms are possible, the simple form defined by equation 2.7 satisfies the basic requirement that the level of afferent activity determines the capacity for uptake, subject to a resting level of uptake. We will thus use this form. Putting all this together, we have that the uptake of NTF by afferent cell i from target cell x following diffusion is given by P n n! Ã X sn (a + ani )ρin j syj aj n P xi n 1xy T0 + T1 P n (2.8) uxi = n n. j syj j sxj (a + aj )ρj y We will assume that the number of synapses between afferent i and target x at time-step n + 1 is equal to the recent time average of the uptake of NTF by afferent i from target x, computed at time-step n, so that = u¯ nxi . sn+1 xi
(2.9)
This, together with equation 2.8, defines a set of coupled, nonlinear recurrence relations defining the evolution of the snxi . The order of the recurrence relations depends on the time averaging in equation 2.9. The simplest selection for the recent time average of some set of time-dependent quantities wn is w¯ n =
n X 1 wm , 1 + l m=n−l
(2.10)
A Model of Neurotrophic Interactions
1945
where l is some nonnegative integer defining the number of time steps over which the time average is taken. The order of the associated recurrence relations would then just be l. This choice, however, has at least three difficulties. First, it is biologically implausible since it assumes a strict cutoff with nondecaying contributions from time steps prior to the cutoff. Second, from a mathematical point of view, high-order recurrence relations are difficult to analyze. Third, from a computational perspective, large l would be very memory intensive, so only simulations of small collections of afferent and target neurons could be performed. An alternative and biologically plausible time average can be defined by convolving the quantities wn with a decaying exponential. In continuous time, the time average of w(t) is thus given by ¯ w(t) =
1 τ
Z
t
−∞
0
dt0 w(t0 )e−(t−t )/τ ,
(2.11)
where τ is a decay constant that sets the time scale for the time average. The ¯ overall factor of 1/τ ensures that w(t) = w when w(t) = w, a constant. In discrete time, this becomes w¯ n = (1 − e−1/τ )
n X
wm e−(n−m)/τ .
(2.12)
m=−∞
Defining φ = e−1/τ , the time average at one time step is simply related to the time average at the previous time-step through w¯ n+1 = (1 − φ)wn+1 + φ w¯ n .
(2.13)
Thus, in addition to being biologically plausible, the time average defined by equation 2.12 results in only first-order recurrence relations, which are more mathematically tractable, and also an averaging procedure that is not computer memory intensive. Using equation 2.9 together with equation 2.13, we have, = u¯ nxi = (1 − φ)unxi + φ u¯ n−1 = (1 − φ)unxi + φsnxi , sn+1 xi xi
(2.14)
which gives, n n n sn+1 xi − sxi = ²(uxi − sxi ),
(2.15)
where ² = 1 − φ is a learning rate that emerges automatically from the
1946
T. Elliott and N. R. Shadbolt
time-averaging process. Inserting unxi from equation 2.8, we have, sn+1 xi
−
snxi = ²snxi
" X
P
à 1xy T0 +T1
y
# ! n n (a+an )ρ n j syj aj P n P n i ni n −1 , (2.16) j syj j sxj (a+aj )ρj
which constitute a set of first-order recurrence relations in the snxi . In the continuous time limit, these recurrence relations give the first-order differential equations, # " Ã ! P X (a + ai )ρi dsxi j syj aj P = ²sxi 1xy T0 + T1 P −1 , dt j syj j sxj (a + aj )ρj y
(2.17)
that is, we interpret equation 2.16 as constituting a one-step, Euler method for numerically integrating equation 2.17, where we have absorbed the step size into a redefinition of ². (In fact, the continuous time limit in equation 2.17 follows straightforwardly from taking the time derivative of equation 2.11 with w = uxi and sxi = u¯ xi , in which case we obtain that ² = 1/τ , without absorbing a step size.) These equations represent the final form of our model for both the CD and the NCD cases. 2.2 Exact Solutions. We now extract some exact solutions from the set of equations defined by equation 2.17 and examine their behavior. For increased tractability, we assume that there is no diffusion of NTF between target cells, so that 1xy = δxy , the Kronecker delta function. For the neuromuscular junction, this is likely to be a reasonable approximation, since it appears that motor neurons compete for muscle fibers on a fiber-by-fiber basis, rather than for groups of neighboring fibers. In the visual cortex, such an assumption would similarly result in LGN cells’ competing for cortical neurons on a neuron-by-neuron basis, which would be inconsistent with the extended size of ocularity domains. While this is unsatisfactory, we do not expect the inclusion of diffusion otherwise to affect significantly the dynamics of our model; this is discussed in the next section. To obtain exact solutions, we also restrict to the CD case, so that ρi = λ ∀i, since analysis of the NCD case appears to be more difficult. We may then safely consider only one target cell and drop the subscript x so that si denotes the number of synapses between afferent i and the one target cell. Then, dsi = ²si dt
"Ã
P
j sj aj
T0 + T1 P
j sj
!
# a + ai P −1 , j sj (a + aj )
(2.18)
from which we see that the ratio si /sj evolves as d dt
P µ ¶ µ ¶ sk ak si si 1 P (ai − aj ), =² T0 + T1 Pk sj sj s s (a k k k k + ak )
(2.19)
A Model of Neurotrophic Interactions
and the sum
P
k sk
1947
evolves as "µ
P ¶ X # s a k k T0 + T1 Pk sk − k sk k ! Ã X sk . ≤ ² T0 + T1 −
d X sk = ² dt k
(2.20)
k
Therefore, the sum evolves to a value bounded from above by T0 + T1 . (This upper bound also applies to each target cell in the NCD case.) In the limit that a → ∞, so that the resting uptake of NTF by afferents dominates the activity-dependent component of uptake, the ratio si /sj remains fixed. This shows that if segregation is to occur, then a must be sufficiently small. Also, when ai > aj , the ratio si /sj grows; when ai < aj , it decays; and when ai = aj , it Premains fixed. So if, say, a¯ i always exceeds a¯j , then si /sj → ∞, and since k sk remains finite, sj → 0. We therefore restrict to the nontrivial case when the afferents have the same mean activities. With initial conditions given at t = t1 , equation 2.18 may be transformed into a more transparent form by using the new variables yi = T11 e²(t−t1 ) si and χ = e²(t−t1 ) . Defining ac = T0 /T1 , we obtain P dyi j yj (ac + aj ) (a + ai )yi P = P . dχ j yj (a + aj ) j yj
(2.21)
While the “critical” point c = 1, corresponding to aT1 = T0 , dramatically simplifies this equation, in fact the behavior of the solutions at this point is untypical. We shall return to this issue. We will find in both the CD and NCD cases that the size of the parameter c determines whether afferent segregation occurs. Modulo the factor of a, c is simply the ratio of the activity-independent release (or, equivalently, the exogenous infusion) to the maximum activity-dependent release of NTF. It therefore has a fairly direct biological interpretation. From equation 2.21 we immediately obtain the result that a + ai yi dyi = . dyj a + aj yj
(2.22)
Thus, if the ai are constant in the time interval [t1 , t2 ], which corresponds to the χ -interval [χ1 , χ2 ], where χ1 = 1, then we have that µ
yi (χ2 ) yi (χ1 )
¶1/(a+ai )
µ =
yj (χ2 ) yj (χ1 )
¶1/(a+aj )
,
(2.23)
1948
T. Elliott and N. R. Shadbolt
or µ
yi (χ2 ) yj (χ2 ) = yj (χ1 ) yi (χ1 )
¶ a+aj
a+ai
pji
= Aji yi (χ2 ),
(2.24)
pji
where Aji = yj (χ1 )/yi (χ1 ) and pji = (a + aj )/(a + ai ). For ai = b ∀i, where b is some constant,P we have that pji = 1 ∀i, j. Thus, equation 2.24, together with the result that k [yk (χ2 ) − yk (χ1 )] = (ac + b)(χ2 − χ1 ), gives · ¸ 1 (ac + b)(χ2 − χ1 ) , yi (χ2 ) = yi (χ1 ) 1 + P k yk (χ1 )
(2.25)
· ³ ´¸ T1 (ac + b) 1 − e−²(t2 −t1 ) . si (t2 ) = si (t1 ) e−²(t2 −t1 ) + P k sk (t1 )
(2.26)
or
For arbitrary ai , but still constant in [t1 , t2 ], equation 2.24 allows us to decouple the differential equations in equation 2.21 to obtain Z
yi (χ2 ) yi (χ1 )
dy
(
P j
P Aji ypji −1 )( j Aji pji ypji −1 ) = (ac + ai )(χ2 − χ1 ), P c pji −1 j Aji pji y
(2.27)
where pjic = (ac + aj )/(ac + ai ). (Such simple decoupling does not appear to be possible in the NCD case because equation 2.22 contains additional factors on the right-hand side.) In general, the multiple exponents in the integrand prevent the integral from being evaluated exactly. This difficulty may be avoided by assuming that the ai take only two possible values, so that the pji take only two values. This is essentially equivalent to assuming that in the time interval [t1 , t2 ], there are only two afferents, since different afferents with identical activity would be indistinguishable in terms of the evolution equations. We therefore restrict to the explicit case of only two afferents. For notational convenience, in expressions or sentences in which the subscripts i and j appear simultaneously, but in which neither i nor j is summed over, we will use the subscript j to denote the afferent not denoted by the subscript i, so that when i = 1, j = 2 and when i = 2, j = 1. For the two-afferent case, equation 2.27 becomes Z
yi (χ2 ) yi (χ1 )
dy
(1 + Aji ypji −1 )(1 + Aji pji ypji −1 ) 1 + Aji pjic ypji −1
= (ac + ai )(χ2 − χ1 ).
Substituting z = ypji −1 , writing bji = (pji − 2)/(pji − 1), βji =
1 pjic (pji
(2.28)
+1−
pji pjic ),
and γji = (1 − βji )/(1 − pji ), and assuming that a 6= 0 and ai 6= aj to avoid the
A Model of Neurotrophic Interactions
1949
points bji = 1 and bji = 2, we obtain Aji pji pji c [yi (χ2 ) − yi (χ1 )] + βji [yi (χ2 ) − yi (χ1 )] pji Z −γji
pji −1
yi
(χ2 )
pji −1 (χ1 ) yi
dz
z−bji = (ac + ai )(χ2 − χ1 ). 1 + Aji pjic z
(2.29)
The integral on the left-hand side of this equation must, in general, be evaluated by a power series expansion. However, for bji an integer, the integral can be evaluated in closed form. If bji = −l, an integer, then pji = (l + 2)/(l + 1). If l ≥ 0, then an elegant assignment is a = l + 1, ai = 0, and aj = 1, that is, one afferent “on” and the other “off.” If l < −2, then we may select a = −(l + 2), ai = 1, and aj = 0. Since we require that a is not too large from equation 2.19, we shall simply take a = 1. Thus, we have two selections of parameters: Selection 1: l = 0, bji = 0, pji = 2, ai = 0, aj = 1. Selection 2: l = −3, bji = 3, pji = 12 , ai = 1, aj = 0. These two cases fortunately represent opposite patterns of afferent activity, and thus permit a (quasi-)analytic study of the solutions to equation 2.21 and an examination of the parameter regimes in which afferent segregation might occur. For pji = 2 (selection 1) we obtain the exact solution, Aji (pjic )2 [y2i (χ2 ) − y2i (χ1 )] + pjic (3pjic − 2)[yi (χ2 ) − yi (χ1 )] +
(pjic − 1)(pjic − 2) Aji
log
1 + Aji pjic yi (χ2 ) 1 + Aji pjic yi (χ1 )
= (pjic )2 (ac + aj )(χ2 − χ1 ), and for pji =
1 2
(2.30)
(selection 2) we have, 1/2
1/2
Aji (3 − 2pjic )[yi (χ2 ) − yi (χ1 )] + [yi (χ2 ) − yi (χ1 )] 1/2
+
Aji2 (2pjic
−
1)(pjic
− 1) log
= (ac + ai )(χ2 − χ1 ).
Aji pjic + yi (χ2 ) 1/2
Aji pjic + yi (χ1 ) (2.31)
It is straightforward to transform these into solutions for the si . Although these equations represent exact solutions, they are implicit in the si , so, given si (t1 ), si (t2 ) must be determined numerically. However, for the specific point c = 1, at which pjic = pji , the coefficients of the log terms in equations 2.30
1950
T. Elliott and N. R. Shadbolt
and 2.31 vanish, leaving quadratic equations whose solutions are trivial. At this point, for pji = 2 (selection 1) we obtain the explicit solution, yi (χ2 ) = and for pji =
yi (χ1 ) yj (χ1 )
1 2
1/2 yi (χ2 ) =
½q £
¾ ¤2 yi (χ1 )+yj (χ1 ) +2yj (χ1 )(χ2 −χ1 )−yi (χ1 ) ,
(2.32)
(selection 2) we have, 1
½q £
1/2
yi (χ1 )
¾ yi (χ1 )+yj (χ1 ) +2yi (χ1 )(χ2 −χ1 )−yj (χ1 ) . (2.33) ¤2
This simplification occurs because the coefficient of the integral in equation 2.29 vanishes at c = 1, which reflects a cancellation in the integrand in equation 2.27, which in turn reflects the simplification that occurs in equation 2.21 at c = 1. We can now examine the evolution of the si , i = 1, 2, in one or two time intervals during each of which the ai , i = 1, 2, are constant. We set T1 = 1 without loss of generality since T1 sets the overall scale for the si , i = 1, 2. Because we have solutions that require that a = 1, we therefore have that c = T0 . We shall examine the solutions as a function of the parameter c, which determines the ratio between the activity-independent and the maximum activity-dependent release of NTF by target cells. We will restrict attention to the region 0 < c ≤ 1; we will consider c > 1 in subsequent sections. We take ² = 0.018, which means that φ 250 ∼ 0.01, and we consider time intervals 1t of unit size. The quantity P
j sj aj
T0 + T1 P
(2.34)
j sj
sets the overall scale for
P
j sj ,
and, assuming that a¯ i =
1 2,
i = 1, 2, it has
an average value given by T0 + 12 T1 = 12 + c. Thus, we employ the initial conditions at t = 0 given by si (0) = 14 (1 + 2c), i = 1, 2. Figure 1 shows the changes in the si , i = 1, 2 that occur during the time
interval [0, 1] when one afferent, i, is “on” (ai = 1) and the other, j, is “off” (aj = 0), plotted as a function of c, obtained by numerically solving equations 2.30 and 2.31 by Newton iteration. At c = 0, pjic = aj /ai , which is undefined for ai = 0, so we avoid this point. Figure 1a shows the shift in the two solutions, 1si = si (t) − si (0), i = 1, 2, at t = 1. We see that one (si ) increases, while the other (sj ) decreases, with the extent of the change increasing with c. Figure 1b shows the relative difference between the two solutions, 1ij = (si −sj )/(si +sj ), at t = 1. The relative difference is marginally larger (less than 1%) for c = 10−3 (the smallest value of c we consider) than for c = 1.
A Model of Neurotrophic Interactions
A
1951
0.005 0.004 0.003 0.002
Shift
0.001 0 -0.001 -0.002 -0.003 -0.004 -0.005
B
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
0.006012
Relative Difference
0.00601 0.006008 0.006006 0.006004 0.006002 0.006 0.005998
Figure 1: The change in the system after one time step in which one afferent is on and the other is off, for the CD case, for initial conditions si = 14 (1 + 2c), i = 1, 2. The shifts (a) from the initial conditions and (b) the relative difference are shown. The shift increases as a function of c, but the relative difference decreases as a function of c.
In Figure 2 we take ai = 1 and aj = 0 in the time interval [0, 1], but then reverse the afferents’ activities, so that ai = 0 and aj = 1 in the time interval (1, 2]. Figure 2a shows the change 1si , i = 1, 2, and Figure 2b shows the relative difference 1ij , both at t = 2. This figure shows that for c ∼ 1, the changes that accrued during the first time period are almost entirely reversed in the second, while for c ∼ 0, the changes are not completely reversed. This suggests that statistical fluctuations in afferent activity may
1952
T. Elliott and N. R. Shadbolt
A
6e-06 4e-06
Shift
2e-06 0 -2e-06 -4e-06 -6e-06 -8e-06
B
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
2.5e-05
Relative Difference
2e-05 1.5e-05 1e-05 5e-06 0 -5e-06
Figure 2: The change in the system after two time steps, in the first of which one afferent is on and the other is off, and in the second of which the activity pattern in the first time step is reversed, for the CD case, for initial conditions si = 14 (1 + 2c), i = 1, 2. Both (a) the shift from the initial conditions and (b) the relative difference decrease as a function of c.
result in changes that might accumulate in the si , i = 1, 2, for c sufficiently small, but that might not accumulate in the si , i = 1, 2, for c close to unity. Hence, afferent segregation might be possible for c sufficiently small, while it might not be possible for c close to unity. To investigate the role of the initial conditions si (0) = 14 (1 + 2c), i = 1, 2, in this behavior, we drop the c-dependence and consider only si (0) = 14 ,
A Model of Neurotrophic Interactions
1953
i = 1, 2. Figures 3 and 4 show the equivalents of Figures 1 and 2 for this case. We see that the relative difference 1ij increases rather than decreases as a function of c. However, it is important to notice that Figures 3a and 4a indicate that P both the si , i = 1, 2, grow for most values of c. This is because the sum j sj (0) is not matched to the support provided by the target cell; the target cell initially provides a large, generalized stimulus for growth that prevents significant competitive interactions between the two afferP ents. Once the sum j sj reaches a level that matches the support provided, however, the evolution of the si , i = 1, 2, basically reverts to that exhibited in Figures 1 and 2. This is demonstrated in Figure 5, where the stimuli used to generate Figure 4 are applied 2500 times in succession. 2.3 Fixed-Point Analysis. The analysis performed so far has only hinted at the possibility that afferent segregation might be possible for c sufficiently small and might not for c sufficiently large; the results in Figure 5 demonstrate this, but they are derived from repeated, numerical re-solution of equations 2.30 and 2.31. Furthermore, we have studied the solutions only for strictly anticorrelated patterns of afferent activity and only for 0 < c ≤ 1. Moreover, we have studied analytic solutions for the CD case only. In order to characterize the solutions of equation 2.17 more generally, we now perform a fixed-point analysis, although we will continue to consider only two afferents, and either one (CD case) or two (NCD case) targets cell. We also continue to assume that 1xy = δxy . 2.3.1 The CD Case. Equation 2.18 manifestly possesses a fixed point at si = 0. We must determine whether this point is stable or unstable to small perturbations about it. We set si = δi and sj = (T0 + 12 T1 )+δj = T1 (ac+ 12 )+δj , where, as before, j denotes the afferent not denoted by i, and δi and δj are small perturbations. We then linearize the evolution equations to obtain ! à ac + aj a + ai dδi = ² δi −1 , (2.35) dt ac + 12 a + aj à ! ¶ µ ac + aj dδj 1 = ² T1 aj − −² − 1 δj dt 2 ac + 12 ¸ · (ac + aj )(a + ai ) 1 +² δi − (ac + aj )(δi + δj ) . (ac + ai )δi − a + aj ac + 12 (2.36) The second of these indicates that T1 (ac + 12 ) is not a fixed point for sj unless we average over afferent activity and assume that the average values of ai and aj are both 0.5. To perform the averaging over afferent activity, we will consider binary activity only, so that ai ∈ {0, 1}, i = 1, 2. Let p be the probability that both afferents have the same activity in any given time
1954
T. Elliott and N. R. Shadbolt
A
0.014 0.012 0.01
Shift
0.008 0.006 0.004 0.002 0 -0.002
B
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
0
0.1
1
0.018
Relative Difference
0.016 0.014 0.012 0.01 0.008 0.006
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
Figure 3: As Figure 1, but with initial conditions si = 14 , i = 1, 2. (a) The shift from the initial conditions is now positive for both afferents, for most values of c. This means that both afferents experience a generalized stimulus for growth. (b) The relative difference increases as a function of c.
interval. Then we have four possible activity patterns with probabilities given by: P(ai = 0 & aj = 0) = p/2,
(2.37)
P(ai = 0 & aj = 1) = (1 − p)/2,
(2.38)
P(ai = 1 & aj = 0) = (1 − p)/2,
(2.39)
P(ai = 1 & aj = 1) = p/2,
(2.40)
A Model of Neurotrophic Interactions
A
1955
0.018 0.016 0.014
Shift
0.012 0.01 0.008 0.006 0.004 0.002 0
B
0
0.1
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
1
0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 c
1
0.0006
Relative Difference
0.0005 0.0004 0.0003 0.0002 0.0001 0
Figure 4: As Figure 2, but with initial conditions si = 14 , i = 1, 2. After two time steps, both afferents continue to grow.
assuming an unbiased distribution of “on” and “off” states. Averaging equations 2.35 and 2.36 over these activity patterns, we obtain, (c − 1)(1 − p) d hδi i = ² hδi i , dt (2ac + 1)(a + 1) d (hδi i + hδj i) = −² (hδi i + hδj i), dt
(2.41) (2.42)
where hi denotes the averaging. We thus see that for c < 1, the point si = 0, sj = T1 (ac + 12 ) is a stable fixed point (an “attractor”) of equation 2.18
1956
T. Elliott and N. R. Shadbolt
A
0.7 0.6 0.5 0.4
Shift
0.3 0.2 0.1 0 -0.1 -0.2 -0.3
B
0
0.1
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
1
1 0.9
Relative Difference
0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Figure 5: As Figure 4, except that the two alternating activity patterns used to generate Figure 4 are applied 2500 times in succession. After reaching a point at which the afferents match the support provided by the target cell, the behavior of the system basically reverts to that in Figures 1 and 2. We see strict segregation of the afferents for c less than approximately 0.45. Increasing the number of times that the activity patterns in Figure 4 are applied increases this value.
averaged over afferent activity, for all values of p < 1. For c > 1, this point is unstable (a “saddle”), again for all values of p < 1.
A Model of Neurotrophic Interactions
1957
By symmetry, the fixed point si = 0, sj = T1 (ac + 12 ) is paired with another fixed point at si = T1 (ac + 12 ), sj = 0. There is, in addition, another fixed point of equation 2.18 averaged over afferent activity, given by si = sj = 12 T1 (ac + 12 ). Expanding about this point as usual, and averaging the linearized equations as before, we obtain d (hδi i + hδj i) = −²(hδi i + hδj i), dt 2a(c − 1)(1 − p) d (hδi i − hδj i) = −²(hδi i − hδj i) . dt (2a + 1)2 (2ac + 1)
(2.43) (2.44)
For c < 1, this point is unstable (a “saddle”), while for c > 1 this point is stable (an “attractor”), for all values of p < 1. These three fixed points provide a complete understanding of the evolution of the si , i = 1, 2, determined from equation 2.18 averaged over all afferent activity patterns. For c < 1, except when si = sj initially, afferent segregation always occurs, for all values of p < 1. Conversely, for c > 1, except when either si = 0 or sj = 0 initially, afferent segregation never occurs, P for all values of p < 1. For c = 1, the entire line j sj = T1 (ac + 12 ) is a stable attractor, with flow approaching it from either below or above along straight lines projecting radially outward from the origin. Biologically, this means that afferent segregation breaks down when the activity-independent release (or exogenous supply) of NTF exceeds a certain, critical value (e.g., Cabelli et al., 1995; English & Schwartz, 1995; Kwon & Gurney, 1996). Although numerical simulation could establish that segregation breaks down beyond some point, only analysis can establish precisely at which point segregation breaks down. Figure 6 shows the flow derived from equation 2.18 by averaging over afferent activity. In Figure 6a we set c = 12 and T1 = 1 and a = 1, in Figure 6b we set c = 1 and T1 = 23 , and in Figure 6c we set c = 32 and T1 = 12 , and we have changed the values of T1 in Figure 6b and Figure 6c so that the fixed points are at the same positions as in Figure 6a. For visual impact, we have analytically continued to the region p < 0 and set p = −3. This has the effect of accentuating the flow; the figure is qualitatively identical for 0 ≤ p < 1. 2.3.2 The NCD Case. To perform the fixed-point analysis for the NCD case, we drop the a¯ i term in ρi , i = 1, 2, which is equivalent to assuming that the time averages of all afferents’ activities are fixed and equal. For studies of normal development, this is a reasonable assumption. We restrict to two target cells and, as usual, two afferent cells. We will employ an equivalent convention for target cells as that used for afferent cells: that in expressions or sentences in which the subscripts x and y appears simultaneously, but in which neither x nor y is summed over, the subscript y denotes the target cell not denoted by the subscript x.
1958
T. Elliott and N. R. Shadbolt
A
1.4 1.2
Solution 2
1 0.8 0.6 0.4 0.2 0
B
0
0.2
0.4
0.6 0.8 Solution 1
1
1.2
1.4
0
0.2
0.4
0.6 0.8 Solution 1
1
1.2
1.4
0
0.2
0.4
0.6 0.8 Solution 1
1
1.2
1.4
1.4 1.2
Solution 2
1 0.8 0.6 0.4 0.2 0
C
1.4 1.2
Solution 2
1 0.8 0.6 0.4 0.2 0
Figure 6: Flows derived from the fixed-point analysis of the linearized, afferentactivity-averaged equations for the CD case. For (a) c < 1, segregated final states attract almost all the flow, while for (c) c > 1, an unsegregated final state attracts almost all the flow. At (b) c = 1, the flow consists of straight lines projecting radially from or toward the origin, with flow either P upward or downward, depending on the initial conditions, toward the line j sj = T1 (ac + 12 ) (dashed line).
A Model of Neurotrophic Interactions
1959
One fixed point is at sxi = 0, sxj = T1 (ac + 12 ), syi = T1 (ac + 12 ), syj = 0, and this is paired with another fixed point at sxi = T1 (ac + 12 ), sxj = 0, syi = 0, syj = T1 (ac + 12 ). The x-behavior decouples from the y-behavior at each of these points, and it is easy to establish that these two fixed points are identical to the equivalent fixed points in the CD case. Thus, for c < 1, these points are stable attractors, while for c > 1 they are unstable saddles, for all values of p < 1. The analysis of the fixed point at sxi = 12 T1 (ac + 12 ), sxj = 12 T1 (ac + 12 ), syi = 1 1 1 1 2 T1 (ac+ 2 ), syj = 2 T1 (ac+ 2 ) is, however, not identical to the equivalent point in the CD case. Expanding and linearizing about this point, and averaging over afferent activity patterns as usual, after lengthy algebra we obtain, hδxi i qα − 3 −qα − 1 qβ − 1 −qβ + 1 d qα − 3 −qβ + 1 qβ − 1 hδxj i = ² −qα − 1 qβ − 1 −qβ + 1 qα − 3 −qα − 1 hδyi i dt 4 −qβ + 1 qβ − 1 −qα − 1 qα − 3 hδyj i hδxi i hδxj i (2.45) × hδyi i , hδyj i where q = 1 − p, the δxi , and so forth, denote small perturbations about the fixed points, and α and β are given by α=
1 + 4a − 2ac , (2a + 1)2 (2ac + 1)
(2.46)
β=
1 . (2a + 1)2
(2.47)
Because of the symmetries in this matrix, we can read off the eigenvectors and associated eigenvalues by inspection: Eigenvector: (1, −1, −1, 1)T , Eigenvector: (1, −1, 1, −1)T , Eigenvector: (1, 1, 0, 0)T , Eigenvector: (0, 0, 1, 1)T ,
Eigenvalue: ² q(α − β)/2 Eigenvalue: ² [q(α + β) − 2]/2 Eigenvalue: −² Eigenvalue: −²,
where the superscript T denotes the transpose. The first eigenvalue in this list reduces to 2a(c − 1)(1 − p) , −² (2a + 1)2 (2ac + 1) and the second is ¶ µ 1−p . −² 1 − (2a + 1)(2ac + 1)
1960
T. Elliott and N. R. Shadbolt
Thus, two eigenvalues are negative definite, one eigenvalue is negative semidefinite, and the remaining eigenvalue changes sign at c = 1. For c < 1, this fixed point is therefore an unstable saddle, and for c > 1, it is a stable attractor. Although the analysis is more complicated, this fixed point behaves identically to the equivalent fixed point in the CD case. The NCD case is essentially identical to the CD case. There are three fixed points in each case. Two correspond to segregated final states, and the other corresponds to an unsegregated final state. For c < 1, the segregated final states are stable and attract almost all the flow, while for c > 1, the unsegregated final state is stable and attracts almost all the flow. In the CD case we proved that the limit a → ∞ prevents afferent segregation (see equation 2.19), but this proof does not go through for the NCD case, unless we assume that ρi = ρj ∀i, j. The fixed-point analysis, however, for both the CD and the NCD cases, proves that the limit a → ∞ behaves identically to the point c = 1, at least for the afferent-activity-averaged equations, since (at least) one of the eigenvalues of each of the matrices determining the linear flow becomes zero when either a → ∞ or c = 1. Thus, it is important to realize that the transition from segregation to nonsegregation that occurs at c = 1 is valid only for finite a, since if we take the limit a → ∞ in the expression for c, c = T0 /(aT1 ), we obtain c = 0 always, which (wrongly) predicts segregation always. 2.4 Random Walk Approximation. The fixed-point analysis performed above indicates that the transition from segregation to nonsegregation is discontinuous and occurs at the point T0 = aT1 for all values of p < 1. As p approaches unity, the flow slows down, so that it takes longer to segregate (for c < 1) or longer to abolish partial segregation (for c > 1). Biologically, the time needed to segregate or abolish segregation might be important, because if the capacity for plasticity is under some form of control, then the required time might exceed the time during which the system is able to change. We now examine this issue. To do this, we extract a random walk approximation. We do this for the CD case only. While we thus consider the time taken to segregate in models to be important, we are generally skeptical about claims that models are temporally realistic inasmuch as the time-step sizes that they use can be compared to real times in real biological systems. Because models typically contain many parameters, it is usually easy to adjust some of them (particularly the “learning rate”) so that the models’ time steps can be compared to real times. Provided that ² is sufficiently small, the one-step Euler method defined by equation 2.16 provides a good approximation to the solutions of equation 2.17 with time steps 1t of unit size. Given initial conditions s0i , if we , then we can insert either s0i or sni into the right-hand wish to calculate sn+1 i side of the one-step Euler method corresponding to equation 2.18, since the difference between the resulting values is of order ² 2 , which we can take to be negligible for ² sufficiently small. Using s0i instead of sni , however, has
A Model of Neurotrophic Interactions
1961
the considerable advantage that 1sn+1 = sn+1 − sni depends only on the s0i , i i which are fixed, and the afferents’ activities only at time-step n, and not at are strictly independent any earlier times steps through sni . Thus, the 1sn+1 i are independent of the ani , and the 1sn+1 of the 1sni provided that the an+1 i i take values from the same, possibly infinite set as the 1sni . These are exactly the conditions that define the evolution of the si as a random walk away from their initial values. We may most simply calculate the mean and variance of a random variable W corresponding to a random £ ¤ walk process by using its probabilitygenerating function GW (r) = E rW , where E denotes the expectation value. 2 , are given by Then the mean, µW , and variance, σW µW =
¯ dGW ¯¯ , dr ¯r↑1
(2.48)
and " 2 σW
=
d2 GW dGW − + 2 dr dr
µ
dGW dr
¶2 #¯¯ ¯ ¯ ¯
,
(2.49)
r↑1
when these limits exists. As in the fixed-point analysis, we assume binary afferent activity with probabilities given by equations 2.37 through 2.40. The corresponding step sizes 1sni take values from the set {200 , 201 , 210 , 211 }, where the first subscript denotes the value of ai and the second the value of aj . Taking a = 1 and T1 = 1, and employing the initial conditions s0i = 14 (1 + δi )(1 + 2c), where δi is such that |δi | ≤ 1 and represents a small perturbation about the value 14 (1 + 2c) used earlier, and taking δi + δj = 0, so that one afferent is just above and the other is just below this value, the 2ai aj take the following values: ² 200 = − (1 + δi ), 4 ² 1 + δi [(1 + δi ) + 2(1 − δi )c], 201 = − 4 3 − δi ² 1 + δi [(1 + 3δi ) + 2(1 − δi )c], 210 = + 4 3 + δi ² 211 = + (1 + δi ). 4
(2.50) (2.51) (2.52) (2.53)
The probability-generating function of the random variable 1s for an arbitrary afferent and an arbitrary, single time step is thus given by G1s (r) =
¡ ¢ 1 ¢ 1 ¡ 200 p r + r211 + (1 − p) r201 + r210 , 2 2
(2.54)
1962
T. Elliott and N. R. Shadbolt
and the random variable N X
Si =
1sm i ,
(2.55)
m=1
which denotes the shift of afferent i away from its initial value s0i after timestep N, has a probability-generating function given by GSi (r) = [G1S (r)]N ,
(2.56)
the mean, µi , and variance, σi2 , of which are given by µi =
1 − δi2 1 N²(1 − p)(1 − c)δi , 2 9 − δi2
(2.57)
and σi2 =
½ 1 2 (1 − δi )2 2 1 N² (1 + δi )2 p + (1 − p2 ) δ (1 − c)2 4 4 (9 − δi2 )2 i · ¸¾ 1−p 1 2 2 (1 + 3δ + )(1 + δ ) + (1 + 2δ )(1 − δ )c + (1 − δ ) c . i i i i i 9 − δi2 4 (2.58)
There are two things to notice about the expression for µi . First, for c < 1 and δi > 0 (δi < 0), si is expected to increase (decrease) and sj is expected to decrease (increase). This expectation is consistent with the nature of the fixed points in Figure 6a. However, for N sufficiently large, if δi > 0, then sj is expected to become negative. This is an artifact of the assumption of fixed-step sizes. Analytically, as sj approaches zero, the rate of change of sj decreases, violating the fixed-step size assumption. Similarly, for N sufficiently large, if δi > 0, then si is expected to exceed the average value of P 1 k sk , which is 2 +c. Again, this is an artifact of the assumption of fixed-step sizes. For c > 1, all these results are reversed: si and sj are expected to evolve toward each other, in accord with the fixed-point analysis. However, for N sufficiently large, si and sj are expected to move toward each other but overshoot, so that one becomes large and positive and the other large and negative. Again, this is an artifact of the fixed-step size assumption. Second, for δi = 0, we have that µi = 0. That is, there is no expected shift away from the initial conditions when both afferents start with si = sj = 14 (1 + 2c). This is in accord with the fixed-point analysis. However, our analysis of the exact solutions above indicated that even for δi = 0, specific patterns of afferent activity might induce segregation. This analysis used anticorrelated afferent activity with activity patterns reversing at each time step. In this case, for c near to unity or below, segregation does indeed
A Model of Neurotrophic Interactions
1963
occur, as shown in Figure 5. However, such activity patterns are such that are not independent of the ani , since an+1 = 1 − ani ∀n, but the the an+1 i i random walk approximation requires the assumption of strict independence for its validity. The fixed-point analysis also requires this assumption of independence for the averaging over afferent activity to be valid. For the independent of the ani , as both the random walk and the fixed-point an+1 i analysis require, segregation does not occur for δi = 0. In Figure 7, for ² = 0.018 (as before), N = 106 , and δi = 0.01 we show graphs of the interval [µi −σi , µi +σi ] by showing graphs of µi and µi ±σi , for 0 ≤ c ≤ 1, for three different values of p. Since both µi and σi are proportional to ², the quantitative features of these graphs are independent of ², except for the scale of the vertical axes. As p increases, the intervals broaden and are pulled downward. Each graph in this figure shows a minimum value of c, cseg , below which the expected shift is not consistent with zero and above which the expected shift is consistent with zero. (There is also a value for c > 1 beyond which the expected shift becomes significantly negative for δi > 0, but, as explained above, this is an artifact and is ignored.) We take this as the value of c below which afferent segregation could occur within the number of time steps permitted and above which afferent segregation cannot occur within the number of time steps permitted. Strictly speaking, this interpretation violates the assumption of linearization required to derive the random walk approximation. However, we will find reasonable qualitative agreement with numerical results. The value of cseg can be calculated by solving µi = σi , from which we obtain a messy expression that we do not reproduce here. To be real valued, the expression requires that p≤
Nδi2 . 9 + (1 + N)δi2
(2.59)
As N → ∞, p may approach unity, as expected from the fixed-point analysis. For finite N, the range of values of p in which afferent segregation might be possible within the number of time steps permitted is reduced. Figure 8 shows a graph of p plotted as a function of cseg for δi = 0.01 and for a range of values of N. The qualitative character of the curves is unchanged by changing δi ; broadly speaking, decreasing (increasing) δi for fixed N has the same effect as decreasing (increasing) N for fixed δi . 2.5 Numerical Simulations. We now turn to numerical simulations of the model defined by equation 2.17. We consider two afferents and either one target cell for the CD case or two target cells for the NCD case. In the latter case, we continue to assume that NTF does not diffuse through the target field. We also continue to assume that a = 1 and T1 = 1 so that the numerical results can be compared to the analytical results discussed above. We take the initial conditions as before, s0i = 14 (1 + δi )(1 + 2c), i = 1, 2, except that we
1964
T. Elliott and N. R. Shadbolt
A
10
Expected Range
8
6
4
2
0
B
0
0.2
0.4
0
0.2
0.4
0
0.2
0.4
c
0.6
0.8
1
0.6
0.8
1
0.6
0.8
1
10
Expected Range
8
6
4
2
0
C
c
10
Expected Range
8
6
4
2
0
c
Figure 7: The expected shift, µi ± σi , from the initial conditions given by the random walk approximation. The dashed line is µi , and the solid lines are µi ±σi . Shifts for three different values of p are shown: (a) p = 0.3, (b) p = 0.5, and (c) p = 0.7. In each case N = 106 and δi = 0.01.
A Model of Neurotrophic Interactions
1965
1
Probability
0.8
0.6
0.4
0.2
0
0
0.2
0.4 0.6 Minimal Segregation Point
0.8
1
Figure 8: Graphs of p against cseg , for five different values of N. Moving from bottom left to top right: N = 105 , N = 106 , N = 107 , N = 108 , and N = 109 . For each value of N, the associated curve divides the c-p plane into two pieces. The upper piece is the region in which afferent segregation is predicted by the random walk approximation not to be possible within the number of steps, N, permitted. For each value of N, δi = 0.01.
now take the δi , i = 1, 2, to be random numbers in the interval [−0.01, 0.01] and we do not assume that δi + δj = 0. Afferent activity is binary valued, ai ∈ {0, 1}, i = 1, 2, and such that the probability that both afferents have the same activity in any given time step is p. In each time step, one afferent, i, is assigned activity ai = 0 with probability 0.5; otherwise it is assigned activity ai = 1. The other afferent, j, is assigned activity aj = ai with probability p; otherwise it is assigned activity aj = 1 − ai . We take the time steps to be of unit size and ² = 0.018. First we characterize the general behavior of the model in the c-p plane. To this end we consider 21 different values of c uniformly distributed in the interval [0, 1], and similarly for p. For each pair of values of c and p, we solve equation 2.17 numerically 103 times using different random selections of initial conditions. We numerically integrate equation 2.17 rather than numerically solve the exact but implicit expressions in equations 2.30 and 2.31 for the CD case because it is computationally faster to do the former; for the NCD case, we are forced to employ numerical integration. Each solution consists of 106 individual time steps. For each point in the c-p plane, we define a segregation probability, S, which is the ratio of the number of times segregation is observed to the total number of times the differential equations are solved.
1966
T. Elliott and N. R. Shadbolt
In Figure 9 we show contours of S in the c-p plane. Figure 9a shows contours for the CD case, and Figure 9b shows contours for the NCD case. The “wobbles” in the contours are artifacts of poor resolution along both the c- and the p-axes and would also be smoothed by increasing the number of simulations at each point. However, increasing the resolution and the number of simulations significantly would make the generation of the contours computationally intractable. The shape of the contours in Figure 9 is in good qualitative agreement with those derived from the random walk approximation shown in Figure 8. Quantitatively, however, there is poor agreement, since the random walk approximation significantly overestimates the maximum number of steps below which segregation is not expected. This disagreement is not too surprising, since the random walk approximation assumes fixed-step sizes. Figure 9 indicates that the transition from segregation to nonsegregation is not abrupt but possesses a continuous profile, because the number of time steps permitted is not infinite. However, the transition is nevertheless very rapid. Typical examples of the evolution of the si , i = 1, 2, for three different values of c are shown in Figure 10, for the CD case, for uncorrelated activity (p = 0.5), obtained by numerical integration. In Figure 10a, for c = 0.8, we see segregation of the two afferents, so that si → 0 for one of them. In contrast, in Figure 10b, for c = 1.0, the “critical” value at which equation 2.21 becomes dramatically simplified and the fixed points flip from being attractors to being saddles or vice versa, the evolution seems to be chaotic, with afferents appearing to segregate, but then, after varying lengths of time, suddenly reversing roles, with one rapidly growing and the other rapidly decaying. Figure 10c, for c = 1.2, exhibits some of the chaotic behavior of Figure 10b, but the afferents never approach segregation. Figure 11 shows examples of the evolution for the NCD case, again for uncorrelated afferent activity. Here we consider two target cells, but show for clarity the evolution associated with only one target cell; the evolution associated with the other target cell is the reverse of that shown. This figure shows that the transition from segregation (Figure 11a, c = 0.92) to nonsegregation (Figure 11c, c = 0.98) is smooth, with intermediate, partially segregated states being present rather than the chaotic oscillations of the CD case (Figure 11b, c = 0.95). To rule out the possibility that the behavior exhibited in Figure 10b is due to a numerical instability associated with the numerical integration, we show, in Figure 12, the results obtained from the exact and explicit analytic solutions in equations 2.32 and 2.33. Figure 12 reveals qualitative similarities to Figure 10b, but there are quantitative differences. The chaotic oscillations are still present but are, in fact, enhanced. Thus, the numerical integration results, for c = 1 in the CD case, tend to overestimate the average oscillation period. However, this quantitative difference is not important. Numerical solution of the exact but implicit equations 2.30 and 2.31 for c = 0.8 and c = 1.2 reveals both qualitative and quantitative similarity to the numerical
A Model of Neurotrophic Interactions
A
1967
1 0.9 0.8
Probability
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
B
0
0.1
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
1
0
0.1
0.2
0.3
0.4
0.5 c
0.6
0.7
0.8
0.9
1
1 0.9 0.8
Probability
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
Figure 9: Contours of the segregation probability S in the c-p plane obtained by full numerical simulations of the (a) CD case and the (b) NCD case. From bottom left to top right in each case: S = 0.95, S = 0.75, S = 0.50, S = 0.25, and S = 0.05. Almost all points under the S = 0.95 contours segregate, while almost all points above the S = 0.05 contours do not segregate.
integration results. Thus, possible numerical instabilities associated with the numerical integration do not appear to be significant. Finally, although the fixed-point analysis of the afferent-activity-averaged equations for the NCD case results in the same fixed points with identical properties as for the CD case, Figures 9 to 11, while demonstrating much qualitative similarity between the CD and the NCD cases, exhibit some
1968
T. Elliott and N. R. Shadbolt
A Number of Synapses (Scaled)
1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
B
0
1e+04
2e+04 3e+04 Time Step
4e+04
5e+04
0
2e+06
4e+06 6e+06 Time Step
8e+06
1e+07
0
2e+06
4e+06 6e+06 Time Step
8e+06
1e+07
Number of Synapses (Scaled)
1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
C Number of Synapses (Scaled)
1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0
Figure 10: Three examples of the typical behavior of the CD case for p = 0.5, for different values of c: (a) c = 0.8, (b) c = 1.0, and (c) c = 1.2. The transition from segregation to nonsegregation is characterized by chaotic oscillations.
Figure 11: Three examples of the typical behavior of the NCD case for p = 0.5, for different values of c: (a) c = 0.92, (b) c = 0.95, and (c) c = 0.98. The transition from segregation to nonsegregation is smooth and continuous, with partially segregated states in between.
Figure 12: Analytical solutions for the CD case at c = 1. The chaotic oscillations are still present.
Finally, although the fixed-point analysis of the afferent-activity-averaged equations for the NCD case results in the same fixed points, with identical properties, as for the CD case, Figures 9 to 11, while demonstrating much qualitative similarity between the CD and the NCD cases, exhibit some quantitative differences. In the fixed-point analysis, the various ρ_i basically make no contribution at first order except to expand and rotate the coordinate system, which is why the NCD fixed points are identical in character to the CD fixed points. However, the dynamics associated with the ρ_i are important, since they ensure that, for example, two afferents will each win one of two target cells. In the CD case, this never happens: one afferent always wins both target cells, and the other always retracts all its terminals. In addition, to perform the analysis in the NCD case, we dropped the ā_i term from ρ_i. Numerically, fluctuations in this term will contribute and will serve to "smear" the transition point c = 1 over a range of values of c. Thus, while we expect qualitative similarity in terms of the parameter regime in which segregation is expected, we would not expect exact quantitative agreement.

3 Discussion

We have shown that a mathematical model of retrograde neurotrophic interactions leads to the segregation of afferents' arbors on the target space even for very strongly correlated afferent activity patterns. Segregation occurs provided that the level of NTF released in an activity-independent manner, or, equivalently, the level of NTF available by exogenous infusion, is not too high relative to the resting uptake of NTF by afferents and the maximum amount of NTF available by activity-dependent release.

A key assumption in our model is that the (recent average) level of NTF taken up by an afferent from part of the target structure determines the
number of synapses the afferent projects to that part of the target structure. Thus, we assume that a (local) decline in NTF levels results in a (local) atrophy of the afferent’s axonal arbor, and, similarly, a (local) increase in NTF levels results in a (local) growth of the afferent’s axonal arbor. Evidence exists that indicates that this assumption is not unreasonable. NTFs can promote axonal growth: retinotectal afferent arbors in Xenopus grow in response to increased levels of BDNF (Cohen-Cory & Fraser, 1995). Furthermore, localized atrophy or growth of axonal arbors in response to NGF shortage or supply, respectively, has been observed (Campenot, 1982a,b). A further key assumption in our model is that the uptake of NTFs depends in part on afferent activity. To our knowledge, no experimental evidence exists to suggest whether this is a plausible assumption. However, much evidence suggests that competitive interactions between afferents depend, at least in part, on the relative rather than the absolute levels of afferent activity. For example, monocular deprivation has a profound effect on LGN cells driven by the deprived eye, while binocular deprivation has little or no effect (Guillery & Stelzner, 1970; Guillery, 1972). Although it is possible that activity might affect the ability of afferents to utilize, rather than to take up, NTFs (Meyer-Franke, Kaplan, Pfrieger, & Barnes, 1995; discussed in Snider & Lichtman, 1996), it is unlikely that this could lead to differences depending on relative levels of activity, unless an enhanced capacity to utilize NTFs also enhances the rate of uptake. For example, if activity affects only the ability of afferents to utilize NTFs, then deprived-eye LGN cells should undergo atrophy whether or not the other eye is deprived. That is, activity-mediated utilization would appear to be noncompetitive. The status of our assumption regarding NTF uptake is therefore that of a tentative postulate, driven by theoretical considerations concerning the possible mechanisms of afferent competition. Analysis of our model demonstrates that activity-dependent uptake of NTF is the driving force behind competitive dynamics, since if the activity-independent component of uptake dominates the activity-dependent component (the limit a → ∞), then afferent segregation cannot occur. This should be regarded as a prediction: if it were possible pharmacologically to dissect out and remove the postulated activity-dependent component of afferent NTF uptake, so that only the activity-independent component remains, then we predict that in an otherwise normal preparation, afferent segregation would not occur. Another assumption is that of rapid diffusion of NTF through the target field. This assumption is not critical; it serves to make the equations and the resulting analysis more tractable. Slowing the rate of diffusion, allowing NTF to accumulate over time, and introducing some method of degrading NTF so that its levels do not grow unboundedly would not significantly alter the dynamics of our model, although the time taken to segregate would be increased. A model based on such an approach, using nitric oxide (NO) as a retrograde messenger, has been presented (Montague, Gally, & Edelman, 1991), and arguments concerning the theoretical advantages conferred
by spatial and temporal averaging have been advanced (Gally, Montague, Reeke, & Edelman, 1990; Edelman & Gally, 1992). However, inhibition of NO synthase in the visual system does not alter ocular dominance plasticity (Reid, Daw, Czepita, Flavin, & Sessa, 1996; Ruthazer, Gillespie, Dawson, Snyder, & Stryker, 1996), so the status of NO as a relevant retrograde messenger is unclear. In the NCD case, we assumed that the affinity of each terminal for NTFs is proportional to the time average of the afferent’s activity and inversely proportional to the total number of terminals supported by the afferent. The affinity of a terminal for NTFs was interpreted as proportional to the number of receptors for the NTFs possessed by the terminal. We therefore assumed that NTF receptors are regulated by afferent activity. Evidence indicates that kindling and seizures in the rat result in an elevation of the mRNAs for trkB and trkC, the receptors for BDNF, neurotrophin-3 (NT-3) and NT-4/5 (Bengzon et al., 1993; Dugich-Djordjevic et al., 1995; Salin et al., 1995), suggesting that afferent activity might regulate NTF receptors. Furthermore, depolarizing media directly influence NTF receptor expression in cell culture (Birren, Verdi, & Anderson, 1992; Cohen-Cory, Elliott, Dreyfus, & Black, 1993). The assumption that the affinity of a terminal is inversely proportional to the number of terminals amounts to the assumption that, for fixed mean activity, an afferent possesses a fixed number of NTF receptors that are redistributed around its axonal arbor in response to sprouting or retraction. Although no evidence bears directly on this assumption, some indirect evidence does suggest that it is not implausible. Recently it has been established that at the Drosophila neuromuscular junction, the expression of the cell adhesion molecule fasciclin II (Fas II) by motor neurons controls the size of motor neuronal arbors: mutants that express approximately 50% lower levels of Fas II than controls possess arbors with approximately twice as many synaptic boutons (Schuster, Davis, Fetter, & Goodman, 1996a,b). The synaptic efficacy of motor neurons in such mutants is unchanged, suggesting that the same levels of synaptic machinery as in controls are simply distributed over a larger number of boutons (Schuster et al., 1996b). However, if the mutation reducing Fas II by 50% is combined with one that expresses the cAMP response element-binding protein activator, then synaptic efficacy is also increased (Davis, Schuster, & Goodman, 1996). The fact that only mutants with reduced Fas II have motor neurons with the same synaptic efficacy as controls thus suggests that the assumption of a fixed number of NTF receptors, for fixed mean activity, is not implausible. In this model, we have considered only the possible role of NTFs in activity-dependent, competitive, anatomical changes. We have specifically ignored any possible physiological changes that might precede anatomical changes (Bailey & Kandel, 1993). However, evidence is accumulating that implicates NTFs not only in competitive, anatomical plasticity but also in physiological plasticity. For example, at the neuromuscular junction, BDNF
and NT-3 rapidly potentiate synaptic transmission (Lohof, Ip, & Poo, 1993), and in the rat visual cortex, BDNF blocks long-term depression (Akaneya, Tsumoto, & Hatanaka, 1996) and, in addition to NGF, enhances excitatory transmission (Carmignoto, Pizzorusso, Tia, & Vicini, 1997). In the hippocampus, BDNF and NT-3 elevate intracellular Ca2+ levels (Berninger, Garcia, Inagaki, Hahnel, & Lindholm, 1994), and NGF, BDNF, NT-3, and NT-4/5 all potentiate or modulate synaptic transmission (Knipper, Leung, Zhao, & Rylett, 1994; Lessmann, Gottman, & Heumann, 1994; Kang & Schuman, 1995). Furthermore, BDNF-knockout mice exhibit significantly impaired long-term potentiation (LTP) (Korte et al., 1995). This impairment can be reversed by a virus-mediated reexpression of the BDNF gene (Korte et al., 1996). These results, the last two especially (Korte et al., 1995, 1996), suggest that NTFs might, in addition to their neurotrophic properties, act as retrograde messengers during LTP, inducing rapid physiological presynaptic changes. The non-Hebbian character of LTP in the hippocampus and the rat visual cortex suggests that if NTFs do act as retrograde messengers during LTP, they diffuse widely through the target field and act on, or are taken up by, active afferents (Bonhoeffer, Staiger, & Aertsen, 1989; Kossel, Bonhoeffer, & Bolz, 1990). If NTF-mediated physiological plasticity is competitive in character, then NTFs might provide a natural bridge between physiological and anatomical plasticity. It would be of considerable interest to extend the present model in order to take account of these possibilities. In the CD case, we find a discontinuous transition from segregation to nonsegregation. Although such a discontinuity is also predicted for the NCD case, in fact it is not observed, because the ρi and the a¯ i play an important role in the dynamics and serve to smooth out the transition. A discontinuous transition, although a logical possibility, is unlikely to occur in real, biological systems. Segregation in both cases also occurs even in the presence of very highly correlated patterns of afferent activity. This is likely to be unrealistic. Preliminary indications are that these difficulties do not exist in full-scale simulations of large numbers of afferent and target neurons, which include the diffusion of NTF through the target field. However, mathematical analysis of the model, including NTF diffusion, is very much more complicated than the analysis considered here. Another difficulty is that the transition from segregation to nonsegregation in the CD case is chaotic, involving oscillations in the sizes of afferents’ arbors. While this is, again, a logical possibility, it is almost certainly biologically implausible. One way of overcoming this difficulty is to discretize the sxi . This is natural in a model of anatomical plasticity, in which the sxi represent the (possibly scaled) numbers of synapses between afferents and target cells. Such discretization thus has the effect of setting a limit below which the sxi are deemed to be zero. Once set to zero, they can never regrow. This would serve to eliminate the oscillatory behavior at c = 1. However, since the CD case permits cell death, it is of no interest for studying afferent segregation in later development. For this, the NCD case is required, and
this does not exhibit a chaotic transition from segregation to nonsegregation. It might be considered that the existence of specific classes of fixed points in our model (perfectly segregated or exactly balanced innervation in the two-afferent case) is biologically unrealistic, since in the striate cortex, for example, many different balances of control between the two eyes exist. However, it should be realized that our model indicates only that if competition and plasticity occur for indefinitely long, then afferents either will segregate perfectly or will achieve perfectly balanced innervation. The random walk approximation gives insight into how long segregation or the abolition of partial segregation might take. If these processes are long compared to the duration of the period in which the developing system is plastic, then the system will not have time to reach the fixed points. This would be reflected in the continued presence of neurons falling, for example, into many different ocular dominance classes. Furthermore, diffusion of NTF through the target field is likely to affect the dynamics of segregation. Indeed, preliminary results indicate that large-scale simulations of the formation of ocular-dominance columns can retain considerable binocular overlap at ocular-dominance column boundaries, even when the termination of the critical period for plasticity is not considered. One result for which the present model cannot account, is the fact that infusion of the γ -aminobutyric acid (GABAA ) receptor agonist muscimol into the striate cortex during development causes a paradoxical shift of ocular dominance toward the deprived eye (Reiter & Stryker, 1988; Hata & Stryker, 1994). Since we postulate that the uptake of NTFs is activity dependent, active-eye afferents should take up more NTFs than inactive-eye afferents (assuming comparable axonal arborisations), despite the fact that excess GABAergic stimulation via muscimol infusion would be expected to cause a marked decline in the levels of NTFs available from target cells (Zafra, Castren, Thoenen, & Lindholm, 1991). A plausible way of extending our model to overcome this difficulty is to assume that more active afferents require greater neurotrophic support than less active afferents (Elliott & Shadbolt, 1996). In this case, more active afferents receiving inadequate neurotrophic support for their level of activity and extent of axonal arborizations would be expected to undergo an atrophy of their axonal arborizations. Less active afferents, on the other hand, might receive sufficient neurotrophic support to maintain their axonal arborizations, or, at least, they might atrophy less quickly than those of more active afferents. In combination, the result would be that the more active afferents would retract, leaving the less active afferents in control. Evidence supports the view that electrical activity exerts a regressive influence on the size of axonal arborizations (Cohan & Kater, 1986; Sussdorf & Campenot, 1986; Lipton & Kater, 1989; Mattson & Kater, 1989; Fields, Neale, & Nelson, 1990). This account of the paradoxical shift of ocular dominance toward the deprived eye requires only that the postsynaptic cortical neurons play a permissive role in synaptic plasticity, by regulating the production and re-
lease of NTFs according to the balance of excitatory and inhibitory synaptic input (Zafra et al., 1991). An alternative view is that cortical neurons might play an instructive role in the plasticity of geniculocortical afferents. For example, the disparity in activity between the undeprived-eye afferents and the muscimol-silenced cortical neurons might lead to a long-term depression of the active, but not the inactive, geniculocortical synapses via mechanisms involving the N-methyl-D-aspartate receptor (discussed in Shatz, 1997). However, neither this account nor our own can explain the paradoxical shifts in ocular dominance that occur following cortical BDNF infusion in kittens (Galuske, Kim, Castren, Thoenen, & Singer, 1996) or cortical NGF infusion in adult cats (Gu, Liu, & Cynader, 1994). Although it is possible that no single unitary mechanism exists to account for all these paradoxical results, it is also possible that both muscimol and NTF infusion might interfere with inhibitory circuitry and cause paradoxical shifts (discussed in Thoenen, 1995). Recent data indicate that NTFs affect not only axonal arborizations, but also dendritic arborizations (McAllister, Katz, & Lo, 1995, 1996, 1997; see also Snider, 1988). However, BDNF and NT-3 exert opposing influences on dendritic development, with these influences reversing in different cortical layers (McAllister et al., 1997). In contrast, while BDNF and NT-4/5 do affect ocular dominance column development, NGF and NT-3 do not (Cabelli et al., 1995, 1997). It therefore seems premature to attempt to construct a model of the action of any particular NTF in the striate cortex. However, this does not preclude the possibility of theoretically exploring the general implications of neurotrophic interactions for development, plasticity, and competition, which is what we have attempted to do here. An interesting neurotrophic model of the elimination of polyneuronal innervation at the vertebrate neuromuscular junction has been presented (Bennett & Robinson, 1989). It is similar to our model inasmuch as it considers the role of neurotrophic molecules and their receptors in development. However, the model does not consider the role of afferent activity in the process of synaptic rearrangement at the neuromuscular junction. Recently a neurotrophic model of the development of ocular dominance columns has been presented (Harris, Ermentrout, & Small, 1997). Unfortunately, this model appears to suffer from a number of difficulties. First, it disregards anatomical plasticity, yet the evidence for the breakdown of segregation of LGN afferents following exogenous infusion of NTFs is anatomical, not physiological (Cabelli et al., 1995). Also, the model assumes a constant pool of available NTF, so that NTF production and release is not regulated by neuronal activity. Finally, the model makes the rather implausible assumption that the uptake of NTF by afferents depends on the afferents’ synaptic efficacy. We have also presented a rather simpler model of neurotrophic interactions than the one presented here (Elliott & Shadbolt, 1996). There are, however, a number of difficulties with our earlier model. First, hand set-
ting was required to prevent uncontrolled growth or uncontrolled decay of afferents. The present model dynamically controls growth and decay. Second, segregation of afferents occurred even when afferent activity was perfectly correlated. This indicates that the postulated mechanisms underlying plasticity in that model are likely to be wrong. In the present model, segregation does not occur in the presence of perfectly correlated patterns of afferent activity. With such patterns, afferents grow in such a way that the ratio of their numbers of synapses remains fixed. Last, growth in the absence of afferent activity was not possible in our earlier model. However, much evidence suggests that afferent growth is possible even when afferent activity is silenced with tetrodotoxin (e.g., Brown & Ironton, 1977; Sretavan, Shatz, & Stryker, 1988; Antonini & Stryker, 1993). Afferent inactivity in the present model does not prevent growth. We thus regard the present model as superior to our earlier model. Future work on our latest model will involve its application to full-scale simulations of the development of ocular dominance columns and the possible inclusion of physiological plasticity.
Acknowledgments

T. E. thanks the Royal Society for the support of a Royal Society University Research Fellowship.

References

Akaneya, Y., Tsumoto, T., & Hatanaka, H. (1996). Brain-derived neurotrophic factor blocks long-term depression in the rat visual cortex. J. Neurophysiol., 76, 4198–4201.
Antonini, A., & Stryker, M. P. (1993). Development of individual geniculocortical arbors in cat striate cortex and effects of binocular impulse blockade. J. Neurosci., 13, 3549–3573.
Bailey, C. H., & Kandel, E. R. (1993). Structural changes accompanying memory storage. Annu. Rev. Physiol., 55, 397–426.
Bengzon, J., Kokaia, Z., Ernfors, P., Kokaia, M., Leanza, G., Nilsson, O. G., Persson, H., & Lindvall, O. (1993). Regulation of neurotrophin and trkA, trkB and trkC tyrosine kinase receptor mRNA expression in kindling. Neuroscience, 53, 433–446.
Bennett, M. R., & Robinson, J. (1989). Growth and elimination of nerve terminals at synaptic sites during polyneuronal innervation of muscle cells: A trophic hypothesis. Proc. R. Soc. Lond. Ser. B, 235, 299–320.
Berardi, N., Domenici, L., Parisi, V., Pizzorusso, T., Cellerino, A., & Maffei, L. (1993). Monocular deprivation effects in the rat visual cortex and lateral geniculate nucleus are prevented by nerve growth factor (NGF). I. Visual cortex. Proc. R. Soc. Lond. Ser. B, 251, 17–23.
Berninger, B., Garcia, D. E., Inagaki, N., Hahnel, C., & Lindholm, D. (1994). BDNF and NT-3 induce intracellular Ca2+ elevation in hippocampal neurones. NeuroReport, 4, 1303–1306.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Birren, S. J., Verdi, J. M., & Anderson, D. J. (1992). Membrane depolarisation induces p140trk and NGF responsiveness, but not p75LNGFR, in MAH cells. Science, 257, 395–397.
Blöchl, A., & Thoenen, H. (1995). Characterization of nerve growth factor (NGF) release from hippocampal neurons: Evidence for a constitutive and an unconventional sodium-dependent regulated pathway. Eur. J. Neurosci., 7, 1220–1228.
Blöchl, A., & Thoenen, H. (1996). Localization of cellular storage compartments and sites of constitutive and activity-dependent release of nerve growth factor (NGF) in primary cultures of hippocampal neurons. Mol. Cell. Neurosci., 7, 173–190.
Bonhoeffer, T., Staiger, V., & Aertsen, A. (1989). Synaptic plasticity in rat hippocampal slice cultures: Local "Hebbian" conjunction of pre- and postsynaptic stimulation leads to distributed synaptic enhancement. Proc. Natl. Acad. Sci. U.S.A., 86, 8113–8117.
Bozzi, Y., Pizzorusso, T., Cremisi, F., Rossi, F. M., Barsacchi, G., & Maffei, L. (1995). Monocular deprivation decreases the expression of messenger RNA for brain-derived neurotrophic factor in the rat visual cortex. Neuroscience, 69, 1133–1144.
Brown, M. C., & Ironton, R. (1977). Motor neurone sprouting induced by prolonged tetrodotoxin block of nerve action potentials. Nature, 265, 459–461.
Cabelli, R. J., Hohn, A., & Shatz, C. J. (1995). Inhibition of ocular dominance column formation by infusion of NT-4/5 or BDNF. Science, 267, 1662–1666.
Cabelli, R. J., Shelton, D. L., Segal, R. A., & Shatz, C. J. (1997). Blockade of endogenous ligands of trkB inhibits formation of ocular dominance columns. Neuron, 19, 63–76.
Campenot, R. B. (1982a). Development of sympathetic neurons in compartmentalized cultures. I. Local control of neurite outgrowth by nerve growth factor. Dev. Biol., 93, 1–12.
Campenot, R. B. (1982b). Development of sympathetic neurons in compartmentalized cultures. II. Local control of neurite survival by nerve growth factor. Dev. Biol., 93, 13–22.
Carmignoto, G., Canella, R., Candeo, P., Comelli, M. C., & Maffei, L. (1993). Effects of nerve growth factor on neuronal plasticity of the kitten visual cortex. J. Physiol., 464, 343–360.
Carmignoto, G., Pizzorusso, T., Tia, S., & Vicini, S. (1997). Brain-derived neurotrophic factor and nerve growth factor potentiate excitatory synaptic transmission in the rat visual cortex. J. Physiol., 498, 153–164.
Caroni, P., & Schneider, C. (1994). Signaling by insulin-like growth factors in paralyzed skeletal muscle: Rapid induction of IGF1 expression in muscle fibers and prevention of interstitial cell proliferation by IGF-PBS and IGFBP4. J. Neurosci., 14, 3378–3388.
Castren, E., Zafra, F., Thoenen, H., & Lindholm, D. (1992). Light regulates expression of brain-derived neurotrophic factor mRNA in rat visual cortex. Proc. Natl. Acad. Sci. U.S.A., 89, 9444–9448.
Cohan, C. S., & Kater, S. B. (1986). Suppression of neurite elongation and growth cone motility by electrical activity. Science, 232, 1638–1640.
Cohen-Cory, S., Elliott, R. C., Dreyfus, C. F., & Black, I. B. (1993). Depolarizing influences increase low affinity NGF receptor gene expression in cultured Purkinje neurons. Exp. Neurol., 119, 165–173.
Cohen-Cory, S., & Fraser, S. E. (1995). Effects of brain-derived neurotrophic factor on optic axon branching and remodelling in vivo. Nature, 378, 192–196.
Davis, G. W., Schuster, C. M., & Goodman, C. S. (1996). Genetic dissection of structural and functional components of synaptic plasticity. III. CREB is necessary for presynaptic functional plasticity. Neuron, 17, 669–679.
Domenici, L., Cellerino, A., & Maffei, L. (1993). Monocular deprivation effects in the rat visual cortex and lateral geniculate nucleus are prevented by nerve growth factor (NGF). II. Lateral geniculate nucleus. Proc. R. Soc. Lond. Ser. B, 251, 25–31.
Dugich-Djordjevic, M. M., Ohsawa, F., Okazaki, T., Mori, N., Day, J. R., Beck, K. D., & Hefti, F. (1995). Differential regulation of catalytic and noncatalytic trkB mRNAs in the rat hippocampus following seizures induced by systemic administration of kainate. Neuroscience, 66, 861–877.
Edelman, G. M., & Gally, J. A. (1992). Nitric oxide: Linking space and time in the brain. Proc. Natl. Acad. Sci. U.S.A., 89, 11651–11652.
Elliott, T., & Shadbolt, N. R. (1996). A mathematical model of activity-dependent, anatomical segregation induced by competition for neurotrophic support. Biol. Cybern., 75, 463–470.
English, A. W., & Schwartz, G. (1995). Both basic fibroblast growth factor and ciliary neurotrophic factor promote the retention of polyneuronal innervation of developing skeletal muscle fibers. Dev. Biol., 169, 57–64.
Fields, R. D., Neale, E. A., & Nelson, P. G. (1990). Effects of patterned electrical activity on neurite outgrowth from mouse neurons. J. Neurosci., 10, 2950–2964.
Funakoshi, H., Belluardo, N., Arenas, E., Yamamoto, Y., Casabona, A., Persson, H., & Ibanez, C. (1995). Muscle-derived neurotrophin-4 as an activity-dependent trophic signal for adult motor neurons. Science, 268, 1495–1499.
Gally, J. A., Montague, P. R., Reeke, G. N., & Edelman, G. M. (1990). The NO hypothesis: Possible effects of a short-lived, rapidly diffusible signal in the development and function of the nervous system. Proc. Natl. Acad. Sci. U.S.A., 87, 3547–3551.
Galuske, R. A. W., Kim, D.-S., Castren, E., Thoenen, H., & Singer, W. (1996). Brain-derived neurotrophic factor reverses experience-dependent synaptic modifications in kitten visual cortex. Eur. J. Neurosci., 8, 1554–1559.
Goodman, L. J., Valverde, J., Lim, F., Geschwind, M. D., Federoff, H. J., Geller, A. I., & Hefti, F. (1996). Regulated release and polarized localization of brain-derived neurotrophic factor in hippocampal neurons. Mol. Cell. Neurosci., 7, 222–238.
Griesbeck, O., Blöchl, A., Carnahan, J. F., Nawa, H., & Thoenen, H. (1995). Characterization of brain-derived neurotrophic factor (BDNF) secretion from hippocampal neurons. Soc. Neurosci. Abstr., 21, 1046.
Gu, Q., Liu, Y., & Cynader, M. S. (1994). Nerve growth factor–induced ocular dominance plasticity in adult cat visual cortex. Proc. Natl. Acad. Sci. U.S.A., 91, 8408–8412.
Guillery, R. W. (1972). Binocular competition in the control of geniculate cell growth. J. Comp. Neurol., 44, 117–130.
Guillery, R. W., & Stelzner, D. J. (1970). The differential effects of unilateral lid closure upon the monocular and binocular segments of the dorsal lateral geniculate nucleus in the cat. J. Comp. Neurol., 139, 413–422.
Harris, A. E., Ermentrout, G. B., & Small, S. L. (1997). A model of ocular dominance column development by competition for trophic factor. Proc. Natl. Acad. Sci. U.S.A., 94, 9944–9949.
Hata, Y., & Stryker, M. P. (1994). Control of thalamocortical afferent rearrangement by postsynaptic activity in developing visual cortex. Science, 265, 1732–1735.
Kang, H., & Schuman, E. M. (1995). Long-lasting neurotrophin-induced enhancement of synaptic transmission in the adult hippocampus. Science, 267, 1658–1662.
Knipper, M., Leung, L. S., Zhao, D., & Rylett, R. J. (1994). Short-term modulation of glutamatergic synapses in adult rat hippocampus by NGF. NeuroReport, 5, 2433–2436.
Koliatsos, V. E., Clatterbuck, R. E., Winslow, J. W., Cayouette, M. H., & Price, D. L. (1993). Evidence that brain-derived neurotrophic factor is a trophic factor for motor neurons in vivo. Neuron, 10, 359–367.
Korte, M., Carroll, P., Wolf, E., Brem, G., Thoenen, H., & Bonhoeffer, T. (1995). Hippocampal long-term potentiation is impaired in mice lacking brain-derived neurotrophic factor. Proc. Natl. Acad. Sci. U.S.A., 92, 8856–8860.
Korte, M., Griesbeck, O., Gravel, C., Carroll, P., Staiger, V., Thoenen, H., & Bonhoeffer, T. (1996). Virus-mediated gene transfer into hippocampal CA1 region restores long-term potentiation in brain-derived neurotrophic factor mutant mice. Proc. Natl. Acad. Sci. U.S.A., 93, 12547–12552.
Kossel, A., Bonhoeffer, T., & Bolz, J. (1990). Non-Hebbian synapses in rat visual cortex. NeuroReport, 1, 115–118.
Kwon, Y. W., & Gurney, M. E. (1996). Brain-derived neurotrophic factor transiently stabilizes silent synapses on developing neuromuscular junctions. J. Neurobiol., 29, 503–516.
Lessmann, V., Gottmann, K., & Heumann, R. (1994). BDNF and NT-4/5 enhance glutamatergic synaptic transmission in cultured hippocampal neurones. NeuroReport, 6, 21–25.
Lipton, S. A., & Kater, S. B. (1989). Neurotransmitter regulation of neuronal outgrowth, plasticity, and survival. Trends Neurosci., 12, 265–270.
Lohof, A. M., Ip, N. Y., & Poo, M. M. (1993). Potentiation of developing neuromuscular synapses by the neurotrophins NT-3 and BDNF. Nature, 363, 350–353.
Maffei, L., Berardi, N., Domenici, L., Parisi, V., & Pizzorusso, T. (1992). Nerve growth factor (NGF) prevents the shift in ocular dominance distribution of visual cortical neurons in monocularly deprived rats. J. Neurosci., 12, 4651–4662.
Mattson, M. P., & Kater, S. B. (1989). Excitatory and inhibitory neurotransmitters in the generation and degeneration of hippocampal neuroarchitecture. Brain Res., 478, 337–348.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1996). Neurotrophin regulation of cortical dendritic growth requires activity. Neuron, 17, 1057–1064.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1997). Opposing roles for endogenous BDNF and NT-3 in regulating cortical dendritic growth. Neuron, 18, 767–778.
McAllister, A. K., Lo, D. C., & Katz, L. C. (1995). Neurotrophins regulate dendritic growth in developing visual cortex. Neuron, 15, 791–803.
Meyer-Franke, A., Kaplan, M. R., Pfrieger, F. W., & Barnes, B. A. (1995). Characterization of the signaling interactions that promote the survival and growth of developing retinal ganglion cells in culture. Neuron, 15, 805–819.
Montague, P. R., Gally, J. A., & Edelman, G. M. (1991). Spatial signaling in the development and function of neural connections. Cereb. Cortex, 1, 199–220.
Purves, D. (1988). Body and brain: A trophic theory of neural connections. Cambridge, MA: Harvard University Press.
Reid, S. N. M., Daw, N. W., Czepita, D., Flavin, H. J., & Sessa, W. C. (1996). Inhibition of nitric oxide synthase does not alter ocular dominance shifts in kitten visual cortex. J. Physiol., 494, 511–517.
Reiter, H. O., & Stryker, M. P. (1988). Neural plasticity without postsynaptic action potentials: Less-active inputs become dominant when kitten visual cortical cells are pharmacologically inhibited. Proc. Natl. Acad. Sci. U.S.A., 85, 3623–3627.
Reiter, H. O., Waitzman, D. M., & Stryker, M. P. (1986). Cortical activity blockade prevents ocular dominance plasticity in the kitten visual cortex. Exp. Brain Res., 65, 182–188.
Riddle, D. R., Lo, D. C., & Katz, L. C. (1995). NT-4-mediated rescue of lateral geniculate neurons from effects of monocular deprivation. Nature, 378, 189–191.
Ruthazer, E. S., Gillespie, D. C., Dawson, E. M., Snyder, S. H., & Stryker, M. P. (1996). Inhibition of nitric oxide synthase does not prevent ocular dominance plasticity in kitten visual cortex. J. Physiol., 494, 519–527.
Salin, T., Mudo, G., Jiang, X. H., Timmusk, T., Metsis, M., & Belluardo, N. (1995). Up-regulation of trkB mRNA expression in the rat striatum after seizures. Neurosci. Lett., 194, 181–184.
Schoups, A. A., Elliott, R. C., Friedman, W. J., & Black, I. B. (1995). NGF and BDNF are differentially modulated by visual experience in the developing geniculocortical pathway. Dev. Brain Res., 86, 326–334.
Schuster, C. M., Davis, G. W., Fetter, R. D., & Goodman, C. S. (1996a). Genetic dissection of structural and functional components of synaptic plasticity. I. Fasciclin II controls synaptic stabilization and growth. Neuron, 17, 641–654.
Schuster, C. M., Davis, G. W., Fetter, R. D., & Goodman, C. S. (1996b). Genetic dissection of structural and functional components of synaptic plasticity. II. Fasciclin II controls presynaptic structural plasticity. Neuron, 17, 655–667.
Shatz, C. J. (1997). Neurotrophins and visual system plasticity. In W. M. Cowan, T. M. Jessell, & S. L. Zipursky (Eds.), Molecular and cellular approaches to neural development (p. 509). New York: Oxford University Press.
Snider, W. D. (1988). Nerve growth factor enhances dendritic arborization of sympathetic ganglion cells in developing mammals. J. Neurosci., 8, 2628–2634.
Snider, W. D., & Lichtman, J. W. (1996). Are neurotrophins synaptotrophins? Mol. Cell. Neurosci., 7, 433–442.
Sretavan, D. W., Shatz, C. J., & Stryker, M. P. (1988). Modification of retinal ganglion cell morphology by prenatal infusion of tetrodotoxin. Nature, 336, 468–471.
Srihari, T., & Vrbova, G. (1978). The role of muscle activity in the differentiation of neuromuscular junctions in slow and fast chick muscles. J. Neurocytol., 7, 529–540.
Stryker, M. P., & Harris, W. (1986). Binocular impulse blockade prevents the formation of ocular dominance columns in cat visual cortex. J. Neurosci., 6, 2117–2133.
Sussdorf, W. S., & Campenot, R. B. (1986). Influence of the extracellular potassium environment on neurite growth in sensory neurons, spinal cord neurons and sympathetic neurons. Dev. Brain Res., 25, 43–52.
Thoenen, H. (1995). Neurotrophins and neuronal plasticity. Science, 270, 593–598.
Thompson, W. J., Kuffler, D. P., & Jansen, J. K. S. (1979). The effect of prolonged, reversible block of nerve impulses on the elimination of polyneuronal innervation of new-born rat skeletal muscle fibers. Neuroscience, 4, 271–281.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Yan, H. Q., Mazow, M. L., & Dafny, N. (1996). NGF prevents the changes induced by monocular deprivation during the critical period in rats. Brain Res., 706, 318–322.
Zafra, F., Castren, E., Thoenen, H., & Lindholm, D. (1991). Interplay between glutamate and γ-aminobutyric acid transmitter systems in the physiological regulation of brain-derived neurotrophic factor and nerve growth factor synthesis in hippocampal neurons. Proc. Natl. Acad. Sci. U.S.A., 88, 10037–10041.

Received October 3, 1997; accepted February 20, 1998.
NOTE
Communicated by Garrison Cottrell
Why Does the Somatosensory Homunculus Have Hands Next to Face and Feet Next to Genitals? A Hypothesis

Martha J. Farah
Department of Psychology, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
The somatotopic mapping of the body surface onto its cortical representation is a paradigm case of topographic mapping in the nervous system. It has functioned as the model case for physiological studies of plasticity in cortical maps (Kaas, 1991; Ramachandran, Stewart, & Rogers-Ramachandran, 1992a, 1992b) and computational studies of self-organization (Pearson, Finkel, & Edelman, 1987; Merzenich, 1987; Willshaw & von der Malsburg, 1976). However, the map is not perfect. There are two major discontinuities: the hands and face of the homunculus are adjacent to one another, as are the genitals and feet (Penfield & Rasmussen, 1950; Ramachandran, Stewart, & Rogers-Ramachandran, 1992b). Why is this? I have been unable to find any explanation of this curious feature of somatosensory cortical organization in the literature. Here I point out that mechanisms of self-organization, in combination with the normal position of the fetus in the womb, will incline the map toward just this organization.

In self-organizing systems, adjacent neurons come to represent coincident sources of activation, and the somatotopic map results from the greater likelihood of coincident stimulation to adjacent body surfaces than to nonadjacent ones. The role of prenatal self-organization was demonstrated in a recent experiment by Florence et al. (1996). They cut the median nerve to the hand in fetal monkeys and demonstrated that after repair, the pattern of nerve termination was grossly disordered. Nevertheless, microelectrode mapping of the primary somatosensory cortex revealed a normal, orderly topography. The surprising implication of this finding is that cortical somatotopy does not reflect the organization of peripheral inputs; that is, the cortical homunculus is not simply the result of more peripheral somatotopic organizations being conserved through orderly ascending inputs. Rather, the statistics of prenatal sensory experience are the predominant factor in the genesis of the somatosensory homunculus.

One way in which prenatal sensory experience differs markedly from postnatal experience is in the frequency with which hands touch face and feet touch genitals, and this can explain the discontinuities in the homunculus. The flexed posture of the limbs in the fetal position places hands near face and feet near genitals, as shown in Figure 1. Normal fetal movement
Figure 1: Typical positions of the fetus, shown at three gestational ages.
thus results in the most frequent coincident stimulations being between hands and face, and between feet and genitals. By the mechanisms of self-organization already mentioned, this will drive together the representations of these pairs of body parts. Note that the touching of hands and feet with other body parts would not alter this organization unless some other pairing were as frequent as or more frequent than hands-face or feet-genitals.

Why do the discontinuities persist postnatally? Computationally, self-organized maps tend to retain their overall topology once they have become organized, with only local shifting of boundaries possible (Pearson et al., 1987; Willshaw & von der Malsburg, 1976). Physiologically, the ability to
self-organize independent of orderly sensory afferents is lost in postnatal development, again permitting only relatively local remapping (Florence et al., 1996).

Acknowledgments

The writing of this Note was supported by NIH grants R01-NS34030, R01-AG14082, and K02-AG0056. I thank Gary Cottrell, Leif Finkel, Sherre Florence, Jon Kaas, Matthew Kurbat, V. S. Ramachandran, and an anonymous reviewer for helpful comments on the idea presented here, and little Theodora Farah for suggesting the idea.

References

Florence, S. L., Jain, N., Pospichel, M. W., Beck, P. D., Sly, D. L., & Kaas, J. H. (1996). Central reorganization of sensory pathways following peripheral nerve regeneration in fetal monkeys. Nature, 381, 69–71.
Kaas, J. H. (1991). Plasticity of sensory and motor maps in adult mammals. Annual Review of Neuroscience, 14, 137–167.
Merzenich, M. M. (1987). Dynamic neocortical processes and the origins of higher brain functions. In J.-P. Changeux & M. Konishi (Eds.), The neural and molecular bases of learning. New York: Wiley.
Pearson, J. C., Finkel, L. H., & Edelman, G. M. (1987). Plasticity in the organization of adult cerebral cortical maps: A computer simulation based on neuronal group selection. Journal of Neuroscience, 7, 4209–4223.
Penfield, W., & Rasmussen, T. (1950). The cerebral cortex of man: A clinical study of localization of function. New York: Macmillan.
Ramachandran, V. S., Stewart, M., & Rogers-Ramachandran, D. C. (1992a). Perceptual correlates of massive cortical reorganization. Science, 258, 1159–1160.
Ramachandran, V. S., Stewart, M., & Rogers-Ramachandran, D. C. (1992b). Perceptual correlates of massive cortical reorganization. NeuroReport, 3, 583–586.
Willshaw, D. J., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proceedings of the Royal Society of London, Series B, 194, 431–445.

Received August 22, 1997; accepted April 16, 1998.
LETTER
Communicated by Peter König
Extracting Oscillations: Neuronal Coincidence Detection with Noisy Periodic Spike Input Richard Kempter ¨ ¨ Physik-Department der TU Munchen, D-85747 Garching bei Munchen, Germany
Wulfram Gerstner Swiss Federal Institute of Technology, Center of Neuromimetic Systems, EPFL-DI, CH-1015 Lausanne, Switzerland
J. Leo van Hemmen Physik-Department der TU München, D-85747 Garching bei München, Germany
Hermann Wagner Institut für Biologie II, RWTH Aachen, D-52074 Aachen, Germany
How does a neuron vary its mean output firing rate if the input changes from random to oscillatory coherent but noisy activity? What are the critical parameters of the neuronal dynamics and input statistics? To answer these questions, we investigate the coincidence-detection properties of an integrate-and-fire neuron. We derive an expression indicating how coincidence detection depends on neuronal parameters. Specifically, we show how coincidence detection depends on the shape of the postsynaptic response function, the number of synapses, and the input statistics, and we demonstrate that there is an optimal threshold. Our considerations can be used to predict from neuronal parameters whether and to what extent a neuron can act as a coincidence detector and thus can convert a temporal code into a rate code.
1 Introduction

Synchronized or coherent oscillatory activity of a population of neurons is thought to be a vital feature of temporal coding in the brain. Oscillations have been observed in the visual cortex (Eckhorn et al., 1988; Gray & Singer, 1989), the sensorimotor cortex (Murthy & Fetz, 1992), the hippocampus (Burgess, Recce, & O'Keefe, 1994), and the olfactory system (Freeman, 1975; Davis & Eichenbaum, 1991; Wehr & Laurent, 1996). Coherent firing of neurons might be used for solving the problems of feature linking and pattern segmentation (von der Malsburg & Schneider, 1986; Eckhorn et al.,
1988; Wang, Buhmann, & von der Malsburg, 1990; Ritz, Gerstner, Fuentes, & van Hemmen, 1994; Ritz, Gerstner, & van Hemmen, 1994) and could also support attentional mechanisms (Murthy & Fetz, 1992). Another prominent example where coherent or phase-locked activity of neurons is known to be important is early auditory processing in mammals, birds, and reptiles (Carr, 1992, 1993). Spikes are found to be phase locked to external acoustic stimuli with frequencies up to 8 kHz in the barn owl (Köppl, 1997). In the barn owl and various other animals, the relative timing of spikes is used to transmit information about the azimuthal position of a sound source. In performing this task, the degree of synchrony of two groups of neurons is read out and transformed into a firing rate pattern, which can then be used for further processing and to control motor units. The essential step of translating a temporal code into a rate code is performed by neurons that work as coincidence detectors.

Similarly, if neural coding in the cortex is based on oscillatory activity, then oscillations should lead to behavioral actions. Motor output requires a mean level of activity in motor efferents on a time scale of the order of, say, a hundred milliseconds. Somewhere in the brain, then, there must be a neural "unit" that transforms the temporally coded oscillatory activity into a rate-coded mean activity that is suitable for motor output. We do not want to speculate here about what this "unit" looks like. It might be composed of an array of neurons, but it is also possible that single neurons perform this transformation. Here we focus on the question of whether the task of transforming a spike code into a rate code can be done by a single neuron.

The issue of how neurons read out the temporal structure of the input and how they transform this structure into a firing rate pattern has been addressed by several authors and is attracting an increasing amount of interest. König, Engel, & Singer (1996) have argued that the main prerequisite for coincidence detectors is that the mean interspike interval is long compared to the integration time that neurons need to sum synaptic potentials effectively. The importance of the effective (membrane) time constant of neurons has also been emphasized by Softky (1994). In addition, Abeles (1982) has shown that the value of the spike threshold and the number of synapses are relevant parameters as well. Some general principles have thus been outlined, but a mathematical derivation of the conditions under which neurons can act as coincidence detectors is still not available.

In this article, we substantiate the statements made above and show explicitly the dependence of the precision of neuronal coincidence detection on the shape of the postsynaptic potential, the input statistics, and the voltage threshold at which an action potential is generated. Specifically, we tackle the question of whether and to what extent a neuron that receives periodically modulated input can detect the degree of synchrony and convert a time-coded signal into a rate-coded one.
2 Methods

This section specifies the input and briefly reviews the neuron model.

2.1 Temporal Coding of the Input. We consider a single neuron that receives stochastic input spike trains through N independent channels. Input spikes are generated stochastically and arrive at a synapse with a time-dependent T-periodic rate λ^in(t) = λ^in(t + T) ≥ 0. The probability of having a spike in the interval [t, t + Δt) is λ^in(t)Δt as Δt → 0. In this way we obtain a nonstationary or inhomogeneous Poisson process (cf., e.g., Tuckwell, 1988, Sec. 10.8) where input spikes are more or less phase locked to a T-periodic stimulus. According to the definition of Theunissen and Miller (1995), this kind of input is a temporal code. The average number of spikes that arrive during one period at a synapse will be called p. The time-averaged mean input rate is λ̄^in := (1/T) ∫_{t_0}^{t_0+T} dt′ λ^in(t′) and equals p/T. To parameterize the input, we take the function

$$\lambda^{\mathrm{in}}(t) := p \sum_{m=-\infty}^{\infty} G_\sigma(t - mT), \qquad (2.1)$$

where G_σ(.) denotes a normalized gaussian distribution with zero mean and standard deviation σ > 0. In Figure 1 we present a few examples of spike trains generated by the time-dependent rate in equation 2.1. We assume that the neuron under consideration receives input from N ≫ 1 presynaptic terminals. At each input terminal, spikes arrive independently of the other terminals and with a probability density given by equation 2.1. We note that equation 2.1 is an idealization of biological spike trains because there is no refractoriness.

The degree of synchrony of the input is parameterized by the jitter σ ∈ [0, ∞), the standard deviation of the gaussian distribution. In the case σ = 0, the input spikes arrive perfectly phase locked and occur only at the times t_m = mT with integer m, and the number of spikes arriving at time t_m has a Poissonian distribution with parameter p. Instead of σ, we often consider another measure of synchrony, the so-called vector strength r^in (Goldberg & Brown, 1969). This measure of synchrony can be defined as the amplitude of the first Fourier component of the periodic rate in equation 2.1 divided by the Fourier component of order zero. For the input (see equation 2.1) we find

$$r^{\mathrm{in}} = \exp\left[-\frac{1}{2}\left(\frac{2\pi}{T}\right)^{2}\sigma^{2}\right]. \qquad (2.2)$$

By construction, we have 0 ≤ r^in ≤ 1.
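To make the input model concrete, the following sketch (our illustration, not the authors' code) samples spike trains from equation 2.1 by drawing, for each period, a Poisson(p) number of spikes and jittering each around mT by a zero-mean gaussian of width σ; it then checks the measured vector strength against equation 2.2.

```python
import numpy as np

def phase_locked_spikes(p=1.0, sigma=0.25, T=1.0, n_periods=10000, rng=None):
    """Sample from the inhomogeneous Poisson process of equation 2.1:
    per period m, a Poisson(p) number of spikes, each jittered around
    mT by a zero-mean gaussian of width sigma."""
    rng = rng or np.random.default_rng(0)
    counts = rng.poisson(p, size=n_periods)
    centers = np.repeat(np.arange(n_periods) * T, counts)
    return centers + rng.normal(0.0, sigma, size=centers.size)

def vector_strength(spikes, T=1.0):
    """First Fourier component of the phase distribution, normalized by
    the zeroth component (Goldberg & Brown, 1969)."""
    phases = 2.0 * np.pi * (spikes % T) / T
    return np.abs(np.exp(1j * phases).mean())

if __name__ == "__main__":
    T, sigma = 1.0, 0.25
    spikes = phase_locked_spikes(sigma=sigma, T=T)
    print("measured r_in :", vector_strength(spikes, T))
    print("equation 2.2  :", np.exp(-0.5 * (2 * np.pi * sigma / T) ** 2))
```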
Figure 1: Input spikes are phase locked with period T (horizontal bar) and temporal jitter σ . The upper part of the figure shows the time-dependent rate λin (t) (solid line) given in equation 2.1 of a single input channel with parameters p = 1 and σ = T/4, the width of the underlying gaussian (dashed line). In the lower part of the figure, we present four samples of spike trains (vertical bars) generated by a Poisson process with the time-dependent rate specified above. The times t = mT with integer m where the rate λin (t) has its maxima are indicated by vertical dotted lines.
Many neuron models start from a gain function where the mean output firing rate increases with increasing mean input rate. This is certainly correct for most biological neurons. For integrate-and-fire (I&F) neurons, this property has been studied by Stemmler (1996). In this article, we go at least one step further. We vary the input under the constraint of a constant mean input rate, λ̄^in = const. The assumption of a constant mean input is not a limitation of our approach but a simplification we make here in order to illustrate our main point. We want to show that even with a constant mean input rate, the mean output rate λ̄^out varies as a function of the temporal structure of the input, parameterized, for example, by r^in. In other words, the neuron can "read out" a temporal code. This property is essential for coincidence detection.
2.2 Neuron Model and Spike Processing. We describe our neuron as an I&F unit with membrane potential u. The neuron fires if u(t) approaches the threshold ϑ from below. This defines a firing time t^n with integer n. After an output spike, which need not be described explicitly, the membrane potential is reset to 0. Between two firing events, the membrane voltage changes according to the linear differential equation,

$$\frac{d}{dt}u(t) = -\frac{1}{\tau_m}u(t) + i(t), \qquad (2.3)$$

where i is the total input current and τ_m > 0 the membrane time constant. The input is due to presynaptic activity. The spike arrival times at a given synapse j are labeled by t_j^f, where f = 1, 2, . . . is a running spike index. We assume that there are many synapses 1 ≤ j ≤ N with N ≫ 1. Each presynaptic spike evokes a small postsynaptic current (PSC) that decays exponentially with time constant τ_s > 0. All synapses are equal in the sense that the incoming spikes evoke PSCs of identical shape and amplitude. The total input of the neuron is then taken to be

$$i(t) = \frac{1}{\tau_s}\sum_{j=1}^{N}\sum_{f}\exp\left(-\frac{t - t_j^f}{\tau_s}\right)\theta(t - t_j^f), \qquad (2.4)$$

where θ(.) denotes the Heaviside step function with θ(s) = 0 for s ≤ 0 and θ(s) = 1 for s > 0. We substitute equation 2.4 in 2.3 and integrate. This yields the membrane potential at the hillock,

$$u(t) = \sum_{j}\sum_{f}\epsilon(t - t_j^f) + \sum_{n}\eta(t - t^n). \qquad (2.5)$$

The first term on the right in equation 2.5,

$$\epsilon(s) = \frac{\tau_m}{\tau_m - \tau_s}\left[\exp\left(-\frac{s}{\tau_m}\right) - \exp\left(-\frac{s}{\tau_s}\right)\right]\theta(s), \qquad (2.6)$$

describes the typical time course of an excitatory postsynaptic potential (EPSP). If τ_s = τ_m, we have instead of equation 2.6 the so-called alpha function, ε(s) = (s/τ_m) exp(−s/τ_m) θ(s). The argument below does not depend on the precise form of ε. The second contribution to equation 2.5,

$$\eta(s) = -\vartheta \exp\left(-\frac{s}{\tau_m}\right)\theta(s), \qquad (2.7)$$
1992
3 Analysis of Coincidence Detection

We are going to examine the coincidence-detection properties of our model neuron. To study the dependence of the output firing rate on the temporal structure of the input and to answer the question of how this is influenced by neuronal parameters, we use the I&F model and the temporally coded spike input already introduced. Qualitative considerations, useful definitions, and illustrating simulations are presented in section 3.1. They explain the gist of why, and how, coincidence detection works. We then return to a mathematical treatment in section 3.2 and finish with some examples in section 3.3.

3.1 The Quality of a Coincidence Detector. We now explain how the ability of a neuron to act as a coincidence detector depends on the leakiness of the integrator (section 3.1.1), the threshold ϑ (section 3.1.2), the time constants τ_m and τ_s (section 3.1.3), and the number N of synapses as well as the mean input rate λ̄^in (section 3.1.4).

3.1.1 Leaky or Nonleaky Integrator? The most important parameter of the neuron model is the membrane time constant τ_m. If we take the limit τ_m → ∞, we are left with a simple nonleaky integrator (cf. equation 2.3). In this case, the mean output rate can be calculated explicitly. Integrating equation 2.3 from the nth output spike at t^n to the next one at t^{n+1}, we obtain ϑ = ∫_{t^n}^{t^{n+1}} dt i(t). A summation over M spikes yields

$$\vartheta = \frac{t^{n+M} - t^n}{M}\, N\bar\lambda^{\mathrm{in}} + \frac{1}{M}\int_{t^n}^{t^{n+M}} dt\,\left[i(t) - N\bar\lambda^{\mathrm{in}}\right], \qquad (3.1)$$
where we have separated the right-hand side into a first term that represents the contribution of the mean input current Nλ̄^in and a second term that is the fluctuation around the mean. In order to calculate the mean output rate λ̄^out, we have to consider the limit M → ∞. We introduce the mean output rate by defining λ̄^out := lim_{M→∞} M/(t^{n+M} − t^n). As M → ∞, the contribution from the second term on the right-hand side of equation 3.1 vanishes, and we are left with λ̄^out = Nλ̄^in/ϑ. The mean output rate is independent of the explicit form of the time-dependent input rate λ^in(t), and in particular of r^in, which is demonstrated graphically by Figure 2a. Hence we must have a finite τ_m, and thus a leaky integrator, if we want to end up with a coincidence detector whose rate varies with r^in. But what is meant by a "finite" τ_m? The answer depends on the value of the threshold ϑ.
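This independence is easy to verify numerically. In the sketch below (ours, with illustrative parameters echoing Figure 2: Nλ̄^in = 200 input spikes per period and ϑ = 200), the nonleaky integrator simply accumulates one unit of PSC charge per input spike and fires whenever the running total crosses ϑ, so only the spike counts, never their timing, can matter, and λ̄^out ≈ Nλ̄^in/ϑ = 1 spike per period.

```python
import numpy as np

def nonleaky_output_rate(sigma=None, N=400, p=0.5, theta=200.0,
                         n_periods=2000, rng=None):
    """Nonleaky integrator (tau_m -> infinity): the membrane sums PSC
    charge, one unit per input spike, and fires whenever the sum crosses
    theta. The jitter sigma is accepted but deliberately unused: for a
    nonleaky unit the temporal structure drops out, which is the point
    of equation 3.1."""
    rng = rng or np.random.default_rng(2)
    counts = rng.poisson(N * p, size=n_periods)  # input spikes per period
    return (counts.sum() // theta) / n_periods   # output spikes per period

if __name__ == "__main__":
    for sigma in (0.0, 0.25, 1e9):                 # any jitter gives the same answer
        print(sigma, nonleaky_output_rate(sigma))  # ~ N*p/theta = 1 per period
```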
Figure 2: Membrane potential u of an I&F neuron as a function of time t. The neuron receives input from 400 synapses with noise parameter p = 0.5 and vector strength r^in = 1 (perfectly phase-locked input: solid lines) and r^in = 0 (completely random input: dashed lines); cf. equation 2.1. There is a spike at t = 0. The time constant of the synaptic input current is τ_s = T. (a) For an infinitely large membrane time constant τ_m, the intervals between output spikes are nearly independent of the vector strength of the input. Such a neuron is a "bad" coincidence detector. (b) For a finite membrane time constant (here, τ_m = T), the mean interval between output spikes does depend on the vector strength of the input, if the threshold is chosen appropriately. For a threshold ϑ = ū_∞/2 (lower dotted line), the first spike would have occurred, nearly independently of the input vector strength r^in, somewhere near t = T, whereas for a threshold ϑ = ū_∞ (upper dotted line) the time to the first spike depends strongly on r^in. For r^in = 1, the first spike appears near t = 3T (left arrow) and a second spike near t = 7T (middle arrow). For r^in = 0, an output spike occurs at t = 8T (right arrow).

3.1.2 Voltage Threshold. We address the problem of how to adjust the threshold so that an I&F neuron can be used as a coincidence detector. Let
us assume for the moment that the firing threshold is very high (formally, ϑ → ∞), and let us focus on the temporal behavior of the membrane voltage u(t) with some input current i. The membrane potential cannot reach the threshold, so that there is neither an output spike nor a reset to baseline, and the membrane voltage fluctuates around the mean voltage ū_∞ = Nλ̄^in τ_m; see Figure 2b. (The voltage ū_∞ equals u(t) as t → ∞, provided the total input current is equal to its mean value i = Nλ̄^in.) We now lower the threshold so that the neuron occasionally emits a spike. The coincidence-detection properties of this neuron depend on the location of the threshold ϑ relative to ū_∞.

We analyze the dependence of the output firing rate on the threshold and the input vector strength r^in. As shown in Figure 3a, the mean output firing rate λ̄^out is rather insensitive to the input vector strength r^in for ϑ < ū_∞. We therefore get a poorly performing coincidence detector. In contrast, a threshold ϑ > ū_∞ leads to a large variation of the mean output firing rate as a function of the input vector strength r^in. Consequently we seem to obtain a better coincidence detector. The underlying mechanism of this improvement is illustrated by Figure 2b, where the trajectory of the membrane voltage u(t), after a reset at time t = 0, is shown for two cases: random and phase-locked input.

Let us imagine a threshold ϑ well below ū_∞, say, at ū_∞/2. In this case, the next spike following the one at t = 0 is triggered after a short time interval, the length of which depends only marginally on the degree of synchrony in the input. We are close to the regime of a nonleaky integrator. Formally, this can be seen from equation 2.3. Between two firings, the membrane potential always stays below threshold, u(t) < ϑ. If the average current is much larger than ϑ/τ_m, then the first term on the right-hand side of equation 2.3 can be neglected, and we do have a nonleaky neuron.

In contrast, let us consider the case ϑ > ū_∞. The threshold ϑ can be reached only if the fluctuations of u(t) are large enough. The fluctuations consist of a stochastic contribution, due to spike input (shot noise), and periodic oscillations, due to phase-locked (coherent) input. The key observation is that with increasing synchrony in the input, the periodic oscillations get larger and therefore the output firing rate increases. In order to quantify this effect, we introduce a new parameter.

Definition 1. The ratio of the mean output firing rate λ̄^out for coherent input with vector strength r^in > 0 to the rate for completely random input with vanishing vector strength is called the coherence gain E:

$$E(r^{\mathrm{in}}) := \frac{\bar\lambda^{\mathrm{out}}(r^{\mathrm{in}})}{\bar\lambda^{\mathrm{out}}(0)}, \quad\text{with } E(r^{\mathrm{in}}) \ge 1. \qquad (3.2)$$
A coherence gain E(r^in) ≈ 1 means that the I&F neuron does not operate as a coincidence detector, whereas E(r^in) ≫ 1 hints at good coincidence detection.
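As an illustration, here is a minimal Python sketch of this estimate. It assumes a concrete input model chosen to mimic Figure 2 (on average Np spikes per period, Gaussian phase jitter setting the vector strength, unit charge per spike through an exponential synaptic current so that τs = τm gives alpha-shaped EPSPs) and a simple reset-to-zero rule with no explicit refractory period; it is a sketch of the setup, not the authors' simulation code.

```python
import numpy as np

rng = np.random.default_rng(1)

def input_spikes(T, n_periods, Np, r_in):
    """Summed input: Poisson(Np) spikes per period T, phase-jittered by a
    Gaussian whose width sigma sets the vector strength, r_in = exp(-(sigma*w)^2/2)."""
    if r_in < 1e-6:                                  # completely random input
        n = rng.poisson(Np * n_periods)
        return np.sort(rng.uniform(0.0, n_periods * T, n))
    sigma = np.sqrt(-2.0 * np.log(min(r_in, 1.0 - 1e-12))) * T / (2.0 * np.pi)
    times = [m * T + sigma * rng.standard_normal(rng.poisson(Np))
             for m in range(n_periods)]
    return np.sort(np.concatenate(times))

def lif_rate(spikes, t_end, tau_m, tau_s, theta, dt):
    """Leaky I&F: du/dt = -u/tau_m + i(t); the synaptic current i jumps by
    1/tau_s per input spike (unit charge) and decays with tau_s.
    The membrane is reset to 0 when u crosses the threshold theta."""
    spikes = np.append(spikes, np.inf)
    u, i_syn, j, n_out = 0.0, 0.0, 0, 0
    for k in range(int(t_end / dt)):
        t = k * dt
        while spikes[j] < t + dt:
            i_syn += 1.0 / tau_s
            j += 1
        i_syn *= np.exp(-dt / tau_s)
        u += dt * (-u / tau_m + i_syn)
        if u >= theta:
            n_out += 1
            u = 0.0
    return n_out / t_end

T = tau_m = tau_s = 1.0
Np, n_periods = 200, 400
u_inf = Np * tau_m / T                    # mean voltage, u_inf = N*lambda_in*tau_m
du_stoch = 0.5 * np.sqrt(Np * tau_m / T)  # noise amplitude ~ 10/sqrt(2), cf. eq. 3.9
theta = u_inf + du_stoch                  # threshold "theta_1" of Figure 3
rate = {r: lif_rate(input_spikes(T, n_periods, Np, r), n_periods * T,
                    tau_m, tau_s, theta, dt=2e-3)
        for r in (0.0, 1.0)}
print("rates per period:", rate, "  E(1) ~", rate[1.0] / rate[0.0])
```

With the threshold one noise amplitude above ū∞ (the ϑ1 of Figure 3), the estimated E(1) should come out clearly above 1, in qualitative agreement with Figure 3b; moving ϑ well below ū∞ collapses the gain toward 1.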
With the above definition of the coherence gain, the four graphs in Figure 3a can be summarized by Figure 3b, where the dependence of E(r^in) on the threshold is shown for the special case r^in = 1. The coherence gain E(1) increases with increasing threshold ϑ. From Figure 3b, E(r^in) for any desired r^in can also be estimated. Since, to first approximation, λ̄^out depends linearly on r^in, we can use the approximations

$$\bar{\lambda}^{\mathrm{out}}(r^{\mathrm{in}}) \approx \bar{\lambda}^{\mathrm{out}}(0) + r^{\mathrm{in}} \left[ \bar{\lambda}^{\mathrm{out}}(1) - \bar{\lambda}^{\mathrm{out}}(0) \right] \tag{3.3}$$

and

$$E(r^{\mathrm{in}}) \approx 1 + r^{\mathrm{in}} \left[ E(1) - 1 \right]. \tag{3.4}$$
Equation 3.4 tells us that E(r^in) increases with increasing input vector strength r^in. Furthermore, E(r^in) inherits from E(1) the property that it increases with ϑ.

The measure E(r^in) is useful but not sufficient for characterizing the performance of a coincidence detector, because the output of a coincidence detector must convey a signal concerning the nature of the input (coherent or not coherent) in a finite amount of time. Neurons with E(1) ≫ 1 but with a very low mean output rate are basically useless as coincidence detectors.

Figure 3: Facing page. Coincidence-detection properties depend on the value of the spiking threshold. We show numerical results (solid lines) and approximations (dashed lines) based on equation 3.21 with τ^dec = 3/2 τm, τ^ref = 2 τm, and ρ determined from equation 3.15 with A√τ = √(τm/2). (a) Mean output firing rate λ̄^out (in units of spikes per period T) as a function of the input vector strength r^in for four different values of the threshold (ϑ−1 = ū∞ − Δu^stoch, ϑ0 = ū∞, ϑ1 = ū∞ + Δu^stoch, ϑ2 = ū∞ + 2Δu^stoch). The mean voltage is ū∞ = 200 (as in Figure 2b) and Δu^stoch = 10/√2; cf. equation 3.9. For large ϑ (e.g., ϑ2) the output rate varies by an order of magnitude if the temporal coherence of the input increases from r^in = 0 to r^in = 1. On the other hand, for ϑ = ϑ−1 the rate λ̄^out hardly depends on the temporal structure of the input. With r^in fixed, the rate λ̄^out increases with decreasing ϑ in all cases. (b) E(1) − 1 is plotted against the threshold ϑ. The coherence gain E is defined in equation 3.2. For the fits in b and c we have ρ = 0.70 (for r^in = 1) in equation 3.21. (c) The parameter γ (in units of T^{−1/2} I^{1/2}) that indicates the quality of a coincidence detector shows a maximum at ϑ_opt, which is above the mean voltage ū∞ = ϑ0. The value of ϑ_opt is the optimal choice for the threshold of this coincidence detector. In practical situations, it is immaterial, though, whether we take ϑ_opt or, for example, ϑ1. That is, the choice of the threshold is not critical. Points have been obtained through a combination of equation 3.6 and data from a and b. Simulation parameters in a, b, and c: τm = τs = T, Np = 200. Data points have been obtained by measuring time intervals that were needed to produce 10^4 output spikes.
This is the case, for example, for leaky I&F units with a voltage threshold well above ū∞ (cf. Figure 3a). In this regime the mean output rate is very low. If there is (even for r^in = 1) hardly any output spike in a certain time interval, then decisions about the neuron's input can be made only with great error. The other extreme case is a high mean output rate, which implies a threshold well below ū∞. In this regime E(1) is low; we are in the regime of a nonleaky integrator, where neurons are not able to perform coincidence detection either (cf. Figure 2a). Between the above two limits is an intermediate regime with an optimal threshold for coincidence detection. It is "optimal" in the sense that both the spike rate and the coherence gain are high.

The reason that we need a high spike rate is that the number n of output spikes in a finite interval I, from which the rate has to be estimated, is a random quantity. This is because of the noisy input. Here we assume that n is Poisson distributed, an approximation that is very good if spiking is driven by the variance of the neuron's input (Troyer & Miller, 1997). For two different input vector strengths, for example, r^in = 0 and r^in = r̂^in > 0, we have two different distributions P(r^in). The task is to distinguish the two cases based on a single measurement of the number of output spikes in I. We therefore need a number n0 as a decision boundary. If n ≥ n0, then we classify the input as having r^in = r̂^in > 0; if n < n0, then we classify r^in = 0. Clearly the probability of a correct decision depends on n0. In general, the optimal choice for the decision boundary n0^opt is the point where the two spike distributions P(r^in = 0) and P(r^in = r̂^in) cross (see, e.g., Duda & Hart, 1973). In order to distinguish the two alternatives reliably, the two respective distributions should not overlap too much. The error increases with decreasing "distance" between them. The distance is a helpful quantity that measures the probability of error, or the discriminability between the two possibilities. The quality factor for coincidence detection is defined as follows:

Definition 2. Let the number n of output spikes of a neuron in an interval of length I be Poisson distributed with parameter n̄(r^in) = λ̄^out(r^in) I, where λ̄^out(r^in) is the mean output firing rate given an input vector strength r^in. The normalized distance between the distribution for random input with vanishing vector strength and the distribution for coherent input with some r^in > 0 is defined to be the quality factor γ for coincidence detection. It is obtained by dividing the difference of the distributions' mean values by the sum of their standard deviations,

$$\gamma := \frac{\bar{n}(r^{\mathrm{in}}) - \bar{n}(0)}{\sqrt{\bar{n}(r^{\mathrm{in}})} + \sqrt{\bar{n}(0)}} = \sqrt{\bar{n}(r^{\mathrm{in}})} - \sqrt{\bar{n}(0)}. \tag{3.5}$$
Using the definition of the coherence gain E in equation 3.2, we obtain from equation 3.5

$$\gamma = \sqrt{I \, \bar{\lambda}^{\mathrm{out}}(0)} \left( \sqrt{E(r^{\mathrm{in}})} - 1 \right). \tag{3.6}$$
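As an illustration of how γ is used, a minimal sketch, assuming invented values for the rate and coherence gain, computes equation 3.6 and locates the decision boundary n0 where the two Poisson count distributions cross:

```python
import numpy as np
from scipy.stats import poisson

I = 100.0                  # counting interval, in periods T
rate0, E = 0.05, 2.0       # assumed rate for random input and coherence gain
mu0, mu1 = rate0 * I, E * rate0 * I        # Poisson means n(0) and n(r_in)
gamma = np.sqrt(I * rate0) * (np.sqrt(E) - 1.0)          # eq. 3.6
n0 = int(np.ceil((mu1 - mu0) / np.log(mu1 / mu0)))       # where the pmfs cross
# equal priors: average of miss and false-alarm probabilities
p_err = 0.5 * ((1.0 - poisson.cdf(n0 - 1, mu0)) + poisson.cdf(n0 - 1, mu1))
print(f"gamma = {gamma:.2f}, boundary n0 = {n0}, P(error) = {p_err:.2f}")
```

For these assumed numbers, γ comes out a little below 1 and the error probability near 0.2, consistent with the interpretation of γ = 1 as "just distinguishable" given later in section 3.3.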
Equation 3.6 shows how the quality factor γ increases with increasing I, λ̄^out(0), and E(r^in). It is important to realize that λ̄^out(0) and E(r^in) are not independent variables. Stemmler (1996) has related a quantity similar to γ to more sophisticated signal-detection quantities such as the "mutual information between the spike counts and the presence or absence of the periodic input" and the "probability of correct detection in the discrimination between the two alternatives," both of which can be expanded in powers of γ. We do not calculate these quantities here as a function of γ. To classify the quality of coincidence detection, γ itself suffices.

In Figure 3c we have plotted γ as a function of the threshold ϑ. This graph clearly shows that there is an optimal choice of threshold ϑ_opt that maximizes γ. The quality factor γ as a function of the threshold ϑ generally exhibits a maximum. We argue that γ(ϑ) approaches zero for ϑ → 0 and also for ϑ → ∞. Thus, there must be at least one maximum in between. The case ϑ → 0 corresponds to an infinitely high membrane time constant. This means that the neuron is effectively a nonleaky integrator. For this kind of integrator, the mean output rate does not depend on the input structure. Thus, E = 1 and γ(ϑ) → 0 as ϑ → 0 (cf. equation 3.6). In the case ϑ → ∞, we argue that n̄(r^in) → 0 as ϑ → ∞. Since n̄(r^in) > n̄(0), it follows from equation 3.5 that γ → 0 as ϑ → ∞. The value of the optimal threshold for coincidence detection will be estimated in section 3.2. As we will show there, for a high quality factor γ, it is not necessary that the threshold be exactly at its optimal value ϑ_opt. Since γ depends only weakly on the threshold, the latter is not a critical parameter for coincidence detection. In contrast, γ varies strongly if we change, say, neuronal time constants.

3.1.3 Neuronal Time Constants. We now point out briefly the dependence of the coherence gain E in equation 3.2, the rate λ̄^out(0), and the quality factor γ on the time constants τm and τs for the special case τm = τs. Figures 4a–c show that shorter neuronal time constants yield better coincidence detectors for each of the four different threshold values around ϑ_opt. The reason for this effect will be clarified in section 3.2.

3.1.4 Number of Synapses and Mean Input Rate. Since we have N identical and independent synapses receiving, on average, p spike inputs per period, and a linear neuron model (except for the threshold process), the variables N and p enter the problem of coincidence detection only via the product Np, the total number of inputs per period. The quantity Np will be treated as a single degree of freedom.
Figure 4: The coherence gain E(1) as defined in equation 3.2 for r^in = 1, the mean output rate λ̄^out(0) for random input in units of spikes per period T, and the quality factor γ in units of T^{−1/2} I^{1/2} (cf. equation 3.5) depend on neuronal parameters. Here we illustrate the dependence on the membrane time constant τm (left graphs) and the number Np of input spikes per period (right graphs) for four different threshold scenarios in each graph. For the EPSPs we have used equation 2.6 with τs = τm. To indicate the position of the threshold, we have assigned the numbers −1, 0, 1, and 2 to each trace (solid lines); these correspond to the indices of the threshold values ϑ−1 = ū∞ − Δu^stoch, ϑ0 = ū∞, ϑ1 = ū∞ + Δu^stoch, and ϑ2 = ū∞ + 2Δu^stoch, respectively, where ū∞ = Np τm/T and Δu^stoch = √(ū∞)/2 (cf. also equation 3.9). Fits to the numerical results (dashed lines) are based on equation 3.21 with τ^dec = 3/2 τm and τ^ref = 2 τm. (a–c) The dependence on τm is shown. To get a reasonable coincidence detector for Np = 200, the time constant τm should be at most of the same order of magnitude as T. In fact, this figure has a plain message: the smaller τm, the larger E(1), λ̄^out(0), and γ. The nonmonotonic part near τm = 0.35 T is due to resonance effects that occur for widths of EPSPs of the order of T. (d–f) The performance of a coincidence detector also depends on the number Np of input spikes per period. With fixed parameters τm = τs = T, the number Np has to exceed 1 substantially to get a reasonable coincidence detector. All data points have been obtained by measuring the number of output spikes in a time interval of width 10^5 T.
The dependence of E(1), λ̄^out(0), and γ on Np is shown in Figures 4d–f. The larger Np, the better the neuron's performance as a coincidence detector (cf. section 3.2).

To summarize this section, two quantities determine the quality of a coincidence detector neuron: (1) the rate λ̄^out(0) and (2) the coherence gain E(r^in), both of which enter the quality factor γ in equation 3.6. Both quantities depend on neuronal parameters. If, for example, the threshold is increased, then the coherence gain E is enhanced, but at the same time the rate λ̄^out(0) is lowered. We note that both the coherence gain E(r^in) and λ̄^out(0) are, at least in the framework of an I&F model, determined by the neuron's time constants τm and τs, the period T, the mean number Np of inputs per period, and the threshold ϑ.

3.2 Mathematical Treatment. To transfer the observations from I&F neurons to biological neurons, the quantities τm, τs, T, and Np have to be determined experimentally. To draw conclusions about the quality of a coincidence detector, knowledge of the spontaneous rate λ̄^out(0) and, thus, the threshold ϑ is necessary (cf. equation 3.6). But usually there is no direct access to E(r^in) or γ. To close this gap, we present a method of estimating E(r^in) and γ from experimentally available parameters.

The mathematical analysis that we present in this subsection is not limited to the I&F model introduced above. We derive our results for a more general class of threshold neurons whose response to an input spike can be described by an EPSP. This class of neuron models has been called the Spike Response Model (SRM) (Gerstner & van Hemmen, 1992, 1994) and can emulate other neuron models, such as the Hodgkin-Huxley model (Kistler, Gerstner, & van Hemmen, 1997).

3.2.1 Signal-to-Noise Ratio Analysis. We perform a signal-to-noise analysis of the membrane potential. Let us assume that the neuron has fired at times {t_m; m ≤ n}. We study the trajectory for t_n < t < t_{n+1} and set

$$u(t) = \bar{u}(t) + \delta u^{\mathrm{stoch}}(t) + \delta u^{\mathrm{per}}(t), \tag{3.7}$$
where ū(t) = Σ_{m≤n} η(t − t_m) + ū∞ is the reference trajectory of a neuron that receives a constant input current N λ̄^in. The membrane potential u(t) follows the trajectory ū(t) + δu^per(t) if it is driven by an input current N λ^in(t). Therefore, δu^stoch(t) and δu^per(t) are the stochastic fluctuations and the periodic oscillations, respectively. For the signal-to-noise ratio analysis, we de facto presuppose that the noise is gaussian. A normal distribution is the only one that is determined by its first and second moments, which will be used in the following analysis. For a large number N of independent synapses, this is an excellent approximation, as is shown in detail in section A.3.
We have seen before (cf. Figure 2b) that "good" coincidence-detection properties require a threshold ϑ above ū∞. In this case, spike generation is driven by the fluctuations, not by the mean trajectory. The magnitude of the stochastic fluctuations is determined by the mean number of input spikes per period and the shape of the EPSP. The amplitude of the periodic fluctuations depends, in addition, on the amount r^in of synchrony of the signal. Roughly speaking, the neuron will be able to distinguish between the coherent (r^in > 0) and the incoherent case (r^in = 0) if the total amount of fluctuation is different in the two cases. The typical amplitudes of the fluctuations will be denoted by Δu^stoch and Δu^per. We define an order parameter

$$\rho := \frac{\Delta u^{\mathrm{per}}(r^{\mathrm{in}})}{\Delta u^{\mathrm{stoch}}(0)}, \tag{3.8}$$
which will be related to the coherence gain E and the quality factor γ. The parameter ρ can be considered a signal-to-noise ratio, where the signal is given by the periodic modulation and the noise by the stochastic fluctuation of the membrane potential. For ρ ≈ 0, a low coherence gain E and quality factor γ are to be expected; for ρ ≫ 1 there should be a large E and γ. To confirm this conjecture, we relate Δu^stoch and Δu^per to the parameters of the input and the neuron model.

The calculation of Δu^stoch for r^in = 0 and Δu^per for r^in ≥ 0 is carried out for a class of typical EPSPs. The only requirement is that EPSPs should, as in equation 2.6, vanish for s ≤ 0, rise to a maximum, and decay thereafter back to zero (at least exponentially for s → ∞). The amplitude of an EPSP will be called A. The time window preceding any particular point in the neuron's activity pattern during which a variation in the input could have significantly affected the membrane potential is called τ (without a lower index, in contrast to τm and τs). This is the definition of the integration window of a neuron given by Theunissen and Miller (1995), which can be approximated by the full width at half maximum of the EPSP. The variance of the stochastic fluctuations is then proportional to the average number of inputs the neuron receives in a time interval τ times the amplitude A of the EPSP (for details see appendix A). For N λ̄^in τ ≫ 1, the standard deviation is to good approximation

$$\Delta u^{\mathrm{stoch}}(0) \approx A \sqrt{\frac{N \bar{\lambda}^{\mathrm{in}} \tau}{2}}. \tag{3.9}$$
Using, for example, equation 2.6 in equation A.8, we obtain A√τ = τm/√(τm + τs). To determine the amplitude of the periodic oscillations, we average over the Poisson process of the membrane voltage in equation 2.5 and denote the average by angular brackets ⟨·⟩. From equation 3.7 we have ⟨u − ū⟩(t) =
⟨δu^per⟩(t), since ⟨δu^stoch⟩(t) = 0, and thus

$$\langle \delta u^{\mathrm{per}} \rangle(t) = N \int_0^\infty ds \left[ \lambda^{\mathrm{in}}(t - s) - \bar{\lambda}^{\mathrm{in}} \right] \epsilon(s). \tag{3.10}$$
The amplitude Δu^per of the T-periodic oscillations will be estimated by the absolute value of the first Fourier coefficient of ⟨δu^per⟩(t). The kth Fourier coefficient of a T-periodic function x(t) is defined by

$$\tilde{x}_k := \frac{1}{T} \int_t^{t+T} dt' \, x(t') \exp(-i k \omega t'), \quad \text{with } \omega := \frac{2\pi}{T}. \tag{3.11}$$
Now the amplitude of the periodic oscillations can be written as

$$\Delta u^{\mathrm{per}} = \left| \langle \delta \tilde{u}^{\mathrm{per}}_1 \rangle \right|. \tag{3.12}$$
To calculate the right-hand side of equation 3.12, we also have to introduce the Fourier transform of quadratically integrable functions, for example, of the response kernel ε defined in equation 2.6,

$$\tilde{\epsilon}(\omega) := \int_{-\infty}^{\infty} dt' \, \epsilon(t') \exp(-i \omega t'). \tag{3.13}$$
Carrying out the Fourier transform in equation 3.12 and using equation 3.13, we obtain Δu^per = N |λ̃^in_1 ε̃(ω)|, where λ̃^in_1 = (p/T) r^in = λ̄^in r^in is the first Fourier coefficient of λ^in defined in equation 2.1. The final result for the signal amplitude is then

$$\Delta u^{\mathrm{per}} = N \bar{\lambda}^{\mathrm{in}} r^{\mathrm{in}} \, |\tilde{\epsilon}(\omega)|, \tag{3.14}$$
where the definition of the vector strength r^in in equation 2.2 has been used. The order parameter ρ in equation 3.8 can now be rewritten, with equations 3.9 and 3.14, in terms of experimentally accessible parameters of the input and neuronal parameters:

$$\rho \approx r^{\mathrm{in}} \sqrt{2 N \bar{\lambda}^{\mathrm{in}} \tau} \; \frac{|\tilde{\epsilon}(\omega)|}{A \tau}. \tag{3.15}$$
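The filter factor |ε̃(ω)| dominates equation 3.15. A minimal numerical cross-check, assuming the alpha-shaped kernel ε(s) = (s/τm) e^{−s/τm} θ(s) that is used later in section 3.3, confirms the closed form τm/(1 + ω²τm²):

```python
import numpy as np

tau_m, T = 1.0, 1.0
w = 2.0 * np.pi / T
s = np.linspace(0.0, 60.0 * tau_m, 600001)
eps = (s / tau_m) * np.exp(-s / tau_m)                  # alpha kernel
eps_tilde = np.trapz(eps * np.exp(-1j * w * s), s)      # eq. 3.13
print(abs(eps_tilde), tau_m / (1.0 + (w * tau_m) ** 2)) # should agree
```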
One has to keep in mind that equation 3.15 contains the signal-to-noise ratio in the membrane potential and concerns only one of two aspects of coincidence detection. The second aspect is related to the mean output firing
rate and, thus, the threshold, which should be chosen appropriately, as we discussed in section 3.1 (see also section 4). The order parameter ρ alone can be used only to derive necessary conditions for coincidence detection. For small ρ, it seems unlikely that a neuron acts as a coincidence detector, but a small ρ could be compensated for by a pool of such neurons (see also sections 3.3 and 4). We think that as a rule of thumb, one can exclude that a neuron is a coincidence detector if ρ < 0.1. Taking advantage of equation 3.15, we now derive an expression for the mean output rate λ̄^out(r^in) and for γ.

3.2.2 Mean Output Rate. As we stated in the previous section, in the absence of a firing threshold the neuron's membrane voltage fluctuates around a mean value ū∞ with a standard deviation Δu^stoch, which is due to the noisy input. From appendix A, we also know that, to excellent approximation, the voltage fluctuations are gaussian. Then the probability that at an arbitrary time the membrane voltage u is above ϑ can be estimated by

$$w(\vartheta) = \frac{1}{\sqrt{2\pi} \, \Delta u^{\mathrm{stoch}}} \int_\vartheta^\infty du \, \exp\left[ -\frac{(u - \bar{u}_\infty)^2}{2 \, (\Delta u^{\mathrm{stoch}})^2} \right], \tag{3.16}$$

which can be rewritten in terms of the error function,

$$w(\vartheta) = \frac{1}{2} \left[ 1 - \mathrm{erf}\left( \frac{\vartheta - \bar{u}_\infty}{\sqrt{2} \, \Delta u^{\mathrm{stoch}}} \right) \right]. \tag{3.17}$$
If ϑ > ū∞, the average time that a voltage fluctuation stays above ϑ is called τ^dec; it is expected to be of the order of the width τ of the integration window, since τ^dec is the time needed for any voltage fluctuation to decay. Therefore, the mean time interval between two events u > ϑ can be approximated by τ^dec/w(ϑ). A neuron's dynamics is such that firing occurs if the membrane voltage reaches the threshold ϑ from below. After firing, there is a refractory period τ^ref during which the neuron cannot fire. This prolongs the mean waiting time τ^dec/w(ϑ) until the next event u = ϑ by τ^ref. Taking refractoriness into account, the mean interspike interval τ^isi can be approximated by

$$\tau^{\mathrm{isi}} \approx \tau^{\mathrm{dec}}/w(\vartheta) + \tau^{\mathrm{ref}}. \tag{3.18}$$
The mean output rate is λ̄^out = 1/τ^isi. Substituting equation 3.17 into 3.18, we obtain the mean output rate for random input,

$$\bar{\lambda}^{\mathrm{out}}(0) = \left\{ 2\tau^{\mathrm{dec}} \left[ 1 - \mathrm{erf}\left( \vartheta' / \sqrt{2} \right) \right]^{-1} + \tau^{\mathrm{ref}} \right\}^{-1}. \tag{3.19}$$
Here we have introduced the normalized threshold,

$$\vartheta' = (\vartheta - \bar{u}_\infty)/\Delta u^{\mathrm{stoch}}. \tag{3.20}$$
If the neuron's input has a periodic contribution, then the output rate is increased. We now calculate the rate λ̄^out(r^in) for arbitrary r^in ≥ 0. The key assumption is that we are allowed to take the oscillatory input into account solely through a modified threshold ϑ′. The periodic contribution enhances the standard deviation of the membrane potential around its mean value. Thus, the threshold is effectively lowered by the normalized amplitude ρ(r^in) of the periodic oscillations of the membrane potential (cf. equation 3.8). A generalization of equation 3.19 leads to

$$\bar{\lambda}^{\mathrm{out}}(r^{\mathrm{in}}) \approx \left\{ 2\tau^{\mathrm{dec}} \left[ 1 - \mathrm{erf}\left( \left( \vartheta' - \rho \right) / \sqrt{2} \right) \right]^{-1} + \tau^{\mathrm{ref}} \right\}^{-1}. \tag{3.21}$$
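Equations 3.19 through 3.21 are straightforward to evaluate. A minimal sketch, assuming the fit parameters of Figure 3 (τ^dec = 3/2 τm, τ^ref = 2 τm) and the quoted values ϑ′ = 1 and ρ = 0.70:

```python
import numpy as np
from scipy.special import erf

def mean_rate(theta_n, rho, tau_dec, tau_ref):
    """Mean output rate of eq. 3.21; eq. 3.19 is the special case rho = 0.
    theta_n is the normalized threshold of eq. 3.20."""
    w = 0.5 * (1.0 - erf((theta_n - rho) / np.sqrt(2.0)))
    return 1.0 / (tau_dec / w + tau_ref)

tau_m = 1.0                                   # time in units of the period T
tau_dec, tau_ref = 1.5 * tau_m, 2.0 * tau_m
theta_n, rho = 1.0, 0.70                      # threshold theta_1, rho of Figure 3
lam0 = mean_rate(theta_n, 0.0, tau_dec, tau_ref)   # random input
lam1 = mean_rate(theta_n, rho, tau_dec, tau_ref)   # fully coherent input
print(f"rates {lam0:.3f} and {lam1:.3f} per period, E(1) = {lam1 / lam0:.2f}")
print("gamma =", np.sqrt(100.0 * lam0) * (np.sqrt(lam1 / lam0) - 1.0))  # I = 100 T
```

The resulting E(1) of roughly 2 and the random-input rate of roughly 0.09 spikes per period should match the order of magnitude read off Figures 3a and 3b.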
Since we have an expression for the mean output rate in equation 3.21, we also have expressions for the coherence gain E and the quality factor γ (cf. equations 3.2 and 3.6). In Figure 3 the numerical results for λ̄^out, E, and γ can be fitted, at least within an order of magnitude, by using τ^dec = 3/2 τm and τ^ref = 2 τm.

3.2.3 Quality Factor. To arrive at a better understanding of the dependence of the quality factor γ on the normalized threshold ϑ′ and the signal-to-noise ratio ρ, we have plotted γ(ϑ′) in Figure 5, taking τ^dec = 3/2 τm and τ^ref = 2 τm (as in Figures 3 and 4) for different values of ρ. The graphs illustrate that γ varies only weakly as a function of ϑ′ if ρ is held constant. The maximum value is, at least for 0 < ρ < 1, close to ϑ′ = 1. This confirms the result of Figure 3c and the conjecture that the optimal threshold lies above the mean voltage by an amount equal to the amplitude of the stochastic fluctuations.

To get a handy expression for γ, we use an approximation of the error function (Ingber, 1982),

$$\frac{1}{2} \left[ 1 - \mathrm{erf}(x) \right] \approx \frac{1}{1 + \exp\left( 4x/\sqrt{\pi} \right)}, \tag{3.22}$$
which is accurate to better than 2% for all x. At least for 0 < ρ ≪ 1 we are able to derive a simple expression for γ. Using the definition of γ in equation 3.6 and linearizing in ρ, we obtain

$$\gamma = \rho \, \sqrt{\frac{2 I}{\pi \, \tau^{\mathrm{dec}}}} \; \exp\hat{\vartheta} \left[ 1 + \exp\hat{\vartheta} + \tau^{\mathrm{ref}}/\tau^{\mathrm{dec}} \right]^{-3/2}, \tag{3.23}$$
Figure 5: The quality factor γ (in units of τm^{−1/2} I^{1/2}) defined in equation 3.6 for the model of λ̄^out in equation 3.21, as a function of the dimensionless threshold ϑ′ defined in equation 3.20. The quantity γ is plotted (solid lines) for six different values of ρ, as indicated in the graph. The time constants in equation 3.21 have been chosen to agree with numerical results from simulations of an I&F model when EPSPs are "alpha" functions. Thus, it is reasonable to assume τ^ref = 2 τm (cf. equation 2.7) and τ^dec = 3/2 τm (see also Figures 3 and 4). For each value of ρ, the quantity γ(ϑ′) is a smoothly varying function of ϑ′ and shows a single maximum. The dashed line connects the maxima. The value of ϑ′ at the maximum is the optimal threshold for coincidence detection. For ρ < 1, the optimal threshold is approximately at ϑ′ = 1. Since the maxima are broad, the threshold value is not critical for coincidence detection.
where we have used ϑ̂ = 4ϑ′/√(2π). For 0 < ρ ≪ 1, the quality factor has a maximum as a function of the threshold at

$$\hat{\vartheta}_{\mathrm{opt}} = \ln\left[ 2 \left( 1 + \tau^{\mathrm{ref}}/\tau^{\mathrm{dec}} \right) \right]. \tag{3.24}$$

Substituting ϑ̂_opt into equation 3.23 and inserting ρ from equation 3.15, we get an upper bound for the quality factor, which is the key result of this article:

$$\gamma \leq \gamma_{\mathrm{opt}} = r^{\mathrm{in}} \sqrt{2 N \bar{\lambda}^{\mathrm{in}} \tau} \; \frac{|\tilde{\epsilon}(\omega)|}{A \tau} \cdot \frac{4}{\sqrt{54\pi}} \sqrt{\frac{I}{\tau^{\mathrm{dec}} + \tau^{\mathrm{ref}}}}. \tag{3.25}$$
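A minimal numerical sanity check, assuming the same τ^dec and τ^ref as above, confirms that equation 3.24 maximizes equation 3.23 and reproduces the prefactor 4/√(54π) of equation 3.25:

```python
import numpy as np

tau_dec, tau_ref, rho, I = 1.5, 2.0, 0.1, 100.0   # tau_m = 1 units; rho << 1

def gamma_eq323(theta_hat):
    """Quality factor of eq. 3.23 (linearized in rho)."""
    e = np.exp(theta_hat)
    return (rho * np.sqrt(2.0 * I / (np.pi * tau_dec)) * e
            * (1.0 + e + tau_ref / tau_dec) ** -1.5)

th = np.linspace(0.0, 5.0, 100001)
th_num = th[np.argmax(gamma_eq323(th))]            # numerical maximum
th_opt = np.log(2.0 * (1.0 + tau_ref / tau_dec))   # eq. 3.24
g_opt = rho * 4.0 / np.sqrt(54.0 * np.pi) * np.sqrt(I / (tau_dec + tau_ref))
print(th_num, th_opt)                 # agree to grid resolution
print(gamma_eq323(th_opt), g_opt)     # agree exactly (eq. 3.25 with rho factored out)
```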
3.3 Examples. We use the response kernel ε(s) = (s/τm) exp(−s/τm) θ(s) in equation 2.5. The absolute value of its Fourier transform is |ε̃(ω)| = τm/(1 + ω²τm²). From equations 3.15 and A.8 (see also the remark after equation 3.9), we then obtain the signal-to-noise ratio,

$$\rho = \sqrt{N \bar{\lambda}^{\mathrm{in}} \tau_m} \; \frac{2 \, r^{\mathrm{in}}}{1 + \omega^2 \tau_m^2}. \tag{3.26}$$
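The arithmetic of the two worked examples that follow is easy to reproduce. A minimal sketch, assuming equation 3.25 in the equivalent form γ_opt = ρ · (4/√(54π)) · √(I/(τ^dec + τ^ref)) and the parameter values quoted below:

```python
import numpy as np

def rho_326(N, lam_in, tau_m, f, r_in):
    """Signal-to-noise ratio of eq. 3.26 for an alpha-shaped EPSP."""
    w = 2.0 * np.pi * f                     # stimulus frequency f = 1/T
    return np.sqrt(N * lam_in * tau_m) * 2.0 * r_in / (1.0 + (w * tau_m) ** 2)

def gamma_opt(rho, tau_m, I):
    """Upper bound of eq. 3.25 with tau_dec = 3/2 tau_m and tau_ref = 2 tau_m."""
    return rho * 4.0 / np.sqrt(54.0 * np.pi) * np.sqrt(I / (3.5 * tau_m))

I = 0.1                                                  # 100 ms window, in s
# cortical neuron: N = 10^4 synapses, 5 Hz input, tau_m = 10 ms, 40 Hz component
rho_c = rho_326(1e4, 5.0, 0.010, 40.0, r_in=0.4)
# laminar neuron (barn owl): N = 200, 0.5 kHz input, tau_m = 0.1 ms, 5 kHz
rho_l = rho_326(200, 500.0, 1e-4, 5e3, r_in=0.5)
print(rho_c, gamma_opt(rho_c, 0.010, I))                 # ~2.4 and ~1.2
print(rho_l, gamma_opt(rho_l, 1e-4, I))                  # ~0.29 and ~1.5
```

The printed values reproduce the figures ρ ≈ 2.4 and 0.29 and γ_opt ≈ 1.2 and 1.5 quoted in the text.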
Let us calculate ρ for cortical neurons. Due to background activity, the (effective) membrane time constant for voltages in the neighborhood of the threshold can be as low as 10 ms (Bernander, Douglas, Martin, & Koch, 1991; cf. also Rapp, Yarom, & Segev, 1992). We assume EPSPs as "alpha" functions and N = 10^4 synapses firing at λ̄^in = 5 Hz. If the input has a periodic component of 40 Hz (T = 25 ms), we calculate from equation 3.26 a value of ρ = 6.1 r^in. The vector strength r^in can be related to the relative modulation amplitude A^rel of almost periodic correlograms by r^in ≈ √(A^rel/2) (see appendix B), which is a good approximation for small A^rel. A value of A^rel ≈ 0.3 is reasonable (Gray & Singer, 1989). We find r^in ≈ 0.4 and a signal-to-noise ratio of ρ ≈ 2.4. Therefore cortical neurons possess all the prerequisites necessary for coincidence detection.

As a further application of equation 3.26, we turn to neurons in the nucleus laminaris of the barn owl. Model studies have shown that the laminar neurons can indeed act as coincidence detectors (Gerstner, Kempter, van Hemmen, & Wagner, 1996). These neurons have (almost) no dendritic trees. A set of parameters N = 200, τm = τs = 0.1 ms, λ̄^in = 0.5 kHz, λ̄^out = 100 Hz, and r^in = 0.5 for T^{−1} = 5 kHz is reasonable (Carr & Konishi, 1990; Carr & Boudreau, 1991; Reyes, Rubel, & Spain, 1996). From equation 3.26, we obtain ρ = 0.29. This value should be compared to the much better signal-to-noise ratio ρ ≈ 2.4 that we find for cortical neurons.

We now compare the values of the upper bound γ_opt of the quality factor in equation 3.25 for the two types of neurons. We take an interval of, say, I = 100 ms to get numerical values that can be interpreted, though the length of I is not important for the comparison. For cortical neurons we assume τ^dec = 3/2 τm and τ^ref = 2 τm, the parameters used throughout this article. From equation 3.25, we then obtain γ_opt = 1.2. For laminar neurons, we also assume τ^dec = 3/2 τm and τ^ref = 2 τm, whereby τm = 0.1 ms, and obtain γ_opt = 1.5. We conclude that both types of neurons have comparable coincidence-detection properties. In laminar neurons, the relatively low number N of synapses is compensated by a high mean input rate λ̄^in, so as to achieve the same performance as cortical neurons. In both examples, the ratio τm/T and the input vector strength r^in were almost identical.

What does a quality factor of, say, γ = 1.5 mean? We remind readers of the definition of γ in equation 3.5. The quality factor measures the distance between the spike count distributions for random and coherent input. For γ = 1, the two distributions are just distinguishable, and for γ ≫ 1 they are well separated. The error probability or, better, the extent
to which random and coherent input can be discriminated can be calculated from γ. The corresponding algorithm is the subject of further investigation.

4 Discussion

This study demonstrates the influence of the parameters of an I&F neuron on the capability of the neuron to work as a coincidence detector for periodically modulated spike input. The dependence on the membrane time constant τm has been demonstrated in Figures 2 and 4a–c, the influence of the number of inputs per period Np was treated in Figures 4d–f, and the relation to the threshold ϑ has been shown in Figures 3 and 5. An order parameter ρ for coincidence detection has been defined in equation 3.15 by dividing the amplitude of the periodic oscillations (when the neuron receives phase-locked input) by the amplitude of the stochastic fluctuations of the membrane voltage (when the neuron receives random input). Finally, ρ has been related to the quality factor γ for coincidence detection.

Our reasoning is not limited to I&F neurons. It can be applied directly to neurons whose excitatory synapses have almost equal strengths and evoke similar EPSPs that sum up linearly, at least below the threshold. The extension to a distribution of synaptic strengths and forms of EPSPs is also straightforward. With some additional effort, phase-locked inhibitory inputs could also be incorporated. Our model does not include, though, nonlinear effects in the dendritic tree (Softky, 1994).

The shape of the EPSP plays the most important role for coincidence detection. More precisely, the relevant parameter is the absolute value of the Fourier component of the response kernel ε at the frequency of the external stimulus (cf. equation 3.15), which expresses the (low-pass) filtering property of synaptic transmission. Nevertheless, the rule of thumb holds that the briefer the EPSPs, the better are the coincidence-detection properties of the corresponding neuron. The width of the EPSP has to be at most of the same order of magnitude as the minimal temporal structure it should resolve (cf. Figure 4a).

In addition, the number of synapses and their mean activity determine whether a neuron is able to perform coincidence detection. With T-periodically modulated input, our results show that the more input spikes per unit time, the easier is coincidence detection (cf. Figure 4b). This is due to the fact that the ratio between the signal (= oscillations of the membrane voltage) and the noise (= random fluctuations) increases with increasing mean input rate. The contribution of many synapses also enhances coincidence-detection properties, which is extremely important for neurons receiving a large number of inputs, such as cortical pyramidal cells with about 10^4 synapses or cerebellar Purkinje cells with approximately 2·10^5 synapses.

One can summarize the influence of the width τ of an EPSP and the number of inputs per unit time a neuron receives on coincidence detection as follows. The neuron's "memory" extends over an interval τ back in time,
so the neuron cannot "see" earlier input spikes. They have little influence on the membrane potential because the corresponding tails of the EPSPs are small and decay exponentially in time. (For the moment, we neglect refractory effects.) Hence, the number of inputs in the neuron's integration time window of length τ determines the state (the membrane potential) of the neuron. If the number of inputs in this shifting time window shows a significant T-periodic oscillation, then it is in principle possible for the neuron to perform coincidence detection. This is a rate-coding scheme where the firing rate has to be measured within an averaging time window. This argument shows that the width of an EPSP, which corresponds to the averaging time window, should be small. If it greatly exceeds one period, then averaging will be of no use at all.

For coincidence detection there is an optimal threshold value, as illustrated by Figures 3c and 5. For optimal coincidence detection, the threshold ϑ has to surpass the mean membrane voltage ū∞ = N λ̄^in τm of the neuron by an amount equal to the noise amplitude. A higher threshold implies a lower mean output firing rate, which destroys the advantage of a high coherence gain E. A lower threshold leads to the regime of a nonleaky integrator, which is not at all suited to coincidence detection. Thus coincidence detection in "real" neurons requires an adaptive mechanism to control the threshold. There are several different possibilities for that. First, we could imagine a control loop that adjusts the threshold in the appropriate regime. This might be difficult to implement but could be achieved, for example, if each spike is followed by a long-lasting hyperpolarizing afterpotential. Alternatively, we could envisage a feedback loop of inhibitory neurons that adjusts the mean input. A control of input strength is also possible through synaptic changes (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). Finally, it has been shown in a model study (Gerstner et al., 1996) that potentiation and depression of synaptic weights can balance each other so that the effective input strength is always normalized (see also Markram, Lübke, Frotscher, & Sakmann, 1997). However, the threshold is not a critical parameter for coincidence detection, as is illustrated by the broad maximum of γ as a function of the threshold in Figures 3c and 5. The threshold also determines the mean firing rate of a neuron. For reasonable firing rates, the quality factor remains of the same order of magnitude as its optimal value γ_opt (cf. equation 3.25).

The existence of an optimal threshold can be related to the phenomenon of stochastic resonance (Wiesenfeld & Moss, 1995), in that in the presence of noise the detection of weak signals can be enhanced and there is an optimal noise level. It seems unlikely, though, that neurons are able to change the level of noise in their input. A neuron potentially has the chance of adapting its threshold to an optimal value, as we have discussed. We have shown that the optimal threshold for coincidence detection is, similar to stochastic resonance, always above ū∞ by an amount that is of the same order of magnitude as the noise amplitude.
Having the parameter γ at hand, one still has to be careful with rash conclusions about a neuron's task. Let us consider a neuron whose γ is small. One might argue that such a neuron cannot function as a coincidence detector, and this is certainly correct if we consider the neuron as a single unit. But if there is a pool of neurons operating in the same pathway and receiving the same type of input, the output of all these neurons together could provide a secure cue for a decision. Also, the waiting time necessary to make a correct decision with high reliability can be reduced by using a pool of neurons. That is, the following two counting methods are equivalent: a system can use either the output spike count of a single neuron in an interval I or the number of spikes of L statistically independent, identical neurons operating in the same pathway during a period of time I/L.

Although we have considered the transition from spike to rate coding, the output spikes remain phase locked to a periodic input. This means that the neuronal transmission always retains some of the temporal information, an aspect that we think is important to signal processing in the brain.

Appendix A: Inhomogeneous Poisson Process

In this appendix we define and analyze the inhomogeneous Poisson process. This notion has been touched on by Tuckwell (1988, pp. 218–220) and others (e.g., Ash & Gardner, 1975, pp. 28–29), but neither of them explains the formalism itself or the way of computing expectation values. Since both are used extensively, we do so here, despite the fact that the issue is treated by Snyder and Miller (1991, secs. 2.1–2.3). Our starting assumptions in handling this problem are the same as those of Gnedenko (1968, sec. 51) for the homogeneous (uniform) Poisson process, but the mathematics is different. Neither does our method resemble the Snyder and Miller approach, which starts from the other end, equation A.11. In the context of theoretical neurobiology, an analysis such as this one, focusing on the local behavior of a process, seems to us far more natural. We proceed by evaluating the mean and the variance and finish this appendix by estimating a third moment, which is needed for the Berry-Esseen inequality that tells us how good a gaussian approximation to a sum of independent random variables is.

A.1 Definitions. Let us suppose that a certain event, in our case a spike, occurs at random instances of time. Let N(t) be the number of occurrences of this event up to time t. We suppose that N(0) = 0, that the probability of getting a single event during the interval [t, t + Δt) with Δt → 0 is

$$\Pr\{N(t + \Delta t) - N(t) = 1\} = \lambda(t) \, \Delta t, \qquad \lambda \geq 0, \tag{A.1}$$
and that the probability of getting two or more events is o(Δt). Finally, the process has independent increments; events in disjoint intervals are
independent. The stochastic process obeying the above conditions is an inhomogeneous Poisson process. Under conditions on λ to be specified below, there are only finitely many events in a finite interval. Hence, the process lives on a space Ω of monotonically nondecreasing, piecewise constant functions on the positive real axis, having finitely many unit jumps in any finite interval. The expectation value corresponding to this inhomogeneous Poisson process is simply an integral with respect to a probability measure μ on Ω, a function space whose existence is guaranteed by the Kolmogorov extension theorem (Ash, 1972, sec. 4.4.3). A specific realization of the process, a function on the positive real axis, is a "point" ω in Ω. The discrete events corresponding to ω are denoted by t^f(ω), with f labeling them. As we have seen in equation 2.5, spikes generate postsynaptic potentials ε. We now compute the average, denoted by angular brackets, of the postsynaptic potentials generated by a specific neuron during the time interval [t0, t),

$$\left\langle \sum_f \epsilon(t - t^f(\omega)) \right\rangle. \tag{A.2}$$
Here it is understood that t^f = t^f(ω) depends on the realization ω and t0 ≤ t^f(ω) < t. We divide the interval [t0, t) into L subintervals [t_l, t_{l+1}) of length Δt so that at the end Δt → 0 and L → ∞ while LΔt = t − t0. We now evaluate the integral (see equation A.2), exploiting the fact that ε is a continuous function. Let #{t_l ≤ t^f(ω) < t_{l+1}} denote the number of events (spikes) occurring at times t^f(ω) in the interval [t_l, t_{l+1}) of length Δt. In the limit Δt → 0, the expectation value (see equation A.2) can be written

$$\int_\Omega d\mu(\omega) \left[ \sum_l \epsilon(t - t_l) \, \#\{t_l \leq t^f(\omega) < t_{l+1}\} \right], \tag{A.3}$$

so that we are left with the Riemann integral,

$$\int_{t_0}^{t} ds\, \lambda(s)\, \epsilon(t - s). \tag{A.4}$$
We spell out why. The function 1I_{...} is to be the indicator function of the set {...} in Ω; that is, 1I_{...}(ω) = 1 if ω ∈ {...}, and it vanishes if ω does not belong to {...}, so it "indicates" where the set {...} lives. With the benefit of hindsight, we single out mutually independent sets in Ω with indicators 1I_{t_l ≤ t^f(ω) < t_{l+1}} and write the expectation value A.3 as

$$\int_\Omega d\mu(\omega) \sum_l 1\mathrm{I}_{\{t_l \leq t^f(\omega) < t_{l+1}\}}(\omega)\, \epsilon(t - t_l)\, \#\{t_l \leq t^f(\omega) < t_{l+1}\}. \tag{A.5}$$

Each indicator function in the sum equals 1 on a set whose measure follows from equation A.1 and the exclusion of multiple events,

$$\mu\{\# = 0\} = 1 - \lambda(t_l)\Delta t + o(\Delta t), \quad \mu\{\# = 1\} = \lambda(t_l)\Delta t + o(\Delta t), \quad \mu\{\# \geq 2\} = o(\Delta t). \tag{A.6}$$
In view of equations A.2 and A.5, we multiply this by ε(t − t_l) #{t_l ≤ t^f(ω) < t_{l+1}}, interchange integration and summation in equation A.5, and integrate with respect to μ. The first term on the right contributes nothing; the second gives λ(t_l) ε(t − t_l) Δt and thus produces a term in the Riemann sum leading to equation A.4; and the last term can be neglected since it is of order o(Δt). The proof of the pudding is that only a single event in the interval [t_l, t_{l+1}) counts as Δt → 0. Since ε(t) is a function that decreases at least exponentially fast as t → ∞, there is no harm in taking t0 = −∞.

A.2 Second Moment and Variance. It is time to compute the second moment,

$$\left\langle \Big[ \sum_{t^f} \epsilon(t - t^f) \Big]^2 \right\rangle. \tag{A.7}$$
In a similar vein as before, we obtain, in the limit Δt → 0,

$$\begin{aligned}
\left\langle \sum_{t^f, t^{f'}} \epsilon(t - t^f)\,\epsilon(t - t^{f'}) \right\rangle
&= \int_\Omega d\mu(\omega) \sum_{l,m} 1\mathrm{I}_{\{t_l \leq t^f(\omega) < t_{l+1}\}}\, 1\mathrm{I}_{\{t_m \leq t^{f'}(\omega) < t_{m+1}\}}\, \epsilon(t - t_l)\,\epsilon(t - t_m) \\
&= \sum_{l \neq m} \left[ \lambda(t_l)\Delta t\; \lambda(t_m)\Delta t \right] \epsilon(t - t_l)\,\epsilon(t - t_m) + \int_\Omega d\mu(\omega) \sum_l 1\mathrm{I}^2_{\{t_l \leq t^f(\omega) < t_{l+1}\}}\, \epsilon^2(t - t_l) \\
&= \int_{t_0}^{t}\!\!\int_{t_0}^{t} dt_1\, dt_2\, \lambda(t_1)\,\lambda(t_2)\,\epsilon(t - t_1)\,\epsilon(t - t_2) + \int_{t_0}^{t} ds\, \lambda(s)\,\epsilon^2(t - s) \\
&= \left[ \int_{t_0}^{t} ds\, \lambda(s)\,\epsilon(t - s) \right]^2 + \int_{t_0}^{t} ds\, \lambda(s)\,\epsilon^2(t - s).
\end{aligned} \tag{A.8}$$
Hence the variance is the last term on the right in equation A.8. It is a simple exercise to verify that when λ(t) ≡ λ and ε(t) ≡ 1 in equations A.4 and A.8, we regain the mean and variance of the usual Poisson distribution. We finish the argument by computing the probability of getting k events in the interval [t0, t). For the usual, homogeneous Poisson process it is

$$\Pr\{N(t) - N(t_0) = k\} = \exp[-\lambda(t - t_0)] \cdot \frac{[\lambda(t - t_0)]^k}{k!}. \tag{A.9}$$
We now break up the interval [t0, t) into many subintervals [τ_l, τ_{l+1}) of length Δt and condition with respect to the first, second, ..., arrival. The arrivals come one after the other, and the probability of a specific sequence of events in [t1, t1 + Δt), [t2, t2 + Δt), ..., [tk, tk + Δt) is made up of elementary events such as

$$\begin{aligned}
\Pr\{\text{first spike in } [t_1, t_1 + \Delta t)\} &= \Pr\{\text{no spike in } [t_0, t_1)\} \cdot \Pr\{\text{spike in } [t_1, t_1 + \Delta t)\} \\
&= [1 - \lambda(\tau_1)\Delta t][1 - \lambda(\tau_2)\Delta t] \cdots [1 - \lambda(t_1 - \Delta t)\Delta t]\; \lambda(t_1)\Delta t \\
&= \exp\left[ -\int_{t_0}^{t_1} d\tau\, \lambda(\tau) \right] \lambda(t_1)\,\Delta t.
\end{aligned} \tag{A.10}$$
Here we have exploited the independent-increments property and taken the limit Δt → 0 to obtain the last equality. Repeating the above argument for the following events, including the no-event tail in [tk + Δt, t), multiplying the probabilities, and summing over all possible realizations, we find

$$\begin{aligned}
\Pr\{N(t) - N(t_0) = k\} &= \exp\left[ -\int_{t_0}^{t} d\tau\, \lambda(\tau) \right] \int_{t_0}^{t} dt_k\, \lambda(t_k) \cdots \int_{t_0}^{t_3} dt_2\, \lambda(t_2) \int_{t_0}^{t_2} dt_1\, \lambda(t_1) \\
&= \exp\left[ -\int_{t_0}^{t} d\tau\, \lambda(\tau) \right] \cdot \frac{1}{k!} \left[ \int_{t_0}^{t} ds\, \lambda(s) \right]^k.
\end{aligned} \tag{A.11}$$
In other words, N(t) − N(t0) has a Poisson distribution with parameter ∫_{t0}^{t} ds λ(s). If λ(s) ≡ λ, one regains equation A.9. We now see two things. First, the appropriate condition on λ is that it be locally integrable. Then Pr{N(t) − N(t0) < ∞} = 1, as the sum of equation A.11 over all finite k adds up to one. Furthermore, N(t) − N(t′0) with t0 < t′0 < t has a Poisson distribution with parameter ∫_{t′0}^{t} ds λ(s). Second, by rescaling time through t′ := ∫ ds λ(s) one obtains (Tuckwell, 1988; Ash & Gardner, 1975) a homogeneous Poisson process with parameter λ = 1. This also follows more directly from equation A.1. It is of no practical help for understanding neuronal coincidence detection, though. For instance, if we use a spike train generated by an inhomogeneous Poisson process with rate λ(t) to drive, say, a leaky I&F neuron, its mean output firing rate does depend on the temporal structure of λ(t), as we have argued. This effect cannot be explained by simply rescaling time. Another example is provided by the auditory system, where λ(t) is taken to be a periodic function of t, with the period determined by external sound input. The cochlea, however, produces a whole range of frequency inputs, whereas time can be rescaled only once.
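This construction also yields a recipe for simulation. A minimal sketch, using thinning (a standard sampling method not discussed here: candidates are generated at a constant rate λ_max, and each is kept with probability λ(t)/λ_max) together with an illustrative sinusoidal rate and alpha-shaped ε, checks the event count against the Poisson parameter of equation A.11 and the shot-noise moments against equations A.4 and A.8:

```python
import numpy as np

rng = np.random.default_rng(0)
T, t_end = 25.0, 1000.0                                     # ms
lam = lambda t: 0.2 * (1.0 + np.cos(2.0 * np.pi * t / T))   # rate, 1/ms
lam_max = 0.4

def sample_inhomogeneous(lam, lam_max, t_end):
    """Thinning: homogeneous candidates at rate lam_max, each kept
    with probability lam(t)/lam_max."""
    t_cand = np.cumsum(rng.exponential(1.0 / lam_max, int(3 * lam_max * t_end)))
    t_cand = t_cand[t_cand < t_end]
    return t_cand[rng.uniform(size=t_cand.size) < lam(t_cand) / lam_max]

eps = lambda s: (s / 10.0) * np.exp(-s / 10.0)              # alpha EPSP, tau_m = 10 ms

counts, psp = [], []
for _ in range(2000):
    spk = sample_inhomogeneous(lam, lam_max, t_end)
    counts.append(spk.size)
    psp.append(eps(t_end - spk).sum())                      # summed PSPs at t = t_end
s = np.linspace(0.0, t_end, 200001)
print(np.mean(counts), np.trapz(lam(s), s))                    # eq. A.11 parameter
print(np.mean(psp), np.trapz(lam(s) * eps(t_end - s), s))      # mean, eq. A.4
print(np.var(psp), np.trapz(lam(s) * eps(t_end - s) ** 2, s))  # variance, eq. A.8
```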
A.3 Berry-Esseen Estimate. Equation 2.5 tells us that the neuronal input is a sum of independent, identically distributed random variables corresponding to neighboring neurons j. Neither independence nor a common
distribution is necessary, but both are quite convenient. The point is that, according to the central limit theorem, a sum of N independent random variables¹ has a gaussian distribution as N → ∞. In our case, N is definitely finite, so the question is: How good is the gaussian approximation? The answer is provided by a classical, and remarkable, result of Berry and Esseen (Lamperti, 1966, sec. 15).

We first formulate the Berry-Esseen result. Let X1, X2, ... be independent with a common distribution having variance σ² and finite third moment. Furthermore, let S_N = Σ_{j=1}^{N} (X_j − ⟨X_j⟩) be the total input, the X_j stemming from neighboring neurons j as given by the right-hand side of equation 2.5 with N as the number of synapses, and let Y_σ be a gaussian with mean 0 and variance σ². Then there is a constant (2π)^{−1/2} ≤ C < 0.8 such that, whatever the distribution of the X_j and whatever x,

$$\left| \Pr\left\{ \frac{S_N}{\sqrt{N}} \leq x \right\} - \Pr\{Y_\sigma \leq x\} \right| \leq \frac{C \, \langle |X_1 - \langle X_1 \rangle|^3 \rangle}{\sigma^3 \sqrt{N}}. \tag{A.12}$$

In the present case, σ² directly follows from equation A.8. Computing ⟨|X1 − ⟨X1⟩|³⟩ is a bit nasty, but it is simpler, and also provides more insight, to estimate the third moment directly by Cauchy-Schwarz so as to get rid of the absolute value,

$$\langle |X_1 - \langle X_1 \rangle|^3 \rangle \leq \langle (X_1 - \langle X_1 \rangle)^2 \rangle^{1/2} \, \langle (X_1 - \langle X_1 \rangle)^4 \rangle^{1/2}. \tag{A.13}$$

The first term on the right equals σ; the second is given by

$$\langle (X_1 - \langle X_1 \rangle)^4 \rangle = \int_{t_0}^{t} ds\, \lambda(s)\, \epsilon^4(t - s) + 3\sigma^4, \tag{A.14}$$

where σ² = ∫_{t0}^{t} ds λ(s) ε²(t − s). Collecting terms, we can estimate the right-hand side of equation A.12, the precision of the gaussian approximation being determined by 1/√N as N becomes large.

Appendix B: Cross-Correlograms and Degree of Synchrony

Here we outline the relationship between the relative modulation A^rel of cross-correlograms and the underlying degree of synchrony, r^in. A spike input generated by equation 2.1 leads to a periodic cross-correlation function,

$$C(t) \propto \sum_{m=-\infty}^{\infty} G_{\sigma\sqrt{2}}(t - mT). \tag{B.1}$$
¹ This N directly corresponds with the number of synapses that provide the neuronal input. There is no need to confuse it with the stochastic variable N(t) of the previous subsection.
The relative amplitude A^rel of C is to be defined below. It is estimated from the Fourier transform of C, which is defined in equation 3.11. The Fourier coefficients of equation B.1 are

$$\tilde{C}_k \propto \exp\left[ -(k \sigma \omega)^2 \right], \quad \text{with } \omega = \frac{2\pi}{T}. \tag{B.2}$$

For σω > 1 the first Fourier component dominates, all higher coefficients can be neglected, and we can approximate equation B.1 by

$$C(t) \approx \tilde{C}_0 + 2 \tilde{C}_1 \cos(\omega t), \tag{B.3}$$

where 2C̃1 is the amplitude of the first-order oscillation. Then A^rel, defined as the relative modulation of the cross-correlogram (cf., e.g., König, Engel, & Singer, 1995), is approximated by

$$A^{\mathrm{rel}} \approx \frac{2 \tilde{C}_1}{\tilde{C}_0}. \tag{B.4}$$

Substituting equation B.2 into B.4, we obtain

$$A^{\mathrm{rel}} \approx 2 \exp\left[ -(\sigma \omega)^2 \right]. \tag{B.5}$$
Using the definition of the vector strength (see equation 2.2) in B.5 and solving for r^in, we find

$$r^{\mathrm{in}} \approx \sqrt{\frac{A^{\mathrm{rel}}}{2}}. \tag{B.6}$$

The restriction σω > 1 corresponds to r^in < 0.6 or A^rel < 0.7. The oscillation amplitude of the cross-correlation function C(t) for spike activity as found in various brain areas decays to zero with increasing |t| (cf. Gray & Singer, 1989) because neuronal activity is not strictly periodic. Most of the cross-correlograms can be fitted by generalized Gabor functions of the form (cf., for example, König et al., 1995)

$$C(t) \propto \left[ 1 + A^{\mathrm{rel}} \cos(\omega t) \right] \exp\left( -\frac{t^2}{\lambda^2} \right), \tag{B.7}$$

where λ is a time constant. In this case we obtain a measure of the degree of synchrony r^in also from equation B.6, which originally was derived for the periodic case only. The transfer of the arguments from the periodic to the nonperiodic case is reasonable if λ is of the order of a few oscillation periods T or longer. For coincidence detection, only correlations within the integration time τ of a neuron are important. For neurons that are able to act as coincidence detectors, τ has to be at most of the order of T, so that reasonable coincidence detectors do not "see" the decay of the correlation function.
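A minimal sketch of these conversions, assuming the Gaussian-peak correlogram model of equation B.1:

```python
import numpy as np

def r_in_from_A_rel(A_rel):
    """Vector strength from the relative correlogram modulation, eq. B.6."""
    return np.sqrt(A_rel / 2.0)

def A_rel_from_sigma(sigma, T):
    """Relative modulation for Gaussian jitter sigma and period T, eq. B.5."""
    return 2.0 * np.exp(-(sigma * 2.0 * np.pi / T) ** 2)

print(r_in_from_A_rel(0.3))   # ~0.39, the cortical example of section 3.3
```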
Acknowledgments

We thank Julian Eggert and Werner Kistler for stimulating discussions, helpful comments, and a careful reading of the manuscript. We also thank Jack Cowan and Richard Palmer for some useful hints concerning the title. This work has been supported by the Deutsche Forschungsgemeinschaft under grant numbers He 1729/8-1 (RK) and He 1729/2-2, 8-1 (WG).
References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Ash, R. B. (1972). Real analysis and probability. New York: Academic Press.
Ash, R. B., & Gardner, M. F. (1975). Topics in stochastic processes. New York: Academic Press.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Burgess, N., Recce, M., & O'Keefe, J. (1994). A model of hippocampal function. Neural Networks, 7, 1065–1081.
Carr, C. E. (1992). Evolution of the central auditory system in reptiles and birds. In D. B. Webster, R. R. Fay, & A. N. Popper (Eds.), The evolutionary biology of hearing (pp. 511–543). New York: Springer-Verlag.
Carr, C. E. (1993). Processing of temporal information in the brain. Ann. Rev. Neurosci., 16, 223–243.
Carr, C. E., & Boudreau, R. E. (1991). Central projections of auditory nerve fibers in the barn owl. J. Comp. Neurol., 314, 306–318.
Carr, C. E., & Konishi, M. (1990). A circuit for detection of interaural time differences in the brain stem of the barn owl. J. Neurosci., 10, 3227–3246.
Davis, J. L., & Eichenbaum, H. (Eds.). (1991). Olfaction: A model system for computational neuroscience. Cambridge, MA: MIT Press.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Freeman, W. J. (1975). Mass action in the nervous system. New York: Academic Press.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., & van Hemmen, J. L. (1994). Coding and information processing in neural networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 1–93). New York: Springer-Verlag.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gnedenko, B. V. (1968). The theory of probability (4th ed.). New York: Chelsea.
Goldberg, J. M., & Brown, P. B. (1969). Response of binaural neurons of dog superior olivary complex to dichotic tonal stimuli: Some physiological mechanisms of sound localization. J. Neurophysiol., 32, 613–636.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Ingber, L. (1982). Statistical mechanics of neocortical interactions. I. Basic formulation. Physica, 5D, 83–107.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19, 130–137.
Köppl, C. (1997). Phase locking to high frequencies in the auditory nerve and cochlear nucleus magnocellularis of the barn owl, Tyto alba. J. Neurosci., 17, 3312–3321.
Lamperti, J. (1966). Probability. New York: Benjamin.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Murthy, V. N., & Fetz, E. E. (1992). Coherent 25 to 35 Hz oscillations in the sensorimotor cortex of awake behaving monkeys. Proc. Natl. Acad. Sci. USA, 89, 5670–5674.
Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Comput., 4, 518–533.
Reyes, A. D., Rubel, E. W., & Spain, W. J. (1996). In vitro analysis of optimal stimuli for phase-locking and time-delayed modulation of firing in avian nucleus laminaris neurons. J. Neurosci., 16, 993–1007.
Ritz, R., Gerstner, W., Fuentes, U., & van Hemmen, J. L. (1994). A biologically motivated and analytically soluble model of collective oscillations in the cortex. II. Application to binding and pattern segmentation. Biol. Cybern., 71, 349–358.
Ritz, R., Gerstner, W., & van Hemmen, J. L. (1994). Associative binding and segregation in a network of spiking neurons. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 177–223). New York: Springer-Verlag.
Snyder, D. L., & Miller, M. I. (1991). Random point processes in time and space (2nd ed.). New York: Springer-Verlag.
Softky, W. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Stemmler, M. (1996). A single spike suffices: The simplest form of stochastic resonance. Network, 7, 687–716.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comp. Neurosci., 2, 149–162.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comput., 9, 971–983.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: Vol. 2. Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Wang, D., Buhmann, J., & von der Malsburg, C. (1990). Pattern segmentation in associative memory. Neural Comput., 2, 94–106.
Wehr, M., & Laurent, G. (1996). Odour encoding by temporal sequences of firing in oscillating neural assemblies. Nature, 384, 162–166.
Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benefits of noise: From ice ages to crayfish and SQUIDs. Nature, 373, 33–36.

Received August 21, 1997; accepted March 16, 1998.
LETTER
Communicated by Jack Cowan
Connecting Cortical and Behavioral Dynamics: Bimanual Coordination

V. K. Jirsa, A. Fuchs, J. A. S. Kelso
Program in Complex Systems and Brain Sciences, Center for Complex Systems, Florida Atlantic University, Boca Raton, Florida 33431, U.S.A.
For the paradigmatic case of bimanual coordination, we review levels of organization of behavioral dynamics and present a description in terms of modes of behavior. We briefly review a recently developed model of spatiotemporal brain activity that is based on short- and long-range connectivity of neural ensembles. This model is specified for the case of motor and sensorimotor units embedded in the neural sheet. Focusing on the cortical left-right symmetry, we derive a bimodal description of the brain activity that is connected to behavioral dynamics. We make predictions of global features of brain dynamics during coordination tasks and test these against experimental magnetoencephalogram (MEG) results. A key feature of our approach is that phenomenological laws at the behavioral level can be connected to a field-theoretical description of cortical dynamics.

1 Introduction

In coordinated movements, typically several states related to different behavioral patterns can be found—for example, different gaits of horses (Collins & Stewart, 1993; Schöner, Jiang, & Kelso, 1990) or different configurations among the joints for trajectory formation tasks (Buchanan, Kelso, & de Guzman, 1997; Kelso, Buchanan, & Wallace, 1991). These states have different stabilities dependent on external or internal control parameters. When such control parameters are manipulated, coordination states may become unstable, and the system exhibits a transition from one state to another. These phenomena have been investigated intensively, experimentally and theoretically, and mathematical models have been set up reproducing the experimentally observed coordination behavior as well as predicting new effects (see Haken, 1996; Kelso, 1995, for reviews). On the other hand, recent magnetoencephalogram (MEG) and electroencephalogram (EEG) experiments (Kelso et al., 1992; Wallenstein, Kelso, & Bressler, 1995) have investigated the spatiotemporal brain dynamics during coordination of finger movements with external periodic stimuli. To accommodate these results, a

Neural Computation 10, 2019–2045 (1998) © 1998 Massachusetts Institute of Technology
mathematical phenomenological model was developed describing ongoing brain activity (Jirsa, Friedrich, Haken, & Kelso, 1994). In Jirsa and Haken (1996a, 1996b), a more biologically motivated field theory of the spatiotemporal brain dynamics was elaborated, which combined properties of neural ensembles, including their short- and long-range connections in the cortex, in addition to describing the interaction of functional units embedded into the neural sheet. This approach was applied to a brain-coordination experiment (Kelso et al., 1992), where the subject's task was to coordinate rhythmic behavior of one finger with an external acoustic stimulus. During this experiment, the MEG of the subject was recorded.

Complex systems, such as the brain, have the general property that they perform low-dimensional behavior during transitions from one macroscopic state to another (Cross & Hohenberg, 1993; Haken, 1983, 1987). This behavior has also been found in the analyses (Fuchs, Kelso, & Haken, 1992; Jirsa, Friedrich, & Haken, 1995) of brain data from the coordination experiment in Kelso et al. (1992). On the basis of such analyses, the phenomenological model in Jirsa et al. (1994) describing the brain activity was derived qualitatively from the biologically motivated theory in Jirsa and Haken (1996a, 1997).

In this article, we treat rhythmic coordination behavior between the subjects' two index fingers, that is, two equivalent internal oscillators. This experimental condition fundamentally differs from the situation described in Kelso et al. (1992), where an internal oscillator (finger) interacts with an external oscillator (metronome). The goal of this article is to show how it may be possible to traverse levels of organization from the behavioral level to the brain level. For this purpose, we choose a bimanual coordination experiment (Kelso, 1981, 1984) in which a transition in coordinated behavior is observed between finger movements when a control parameter is changed. Our line of thought is as follows:

• On the purely behavioral level we distinguish two levels of organization: First is the collective level, represented by the relative phase between the fingers. This collective variable characterizes the dynamic state in which the system is found. Second, the component level is represented by the positions of the individual fingers and their velocities. These component variables perform an oscillatory behavior and interact nonlinearly. When both levels of description are consistent, the description on the collective level can be derived from the component level. This traverse of scales from the collective to the component level has been performed by Haken, Kelso, and Bunz (1985) and is known in the literature as the HKB model. In section 2 we give a brief review of their results. The same strategy has been applied to a number of other systems (see Kelso, 1995).

• Our next step (also in section 2) will be to transform the HKB model from the component level to the mode level. These modes describe the same dynamics as the HKB model (because it is still the same model),
but represent the behavioral coordination states. From Friedrich, Fuchs, and Haken (1991), Kelso et al. (1992), and Fuchs et al. (1992), we know that low-dimensional behavioral dynamics can be reflected in the brain dynamics. Since we want to focus here on a bimanual coordination experiment where a phase transition is observed on the behavioral level, we also expect a low-dimensional brain dynamics in the EEG/MEG, which can be described in terms of modes as in Jirsa et al. (1994). Thus, we will take the HKB model in the representation of the behavioral modes as a guideline for the development of a model of the brain dynamics.

• On the brain level we start our discussion with an introduction of a recently developed field theory of the brain (Jirsa & Haken, 1996a, 1997) in section 3.1. We specify this model for the bimanual coordination case by defining sensorimotor and motor areas as functional units. By using anatomical symmetry arguments and the working hypothesis of low-dimensional brain dynamics, we can reduce the field-theoretical description to a set of equations governing the temporal dynamics of two spatial brain modes. These equations correspond to the equations of the behavioral modes of the HKB model and thus allow us to connect the behavioral and the brain levels. From this treatment, we make predictions about the macroscopic spatiotemporal brain dynamics potentially observable in brain experiments using EEG/MEG. We test these predictions in a preliminary fashion against experimental MEG results from a bimanual coordination experiment.

2 The Level of Behavior

Experimental studies by one of us (Kelso, 1981, 1984), as well as by others (see Carson, Byblow, & Goodman, 1994, for a review), have shown that abrupt phase transitions occur in human finger movements under the influence of scalar changes in cycling frequency. Below a critical cycling frequency, two dynamical patterns or states are possible: an in-phase state where the finger movements are symmetric and an antiphase state where the finger movements are antisymmetric. Starting the finger movements in the antiphase state and increasing the cycling frequency, a spontaneous transition from antiphase to in-phase occurs at the critical frequency. Beyond this frequency, only the in-phase state is stable. Further, it is experimentally observed that the amplitude of the finger movements decreases when the cycling frequency is increased. These phenomena were theoretically modeled by Haken et al. (1985) by formulating a model system for the dynamics of the collective variable represented by the relative phase Φ between the fingers. This model system was then used as a guide to establish a model for the dynamics of the component variables represented by the finger positions x1 and x2. Here we give
a brief summary of the results, distinguishing the collective and the component level.

2.1 Collective Level. In complex systems, collective variables are not known in advance; they have to be identified. Transitions offer a useful entry point. In bimanual coordination, the collective level may be represented by the relative phase Φ between the fingers. This collective variable characterizes the coordinative state in which the system is found, and its dynamics is governed by

$\dot\Phi = -a \sin\Phi - 2b \sin 2\Phi,$  (2.1)

where a and b are constant parameters but dependent on a control parameter, the cycling frequency Ω. The dynamics of Φ shows bistability at Φ = 0, π if

$b > \dfrac{|a|}{4},$  (2.2)
and monostability at Φ = 0 otherwise. Starting at Φ = π and then changing the parameters a and b such that b = |a|/4, the system undergoes a transition from Φ = π to Φ = 0 and remains in this state.

2.2 Component Level. The component level is represented by the positions x1 and x2 of the individual fingers and their velocities. These component variables perform an oscillatory behavior and interact nonlinearly. Two ordinary differential equations, again based on detailed experimental results, describe the dynamics of the individual fingers with the amplitudes x1 and x2. This model system reads

$\ddot x_1 + (A x_1^2 + B \dot x_1^2 - \gamma)\,\dot x_1 + \Omega^2 x_1 = (\dot x_1 - \dot x_2)\,(\alpha + \beta (x_1 - x_2)^2),$  (2.3)

$\ddot x_2 + (A x_2^2 + B \dot x_2^2 - \gamma)\,\dot x_2 + \Omega^2 x_2 = (\dot x_2 - \dot x_1)\,(\alpha + \beta (x_1 - x_2)^2).$  (2.4)
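To make this transition scenario concrete, the following minimal numpy sketch integrates equations 2.3 and 2.4 with a fourth-order Runge-Kutta stepper while the cycling frequency Ω is slowly ramped upward. All numerical values (A, B, γ, α, β, the ramp range, the initial state) are illustrative assumptions satisfying α < 0 and β > 0, not values fitted to experimental data; for this choice the critical frequency of equation 2.7 works out to Ωc = √3 ≈ 1.73.

```python
import numpy as np

A, B, gamma, alpha, beta = 1.0, 1.0, 1.0, -0.2, 0.5   # illustrative parameters

def rhs(s, omega):
    """Right-hand side of eqs. 2.3 and 2.4 written as a first-order system."""
    x1, v1, x2, v2 = s
    coup = alpha + beta * (x1 - x2) ** 2
    a1 = -(A*x1**2 + B*v1**2 - gamma)*v1 - omega**2*x1 + (v1 - v2)*coup
    a2 = -(A*x2**2 + B*v2**2 - gamma)*v2 - omega**2*x2 + (v2 - v1)*coup
    return np.array([v1, a1, v2, a2])

def rk4(s, dt, omega):
    k1 = rhs(s, omega)
    k2 = rhs(s + 0.5*dt*k1, omega)
    k3 = rhs(s + 0.5*dt*k2, omega)
    k4 = rhs(s + dt*k3, omega)
    return s + dt/6.0 * (k1 + 2*k2 + 2*k3 + k4)

dt, T = 0.002, 400.0
steps = int(T / dt)
s = np.array([0.43, 0.0, -0.43, 0.0])      # start near the antiphase state
for n in range(steps):
    omega = 1.2 + 1.0 * n * dt / T          # slow ramp through Omega_c = sqrt(3)
    s = rk4(s, dt, omega)
    if n % (steps // 10) == 0:
        th1 = np.arctan2(-s[1]/omega, s[0]) # instantaneous oscillator phases
        th2 = np.arctan2(-s[3]/omega, s[2])
        Phi = np.angle(np.exp(1j*(th1 - th2)))
        print(f"t={n*dt:6.1f}  Omega={omega:4.2f}  Phi={Phi:+5.2f} rad")
```

Below the critical frequency the printed relative phase should remain near ±π (antiphase); once the ramp passes Ωc, it should settle near 0 (in-phase), mirroring the experimentally observed transition.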
The left-hand sides of equations 2.3 and 2.4 describe the motion of the individual fingers, while the right-hand sides describe the coupling. These equations can be solved for small amplitudes by the following ansatz:

$x_i = r_i e^{i\Phi_i} e^{i\Omega t} + r_i e^{-i\Phi_i} e^{-i\Omega t}$  with  $i = 1, 2,$  (2.5)
where $r_i$ is a real, time-dependent amplitude and $\Phi_i$ a real, time-dependent phase. Mathematical analysis (Haken et al., 1985) reveals the following properties of the model system. In the steady state, the nontrivial amplitude of the oscillators is

$r_i^2 = r^2 = \dfrac{\gamma}{A + 3B\Omega^2}$  (2.6)
and depends on the cycling frequency Ω. Here $r_i$ is independent of the index i. The relative phase $\Phi = \Phi_1 - \Phi_2$ obeys the dynamics in equation 2.1, and the parameters are given by $a = -(\alpha + 2\beta r^2)$ and $b = \frac{1}{2}\beta r^2$. An increase in the cycling frequency causes a decrease in the amplitude of the oscillators. The transition from antiphase Φ = π to in-phase Φ = 0 occurs at a critical amplitude, which is given by

$r_c^2 = \dfrac{\gamma}{A + 3B\Omega_c^2} = \dfrac{-\alpha}{4\beta}.$  (2.7)
The model system (see equations 2.3 and 2.4) reproduces the main experimental phenomena observed in the bimanual coordination experiment and predicts other properties, such as relaxation times, critical fluctuations, and switching times, that have been checked quantitatively in experiments (Kelso, Scholz, & Schöner, 1986; Scholz, Kelso, & Schöner, 1987).

2.3 Mode Level. With the goal of connecting the behavioral coordination laws (see equations 2.1, 2.3, and 2.4) to brain dynamics, we introduce an alternative description of these phenomena in terms of symmetric and antisymmetric modes. These modes directly correspond to the behavioral states of the system, for which we seek correspondence at the brain level. We define the following variables:

$\tilde\psi_+ = x_1 + x_2, \qquad \tilde\psi_- = x_1 - x_2.$  (2.8)
These variables represent modes of behavior, where $\tilde\psi_+$ corresponds to the symmetric (in-phase) mode and $\tilde\psi_-$ to the antisymmetric (antiphase) mode. The backtransformation onto the amplitudes of finger movement reads

$x_1 = \frac{1}{2}\left(\tilde\psi_+ + \tilde\psi_-\right), \qquad x_2 = \frac{1}{2}\left(\tilde\psi_+ - \tilde\psi_-\right).$  (2.9)
In order to obtain the equations governing the dynamics of the new variables $\tilde\psi_+, \tilde\psi_-$, we sum and subtract equations 2.3 and 2.4, respectively, and obtain

$\ddot{\tilde\psi}_+ - \gamma\dot{\tilde\psi}_+ + \Omega^2\tilde\psi_+ + \frac{A}{12}\frac{\partial}{\partial t}\left(\tilde\psi_+^3 + 3\tilde\psi_-^2\tilde\psi_+\right) + \frac{B}{4}\left(\dot{\tilde\psi}_+^3 + 3\dot{\tilde\psi}_-^2\dot{\tilde\psi}_+\right) = 0$

$\ddot{\tilde\psi}_- - \gamma\dot{\tilde\psi}_- + \Omega^2\tilde\psi_- + \frac{A}{12}\frac{\partial}{\partial t}\left(\tilde\psi_-^3 + 3\tilde\psi_+^2\tilde\psi_-\right) + \frac{B}{4}\left(\dot{\tilde\psi}_-^3 + 3\dot{\tilde\psi}_+^2\dot{\tilde\psi}_-\right) = 2\dot{\tilde\psi}_-\left(\alpha + \beta\tilde\psi_-^2\right)$  (2.10)
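The algebra behind this mode transformation is mechanical but easy to get wrong, so a symbolic check is useful. The following sympy sketch (our own verification; all function and symbol names are ours) substitutes $x_1 = (\tilde\psi_+ + \tilde\psi_-)/2$ and $x_2 = (\tilde\psi_+ - \tilde\psi_-)/2$ into equations 2.3 and 2.4 and confirms that their sum and difference reproduce equation 2.10.

```python
import sympy as sp

t = sp.symbols('t')
A, B, gamma, Omega, alpha, beta = sp.symbols('A B gamma Omega alpha beta')
x1, x2 = sp.Function('x1')(t), sp.Function('x2')(t)

def residual(xa, xb):
    """LHS minus RHS of eq. 2.3 written for oscillator xa coupled to xb."""
    coup = (xa.diff(t) - xb.diff(t)) * (alpha + beta*(xa - xb)**2)
    return (xa.diff(t, 2) + (A*xa**2 + B*xa.diff(t)**2 - gamma)*xa.diff(t)
            + Omega**2*xa - coup)

res_sum  = residual(x1, x2) + residual(x2, x1)
res_diff = residual(x1, x2) - residual(x2, x1)

pp, pm = sp.Function('psi_p')(t), sp.Function('psi_m')(t)   # psi~_+, psi~_-
subs = {x1: (pp + pm)/2, x2: (pp - pm)/2}

exp_sum = (pp.diff(t, 2) - gamma*pp.diff(t) + Omega**2*pp
           + sp.Rational(1, 12)*A*sp.diff(pp**3 + 3*pm**2*pp, t)
           + sp.Rational(1, 4)*B*(pp.diff(t)**3 + 3*pm.diff(t)**2*pp.diff(t)))
exp_diff = (pm.diff(t, 2) - gamma*pm.diff(t) + Omega**2*pm
            + sp.Rational(1, 12)*A*sp.diff(pm**3 + 3*pp**2*pm, t)
            + sp.Rational(1, 4)*B*(pm.diff(t)**3 + 3*pp.diff(t)**2*pm.diff(t))
            - 2*pm.diff(t)*(alpha + beta*pm**2))

print(sp.simplify(sp.expand(res_sum.subs(subs).doit() - exp_sum)))   # -> 0
print(sp.simplify(sp.expand(res_diff.subs(subs).doit() - exp_diff))) # -> 0
```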
The left-hand sides of equation 2.10 represent fully symmetric (with respect to the exchange of the indices + and −), nonlinearly coupled equations. The
former coupling terms with α and β in the variables $x_1, x_2$ now appear only in one equation, solely in terms of the antisymmetric mode $\tilde\psi_-$. In order to treat the system in 2.10 analytically, we make the following ansatz:

$\tilde\psi_+ = R_+ e^{i\Phi_+} e^{i\Omega t} + R_+ e^{-i\Phi_+} e^{-i\Omega t}$  (2.11)

$\tilde\psi_- = R_- e^{i\Phi_-} e^{i\Omega t} + R_- e^{-i\Phi_-} e^{-i\Omega t},$  (2.12)
where $R_+, R_-$ denote real, time-dependent amplitudes and $\Phi_+, \Phi_-$ the corresponding time-dependent phases. Inserting this ansatz into equations 2.10 and performing two approximations well known in nonlinear oscillator theory (the rotating wave approximation and the slowly varying amplitude approximation; see, e.g., Haken, 1983), we obtain the following equations for the amplitudes,

$\dot R_+ = \frac{1}{2}\gamma R_+ - a(\Omega)\left(R_+^2 + 2R_-^2 + R_-^2\cos 2(\Phi_- - \Phi_+)\right) R_+$

$\dot R_- = \frac{1}{2}\gamma R_- - a(\Omega)\left(R_-^2 + 2R_+^2 + R_+^2\cos 2(\Phi_+ - \Phi_-)\right) R_- + \alpha R_- + \beta R_-^3,$  (2.13)
and for the phases

$\dot\Phi_+ = -a(\Omega)\, R_-^2 \sin 2(\Phi_- - \Phi_+)$  (2.14)

$\dot\Phi_- = -a(\Omega)\, R_+^2 \sin 2(\Phi_+ - \Phi_-),$  (2.15)
where $a(\Omega) = \frac{1}{8}(A + 3B\Omega^2)$. Defining the new variable $\phi = \Phi_+ - \Phi_-$, we can rewrite equations 2.14 and 2.15 as

$\dot\phi = -a(\Omega)\left(R_+^2 + R_-^2\right)\sin 2\phi,$  (2.16)
which has the only stable solutions $\phi = \pi/2,\ 3\pi/2$ for nontrivial amplitudes $R_+, R_-$. Thus, equation 2.13 can be reduced to

$\dot R_+ = \frac{1}{2}\gamma R_+ - a(\Omega)\left(R_+^2 + R_-^2\right) R_+ = -\dfrac{\partial V}{\partial R_+}$

$\dot R_- = \frac{1}{2}\gamma R_- - a(\Omega)\left(R_+^2 + R_-^2\right) R_- + \alpha R_- + \beta R_-^3 = -\dfrac{\partial V}{\partial R_-},$  (2.17)
Figure 1: Top row: the potential V plotted as a function of R+ and R− as the frequency increases from left to right; the R− axis points out of the page, and the scale on the axes is in arbitrary units. Bottom row: the isoclines of the potential V as a function of the oscillator amplitudes R+, R−. The plus sign marks a local maximum, the minus sign a local minimum.
where the dynamics of $R_+, R_-$ can be expressed in terms of a gradient dynamics with the potential

$V = -\frac{1}{4}\gamma\left(R_+^2 + R_-^2\right) + \frac{1}{4}a(\Omega)\left(R_+^2 + R_-^2\right)^2 - \frac{1}{2}\alpha R_-^2 - \frac{1}{4}\beta R_-^4.$  (2.18)
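As a numerical illustration of how this landscape deforms with the control parameter, the short sketch below evaluates V on a grid and lists its local minima for increasing Ω, reusing the illustrative parameter values from the earlier component-level sketch (for which Ωc ≈ 1.73).

```python
import numpy as np

A, B, gamma, alpha, beta = 1.0, 1.0, 1.0, -0.2, 0.5   # illustrative values

def V(Rp, Rm, omega):
    a = (A + 3*B*omega**2) / 8.0                       # a(Omega), see eq. 2.16
    R2 = Rp**2 + Rm**2
    return -0.25*gamma*R2 + 0.25*a*R2**2 - 0.5*alpha*Rm**2 - 0.25*beta*Rm**4

r = np.linspace(-1.6, 1.6, 321)
Rp, Rm = np.meshgrid(r, r, indexing='ij')
for omega in (1.2, 1.6, 1.8, 2.2):
    v = V(Rp, Rm, omega)
    c = v[1:-1, 1:-1]                                  # interior grid points
    is_min = ((c < v[:-2, 1:-1]) & (c < v[2:, 1:-1]) &
              (c < v[1:-1, :-2]) & (c < v[1:-1, 2:]))
    pts = [(round(Rp[i+1, j+1], 2), round(Rm[i+1, j+1], 2))
           for i, j in np.argwhere(is_min)]
    print(f"Omega={omega}: minima at (R+, R-) = {sorted(pts)}")
# Below the critical frequency both mode minima coexist (bistability); above
# it, only the symmetric minima (R- = 0) remain, as in Figure 1.
```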
Note that equations 2.17 correspond to the standard equations of pattern recognition used in the synergetic computer (Fuchs & Haken, 1988; Haken, 1991). A linear stability analysis of equation 2.17 yields the same results as in Haken et al. (1985) and can be graphically presented in terms of the potential V in equation 2.18. In Figure 1 (upper row), the potential V is plotted as a function of R+ and R− as the control parameter Ω increases from left to right. Here the R− axis points out of the page. The corresponding isoclines of V are plotted below in arbitrary units of R+, R−. Below the critical frequency, bistability is present (left two pictures); either the symmetric or the antisymmetric mode can be realized. At the critical frequency Ωc (third picture), the antisymmetric mode becomes unstable, and only the symmetric mode remains for higher frequencies (rightmost picture). In the framework of this article we will use the behavioral mode system presented in equation 2.10 as a guideline to traverse scales of organization from the behavioral level to the brain level.
3 The Level of Brain

First, we briefly review a field-theoretical description of neural activity recently developed by Jirsa and Haken (1996a, 1996b, 1997). Then we specify the neural field equation with respect to the bimanual coordination experiment and discuss it in detail.

3.1 Field-Theoretical Description of Neural Activity. Let us consider an (n − 1)-dimensional closed surface Γ representing the neocortex in an n-dimensional space. This medium Γ shall consist of neural ensembles to which we assign two state variables describing their activity: dendritic currents generated by active synapses cause waves of extracellular fields, which can be measured by the EEG (Freeman, 1992), and intracellular fields measurable by the MEG (Williamson & Kaufman, 1987). Action potentials generated at the somas of neurons correspond to pulses. We call the magnitude of the neural ensemble average of the waves the wave activity $\psi_j(x, t)$ with j = e, i, where the indices distinguish excitatory and inhibitory activity, and the magnitude of the neural ensemble average of the pulses the pulse activity $\psi_j(x, t)$ with j = E, I, distinguishing excitatory and inhibitory pulses. The location in Γ is denoted by x, the time point by t. These scalar quantities are related to each other via conversion operations (Freeman, 1992), which we define as follows:

$\psi_j(x, t) = \int_\Gamma dX\, f_j(x, X)\, H_j(x, X, t),$  (3.1)
where j = e, i, E, I. Here the function $H_j(x, X, t)$ represents the output of a conversion operation and $f_j(x, X)$ the corresponding distribution function depending on the spatial connectivity. From the experimental results of Freeman (1992), it is known that the conversion from wave to pulse is sigmoidal within a neural ensemble; the conversion from pulse to wave is also sigmoidal but is constrained to a linear small-signal range. Assuming that excitatory neurons have only excitatory synapses and inhibitory neurons only inhibitory synapses (which is generally true; Abeles, 1991), we obtain the following relations between conversion output and pulses:

$H_e(x, X, t) = S\!\left[\psi_E\!\left(X, t - \frac{|x - X|}{v}\right)\right] \approx a_e\, \psi_E\!\left(X, t - \frac{|x - X|}{v}\right)$  (3.2)

$H_i(x, X, t) = S\!\left[\psi_I\!\left(X, t - \frac{|x - X|}{v}\right)\right] \approx a_i\, \psi_I\!\left(X, t - \frac{|x - X|}{v}\right),$  (3.3)
and between conversion output and waves:

$H_E(x, X, t) = S_e\!\left[\psi_e\!\left(X, t - \frac{|x - X|}{v}\right) - \psi_i\!\left(X, t - \frac{|x - X|}{v}\right) + p_e\!\left(X, t - \frac{|x - X|}{v}\right)\right]$  (3.4)

$H_I(x, X, t) = S_i\!\left[\psi_e\!\left(X, t - \frac{|x - X|}{v}\right) - \psi_i\!\left(X, t - \frac{|x - X|}{v}\right) + p_i\!\left(X, t - \frac{|x - X|}{v}\right)\right],$  (3.5)
where in the latter Kirchhoff's law was used. External input $p_j(x, t)$ is realized such that afferent fibers make synaptic connections, and thus $p_j(x, t)$ appears only in equations 3.4 and 3.5. Here $a_e, a_i$ are constant parameters denoting synaptic weights, v the propagation velocity, and $S, S_j$ the sigmoid functions of a class j ensemble. Let us now make the following considerations. We are interested in a spatial scale of several cm and a temporal scale of 100 msec, which is relevant in EEG and MEG. Intracortical fibers (excitatory and inhibitory) typically have a length of 0.1 cm; corticocortical (only excitatory) fiber lengths range from about 1 cm to 20 cm (Nunez, 1995). Cortical propagation velocities have a wide range, from 0.2 m/sec (Miller, 1987) up to 6–9 m/sec (Nunez, 1995). With an average velocity of 1 m/sec, this yields propagation delays of 1 msec for the intracortical fibers and 10 msec to 200 msec for the corticocortical fibers. Synaptic delays and refractory times are of the order of 1 msec; the neuronal membrane time constant is in the range of several msec (Braitenberg & Schüz, 1991). From this brief summary, we see that the spatial and temporal scales vary considerably. The distribution of the intracortical fibers is very homogeneous (Braitenberg & Schüz, 1991), but the distribution of the corticocortical fibers is not (estimates are that 40% of all possible corticocortical connections are realized for the visual areas in the primate cerebral cortex; Felleman & Van Essen, 1991). We assume the corticocortical fiber distributions to be homogeneous as a first approximation. Using the discussed temporal and spatial hierarchies, the dynamics of the system (see equations 3.1–3.5) can be systematically reduced (see Jirsa & Haken, 1996a, 1996b, 1997, for details): the fast dynamics (≪ 100 msec) becomes either instantaneous or can be eliminated, and the spatial scales smaller than 1 cm become pointlike. Then the entire dynamics of the system can be described in terms of the slowest variable $\psi_e(x, t)$ and a modified external input now denoted by
p(x, t). The dynamics of $\psi_e(x, t)$ is given by

$\psi_e(x, t) = a_e \int_\Gamma dX\, f_e(x - X)\; S_e\!\left[\rho\, \psi_e\!\left(X, t - \frac{|x - X|}{v}\right) + p\!\left(X, t - \frac{|x - X|}{v}\right)\right],$  (3.6)
where ρ is a density of excitatory fibers, modified due to the elimination of the other variables. Note that from equations 3.1 through 3.5, the models by Wilson and Cowan (1972, 1973) in terms of pulse activities and by Nunez (1974, 1995) in terms of wave activities can be derived and are connected by our approach. Until now, the dimension of the cortical surface has been kept general. Here we want to specify n = 2, meaning that Γ represents a closed one-dimensional loop. Such a geometry has been reported by Nunez (1995) to be a good approximation when macroscopic EEG dynamics is considered under more qualitative aspects of dynamics, like changes of dispersion relations. In the following sections, we will perform a low-dimensional mode decomposition, in which case the chosen geometry suffices for a discussion of the temporal mode dynamics. Using the method of Green's functions (Jirsa & Haken, 1997), the above integral equation 3.6 can be rewritten as a nonlinear partial differential equation,

$\ddot\psi + (\omega_0^2 - v^2\Delta)\,\psi + 2\omega_0\,\dot\psi = a_e\left(\omega_0^2 + \omega_0\dfrac{\partial}{\partial t}\right) S[\rho\,\psi(x, t) + p(x, t)],$  (3.7)
where $\omega_0 = v/\sigma$ and we have dropped the index e. Here we call $\psi(x, t)$ the neural field. The interaction of functional units with the cortical sheet Γ is represented by the external input signals $p_j(x, t)$, where $p(x, t) = \sum_j p_j(x, t)$, and the output signals $\bar\psi_j(t)$. A functional unit can include subcortical structures, such as the projections of the cerebellum on the cortex, or specific functional areas, like the motor cortex. Anatomically these areas are obviously defined by their afferent and efferent fibers connecting to the cortical sheet. In the context of the present theory dealing with dynamics on a larger spatiotemporal scale, that is, wavelengths in the regime of several centimeters, it is more appropriate to identify the spatial localizations of the functional input units with the spatial structures that are generated by the time-dependent input signals $z_j(t)$, open to observation in the EEG/MEG. In the case of a finger movement, this spatial structure $\beta_j(x)$ corresponds to the well-known dipolar mode in the EEG/MEG located over the contralateral motor areas. Thus, such a functional input unit is defined as

$p_j(x, t) = \beta_j(x)\, z_j(t).$  (3.8)
Similarly, an output signal $\bar\psi_j(t)$ sent to noncortical areas is picked up from the cortical sheet according to

$\bar\psi_j(t) = \int_\Gamma dx\, \beta_j(x)\, \psi(x, t),$  (3.9)
where $\beta_j(x)$ defines the spatial localization of the jth functional output unit. In summary, the field-theoretical approach presented here aims at a description of the spatiotemporal brain dynamics on the scale of several centimeters and 100 msec. These scales emphasize the corticocortical connections and allow the derivation of equation 3.6 in one field variable $\psi(x, t)$ governing the spatiotemporal dynamics. Focusing on the dynamical aspects of the interaction of only a few modes in the following sections, a cortical representation of a closed strip is chosen.

3.2 Neural Field Theory of Bimanual Coordination. The neural areas subserving bimanual coordination are numerous and diverse. The cortex, through intracortical connections and long-loop, reciprocal pathways to the basal ganglia and cerebellum, obviously plays a crucial role. Propriospinal and brainstem networks are also involved. Wiesendanger, Wicki, and Rouiller (1994), in a review of lesion studies in humans and nonhuman primates, implicate lateral premotor cortex, supplementary motor area, parietal association cortex, and the anterior corpus callosum (among others) in goal-directed bimanual coordination. Although many kinds of cortical lesions can affect bimanual movements, objective measures of spatiotemporal organization are rare in studies of patient populations. In the context of the work presented in this article, Tuller and Kelso (1989) showed that in-phase and antiphase movements of the fingers were preserved in split-brain patients, whereas other phase relations were much more difficult for split-brain patients to produce compared to normal subjects. Anatomical and physiological evidence for bilateral control of each cortical area may explain why callosal damage and unilateral cortical lesions tend to produce only transient disturbances of bimanual coordination (Wiesendanger et al., 1994). We consider a simplified scheme in which cortical areas interact in a cooperative fashion to produce goal-directed bimanual coordination (see Figure 2). Evidence for bilateral activation of primary motor cortices during a bimanual task in which both index fingers are simultaneously moved (see, e.g., Kristeva, Cheyne, & Deecke, 1991) is consistent with our double representation of "motor areas" in Figure 2. Similarly, the presence of movement-evoked fields in both postcentral cortices corresponding to reafferent activity from the periphery during bimanual movements (Kristeva et al., 1991) justifies the two "sensorimotor areas" in our model. Thus, motor signals are conveyed from the motor areas in the cortical sheet to the individual fingers; sensorimotor signals carrying information about the finger movements are conveyed to the sensorimotor areas of the brain.

Figure 2: Two input units localized at $\beta_{ls}(x)$ (sensorimotor, left hemisphere) and $\beta_{rs}(x)$ (sensorimotor, right hemisphere) and two output units localized at $\beta_{lm}(x)$ (motor, left hemisphere) and $\beta_{rm}(x)$ (motor, right hemisphere) are embedded into a one-dimensional closed neural strip whose activity is described by the field $\psi(x, t)$.

Note that we assign the same index l (r) to the left (right) finger and its contralateral hemisphere in order to keep the notation in the mathematical treatment as simple as possible. Here we deal with the following situation, as shown in Figure 2: two input units localized at $\beta_{ls}(x)$ (left hemisphere, sensorimotor) and $\beta_{rs}(x)$ (right hemisphere, sensorimotor) and two output units localized at $\beta_{lm}(x)$ (left hemisphere, motor) and $\beta_{rm}(x)$ (right hemisphere, motor) are embedded into a one-dimensional closed neural strip. The origin of the underlying coordinate system is located between the two hemispheres, and L is the length of the neural strip. Anatomical considerations imply as a first approximation the following symmetries:

$\beta_{ls}(x) = \beta_{rs}(-x), \qquad \beta_{lm}(x) = \beta_{rm}(-x).$  (3.10)
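The symmetry assumption is easy to realize and test on a discrete ring. The snippet below uses hypothetical Gaussian localization profiles (the specific box-shaped profiles used in the numerical treatment appear later, in section 3.3), verifies equation 3.10 numerically, and assembles a functional input according to equation 3.8.

```python
import numpy as np

L, N = 1.0, 512
x = (np.arange(N) + 0.5)/N * L - L/2      # cell-centered grid; x -> -x is array reversal

def bump(center, width=0.05):
    d = (x - center + L/2) % L - L/2      # periodic distance on the closed strip
    return np.exp(-0.5 * (d / width)**2)

beta_rs = bump(+0.2)                       # hypothetical right-hemisphere profile
beta_ls = beta_rs[::-1]                    # mirror image: beta_ls(x) = beta_rs(-x)
print(np.allclose(beta_ls, bump(-0.2)))    # -> True, eq. 3.10 holds

# functional input unit, eq. 3.8: p_j(x, t) = beta_js(x) * z_j(t)
t = 0.1
z_l = np.sin(2*np.pi*2.0*t)                # toy left-finger signal at 2 Hz
p_l = beta_ls * z_l
```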
The output signal, here the motor field, is defined according to equation 3.9 and is conveyed to the corresponding finger, where $z_l(t) = x_1(t)$ and $z_r(t) = x_2(t)$ denote the extensions of the left and right finger movements, respectively.
The motor movement $z_j(t)$ with j = l, r shall be described phenomenologically as a function f of the motor field $\bar\psi_j(t)$ and, in order to take phase shifts into account, its derivative $\dot{\bar\psi}_j(t)$:

$z_j(t) = f(\bar\psi_j(t), \dot{\bar\psi}_j(t)) \approx f_0 + f_1\,\bar\psi_j(t) + f_2\,\dot{\bar\psi}_j(t) + \cdots,$  (3.11)
where $f(\bar\psi_j(t), \dot{\bar\psi}_j(t))$ denotes a nonlinear function, which we expand into a Taylor series in terms of $\bar\psi_j(t)$ and $\dot{\bar\psi}_j(t)$ and truncate after the linear terms as an approximation. The constant $f_0$ describes a constant amplitude shift and can be set to zero for a rhythmic movement. (See also Jirsa & Haken, 1997, for a treatment of the sensorimotor feedback loop in terms of oscillators.) In the following discussion we will consider two limiting cases: first $f_1 \neq 0$, $f_2 = 0$; later we return to the case $f_1 = 0$, $f_2 \neq 0$. With equations 3.9 and 3.11, and $f_1 \neq 0$, $f_2 = 0$, the sensorimotor feedback is now given by

$p_j(x, t) = \beta_{js}(x)\, z_j(t) = c_0\, \beta_{js}(x) \int_{-L/2}^{L/2} \beta_{jm}(x)\, \psi(x, t)\, dx,$  (3.12)
and the feedback loops of the motor and sensorimotor units are closed. We are now in a position to specify the field equation 3.7 as follows:

$\ddot\psi + (\omega_0^2 - v^2\Delta)\,\psi + 2\omega_0\,\dot\psi = a_e\left(\omega_0^2 + \omega_0\dfrac{\partial}{\partial t}\right) S[\rho\psi + p_l(x, t) + p_r(x, t)].$  (3.13)

The field $\psi(x, t)$ can also always be written in terms of symmetric and antisymmetric contributions,

$\psi(x, t) = \underbrace{\tfrac{1}{2}\left(\psi(x, t) + \psi(-x, t)\right)}_{\psi^+(x,t)} + \underbrace{\tfrac{1}{2}\left(\psi(x, t) - \psi(-x, t)\right)}_{\psi^-(x,t)}.$  (3.14)
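Before reducing equation 3.13 to mode equations, it may help to see that it can be integrated directly. The following sketch is a bare-bones explicit finite-difference integrator on a periodic strip, a simplified stand-in for the authors' semi-implicit scheme of section 3.3 (which additionally smooths the unit edges and adds mean-field and frequency-dependent damping). It borrows the parameter values and box-shaped units quoted there; the printed amplitudes simply let one monitor how much of the field is symmetric versus antisymmetric.

```python
import numpy as np

# parameters from section 3.3 (space unit ~ 1 m, time unit ~ 100 msec)
L, N, dt = 1.0, 128, 0.01
w0, v, ae, a = 2*np.pi*0.1, 0.152, 1.0, 0.4
rho, c0, eps = 0.5, 2.0, 0.025
dx = L / N
x = (np.arange(N) + 0.5)/N * L - L/2        # cell-centered: psi[::-1] is psi(-x)

def S(u):                                   # logistic conversion function, eq. 3.16
    return 1.0/(1.0 + np.exp(-4*a*u)) - 1.0/(1.0 + np.exp(4*a))

beta_r = np.where((x >= -eps) & (x <= L/4), 2*np.pi/L, 0.0)   # eq. 3.29
beta_l = beta_r[::-1]                                          # eq. 3.30 via eq. 3.10

psi = 1e-3 * np.sin(2*np.pi*x/L)            # small antisymmetric seed
psi_prev, S_prev = psi.copy(), S(rho*psi)
for n in range(20000):
    # sensorimotor feedback, eq. 3.12 (motor = sensorimotor profile, as in sec. 3.3)
    p = c0*beta_l*np.sum(beta_l*psi)*dx + c0*beta_r*np.sum(beta_r*psi)*dx
    Sn = S(rho*psi + p)
    lap = (np.roll(psi, 1) - 2*psi + np.roll(psi, -1)) / dx**2
    rhs = (v**2*lap - w0**2*psi - 2*w0*(psi - psi_prev)/dt
           + ae*(w0**2*Sn + w0*(Sn - S_prev)/dt))
    psi_prev, psi, S_prev = psi, 2*psi - psi_prev + dt**2*rhs, Sn
    if n % 4000 == 0:
        sym, asym = 0.5*(psi + psi[::-1]), 0.5*(psi - psi[::-1])
        print(n, float(np.abs(sym - sym.mean()).max()), float(np.abs(asym).max()))
```

Without the extra damping terms the homogeneous mode drifts toward a nonzero fixed point (S[0] is not zero), which is precisely why the authors pin the mean field in their own simulation.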
Next, we make an assumption about the spatial pattern underlying the temporal dynamics of ψ + (x, t) and ψ − (x, t). In the bimanual coordination experiment, a transition from one pattern to another is observed on the behavioral level. In this case the theory of dynamical systems, in particular synergetics (Haken, 1983, 1987), predicts low-dimensional behavior of the system under consideration, and we expect to observe low-dimensional transition phenomena on the brain level too. Further, in previous analyses of brain-behavior experiments (Fuchs et al., 1992; Jirsa et al., 1995) involving behavioral transitions during coordination with external stimuli, low-dimensional spatiotemporal brain dynamics was found and could be described in terms of two spatial modes. Here we deal with the different situation of the coordination of two limbs. But since a similar phase transition
in behavior has also been observed, we assume to a first approximation that each contribution $\psi^+(x, t)$ and $\psi^-(x, t)$ is dominated by one spatial pattern and factorizes,

$\psi^+(x, t) \approx g_+(x)\,\psi_+(t), \qquad \psi^-(x, t) \approx g_-(x)\,\psi_-(t),$  (3.15)
if only standing waves are present. The assumption of two dominating spatial modes is crucial for the following mathematical analysis and is experimentally easy to test. If this assumption is confirmed, higher-order structures of the dynamics can be included following the lines of Jirsa et al. (1995), in which the reproduction of the experimental spatiotemporal signal could be improved by adding more spatial modes whose temporal dynamics depends only on the dynamics of the prior modes. The ansatz (see equation 3.15) can also be expanded for higher dimensions or traveling waves, but this will increase the complexity of the analytical treatment considerably and in most cases will not lead to the same results as in the case of two dominating spatial modes. For these reasons, the hypothesis in equation 3.15 is the first to be tested experimentally. A complex system whose dynamics is governed by a nonlinear evolution law may perform phase transitions from one stationary state to another when a control parameter is varied. Close to the transition point, the dynamics of this system is governed by the first leading orders of the nonlinearities (Haken, 1983, 1987). Here we express the sigmoid function S[n] by the logistic function

$S[n] = \dfrac{1}{1 + \exp(-4an)} - \dfrac{1}{1 + \exp(4a)},$  (3.16)
where a denotes the sensitivity coefficient of response of the corresponding neural population. We expand equation 3.16 into a Taylor series up to third order in n and obtain

$S[n] \approx a n - \frac{4}{3}a^3 n^3,$  (3.17)
which is a good approximation for values of an smaller than 1. Projecting the neural field equation 3.13 onto $g_+(x)$ and $g_-(x)$, following the lines in Jirsa and Haken (1996a, 1997), we obtain

$\ddot\psi_+ + 2\omega_0\dot\psi_+ + \Omega_+^2\psi_+ = a_e\left(\omega_0^2 + \omega_0\frac{\partial}{\partial t}\right)\left[a f_{10}\psi_+ - \frac{4}{3}a^3\left(f_{30}\psi_+^3 + 3 f_{12}\psi_-^2\psi_+\right)\right]$  (3.18)
$\ddot\psi_- + 2\omega_0\dot\psi_- + \Omega_-^2\psi_- = a_e\left(\omega_0^2 + \omega_0\frac{\partial}{\partial t}\right)\left[a f_{01}\psi_- - \frac{4}{3}a^3\left(f_{03}\psi_-^3 + 3 f_{21}\psi_+^2\psi_-\right)\right],$  (3.19)

where

$\Omega_i^2 = \omega_0^2 - v^2 \int_{-L/2}^{L/2} \Delta g_i(x)\, dx$  with  $i = +, -,$  (3.20)
and the terms $f_{ij}$ with i, j = 0, …, 3 are constant parameters, which are given in the appendix. So far we have tackled the level of the brain. Let us now traverse the scales of organization by referring back to the behavioral level, where the equations governing the behavioral dynamics are known from equation 2.10. Here we take these equations as a guide to obtain conditions that restrict the solution space of equations 3.18 and 3.19. In equation 2.8 we expressed the behavioral modes in terms of finger displacements. With equation 3.11, we can express the behavioral modes in terms of the neural field:

$\tilde\psi_+(t) = z_l(t) + z_r(t) = c_0 \int_{-L/2}^{L/2} \left(\beta_{lm}(x) + \beta_{rm}(x)\right)\psi(x, t)\, dx$  (3.21)

$\tilde\psi_-(t) = z_l(t) - z_r(t) = c_0 \int_{-L/2}^{L/2} \left(\beta_{lm}(x) - \beta_{rm}(x)\right)\psi(x, t)\, dx.$  (3.22)
If we take our hypothesis of two dominating spatial modes (equation 3.15) into account, the behavioral modes can be expressed as

$\tilde\psi_+(t) = c_0\,\psi_+(t) \int_{-L/2}^{L/2} \left(\beta_{lm}(x) + \beta_{rm}(x)\right) g_+(x)\, dx = c_+\,\psi_+(t)$  (3.23)

$\tilde\psi_-(t) = c_0\,\psi_-(t) \int_{-L/2}^{L/2} \left(\beta_{lm}(x) - \beta_{rm}(x)\right) g_-(x)\, dx = c_-\,\psi_-(t),$  (3.24)
where $c_+, c_-$ are constants. In the present framework, it turns out that in the bimanual coordination case, the symmetric (antisymmetric) behavioral mode is proportional to the symmetric (antisymmetric) brain mode. As a consequence, the dynamical system of the behavioral modes (see equation 2.10) and the system of the brain modes (see equations 3.18 and 3.19) should be equivalent. This requires $\omega_0^2 \ll \omega_0\,\partial/\partial t$, which implies on the considered time scales (see also Jirsa and Haken, 1997, where this limit was
used) that the mean corticocortical fiber length is large. We rewrite equations 3.18 and 3.19 as follows:

$\ddot\psi_+ - \gamma\dot\psi_+ + \Omega_+^2\psi_+ + b\,\dfrac{\partial}{\partial t}\left(f_{30}\psi_+^3 + 3 f_{12}\psi_-^2\psi_+\right) = 0$

$\ddot\psi_- - \gamma\dot\psi_- + \Omega_-^2\psi_- + b\,\dfrac{\partial}{\partial t}\left(f_{30}\psi_-^3 + 3 f_{21}\psi_+^2\psi_-\right) = 2\dot\psi_-\left(\alpha + \beta\psi_-^2\right),$  (3.25)

where

$b = \frac{4}{3}a_e a^3\omega_0, \qquad \gamma = a_e a\,\omega_0 f_{10} - 2\omega_0, \qquad \alpha = \frac{1}{2}a_e a\,\omega_0\,(f_{01} - f_{10}), \qquad \beta = -2 a_e a^3\omega_0\,(f_{03} - f_{30}).$  (3.26)
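The identifications in equation 3.26 can be checked symbolically. The sympy sketch below (our own verification) drops the ω0² term on the right-hand sides of equations 3.18 and 3.19, corresponding to the limit ω0² ≪ ω0 ∂/∂t invoked above, and confirms that the result coincides with equation 3.25 under the parameter definitions 3.26.

```python
import sympy as sp

t = sp.symbols('t')
ae, a, w0 = sp.symbols('a_e a omega_0', positive=True)
Op, Om = sp.symbols('Omega_p Omega_m')
f10, f01, f30, f03, f12, f21 = sp.symbols('f10 f01 f30 f03 f12 f21')
pp, pm = sp.Function('psi_p')(t), sp.Function('psi_m')(t)

b     = sp.Rational(4, 3)*ae*a**3*w0
gamma = ae*a*w0*f10 - 2*w0
alpha = sp.Rational(1, 2)*ae*a*w0*(f01 - f10)
beta  = -2*ae*a**3*w0*(f03 - f30)

# eqs. 3.18 and 3.19 with the omega_0^2 term on the RHS dropped
r18 = (pp.diff(t, 2) + 2*w0*pp.diff(t) + Op**2*pp
       - ae*w0*sp.diff(a*f10*pp
                       - sp.Rational(4, 3)*a**3*(f30*pp**3 + 3*f12*pm**2*pp), t))
r19 = (pm.diff(t, 2) + 2*w0*pm.diff(t) + Om**2*pm
       - ae*w0*sp.diff(a*f01*pm
                       - sp.Rational(4, 3)*a**3*(f03*pm**3 + 3*f21*pp**2*pm), t))

# eqs. 3.25 with the parameter identifications 3.26
m1 = (pp.diff(t, 2) - gamma*pp.diff(t) + Op**2*pp
      + b*sp.diff(f30*pp**3 + 3*f12*pm**2*pp, t))
m2 = (pm.diff(t, 2) - gamma*pm.diff(t) + Om**2*pm
      + b*sp.diff(f30*pm**3 + 3*f21*pp**2*pm, t)
      - 2*pm.diff(t)*(alpha + beta*pm**2))

print(sp.simplify(sp.expand(r18 - m1)))  # -> 0
print(sp.simplify(sp.expand(r19 - m2)))  # -> 0
```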
The system in equation 3.25 is structurally equivalent to the dynamical system for the behavioral modes (see equation 2.10). Note that in the latter system the Rayleigh terms with the parameter B were introduced in order to obtain a frequency dependence of the oscillator amplitude (see equation 2.6). An alternative way to achieve this dependence is to introduce frequency-dependent parameters in equation 2.10, for example, $A(\Omega) = A + 3B\Omega^2$ or $\gamma = \gamma(\Omega)$, which yields the same results as a Rayleigh term. If $f_{12} = f_{21}$ and $\Omega_+ = \Omega_-$, then the left-hand side of equation 3.25 represents a fully symmetric system with respect to the coupling terms. The right-hand side of the second equation in equation 3.25 is a consequence of the difference between the spatial overlap of the functional units with the symmetric brain mode and their overlap with the antisymmetric brain mode. From the behavioral mode system in equation 2.10, we know the necessary condition α < 0, β > 0, which leads to the nontrivial restriction $f_{01} < f_{10}$, $f_{03} < f_{30}$ and implies a greater spatial overlap of the considered functional units with the symmetric brain mode. We examine this possibility experimentally shortly. Let us turn to the second limiting case, $f_1 = 0$, $f_2 \neq 0$, in equation 3.11. The subsequent treatment of the field equation 3.13 is in full analogy to the prior limiting case, $f_1 \neq 0$, $f_2 = 0$. The main difference is that the sensorimotor feedback in equation 3.12 now follows the first derivative $\dot\psi(x, t)$ of the neural field. Performing the same calculations as before, we obtain

$\ddot\psi_+ - \gamma\dot\psi_+ + \Omega_+^2\psi_+ + b\left(f_{30}\dot\psi_+^3 + 3 f_{12}\dot\psi_-^2\dot\psi_+\right) = 0$

$\ddot\psi_- - \gamma\dot\psi_- + \Omega_-^2\psi_- + b\left(f_{30}\dot\psi_-^3 + 3 f_{21}\dot\psi_+^2\dot\psi_-\right) = 2\dot\psi_-\left(\alpha + \beta\dot\psi_-^2\right),$  (3.27)
with

$b = \frac{4}{3}a_e a^3\omega_0^2, \qquad \gamma = a_e a\,\omega_0^2 f_{10} - 2\omega_0, \qquad \alpha = \frac{1}{2}a_e a\,\omega_0^2\,(f_{01} - f_{10}), \qquad \beta = -\frac{2}{3}a_e a^3\omega_0^2\,(f_{03} - f_{30}),$  (3.28)
and the two conditions $\rho \ll c_0$ and $\Omega\,\partial/\partial t \ll \omega_0^2$, implying a shorter mean corticocortical fiber length than in the prior case. Under these conditions we have only Rayleigh terms present in the nonlinearities of the left-hand side of equation 3.27. The right-hand side is a modified HKB coupling where the nonlinear coupling term is given by the first derivative $\dot\psi_-(t)$. This modification preserves the bistability of the system in equation 3.27 but leads to an effective parameter β dependent on the frequency Ω, such that the critical transition frequency cannot be reached for constant parameters in equation 3.28. Summarizing, both limiting cases in equation 3.11 lead to slightly modified mode equations, 3.25 and 3.27, in comparison to equation 2.10. The first equations provide the correct coupling terms, but the Rayleigh terms are absent. The second provide the Rayleigh terms, but not the correct HKB coupling. Both problems can be overcome by the introduction of frequency-dependent parameters. Further, a combination of the two limiting cases in equation 3.11, $f_1 \neq 0$ and $f_2 \neq 0$, introduces Rayleigh terms and the correct HKB coupling, but also further terms. Here, as always in nonlinear complex systems, the choice of the parameters determines what kind of behavior will be observed.

3.3 Numerical Treatment. In order to illustrate our more general results of the previous section, we choose a simple example for a specific set of brain modes and localized functional units. The motor and sensorimotor areas on the right hemisphere are localized as follows:

$\beta_{rm}(x) = \beta_{rs}(x) = \begin{cases} 2\pi/L & -\epsilon \le x \le L/4 \\ 0 & \text{otherwise} \end{cases}$  (3.29)
− L4 ≤ x ≤ ² , otherwise
(3.30)
and on the left hemisphere, ½ βlm (x) = βls (x) =
2π L
0
and satisfy the required anatomical symmetry of equation 3.10. For reasons of simplicity, we choose the motor and sensorimotor units on the same hemisphere to be identical in the numerical treatment. The sensorimotor feedback is specified for the first limiting case, $f_1 \neq 0$, $f_2 = 0$, according to equation 3.11. We introduce a frequency-dependent function $\gamma_0(\Omega)$ into the linear damping $G_0(\Omega) = 2\omega_0 + \gamma_0(\Omega)$ on the left-hand side of equation 3.13, which causes a frequency dependence of the wave amplitude and thus a frequency dependence of the finger movements, as experimentally observed. For $\gamma_0(\Omega) = 0$, the original field equation is recovered. In our specific example, equations 3.29 and 3.30, the mean field $\bar\psi(x, t)$, equivalent to the homogeneous mode, has to remain constant. To ensure this, we introduce a linear mean-field damping $\dot{\bar\psi}(x, t)$ into equation 3.13. Using a semi-implicit
forward-time-central-space procedure, we integrate the field equation 3.13 numerically. The functional units are localized according to equations 3.29 and 3.30, and the edges of the localization functions were smoothed for reasons of numerical stability. Periodic boundaries were used. The parameters used in the numerical simulations are: $\omega_0 = 2\pi \cdot 0.1$, $v = 0.152$, $a_e = 1$, $a = 0.4$, $\rho = 0.5$, $c_0 = 2$, extension of the neural strip L = 1, and spatial overlap of the localization functions $\epsilon = 0.025$. Here the space unit corresponds to 1 m and the time unit to 100 msec. This parameter range is realistic: the corticocortical propagation velocity v is in the 1 m/sec range, the extension of the neural sheet L is in the 1 m range, and the long-range connectivity $\sigma = v/\omega_0$ is in the 10 cm range. In the bottom left corner of Figure 3, the localization of the functions $\beta_{li}(x)$, $\beta_{ri}(x)$ with i = m, s is shown within the neural strip; above that, the symmetric mode $g_+(x)$ and the antisymmetric mode $g_-(x)$ are shown. Here the space unit used is m. On the right-hand side of Figure 3, four rows, each consisting of a space-time plot and a time series, are given. In each space-time plot, the field $\psi(x, t)$ is plotted with the spatial domain x vertical and the temporal domain t horizontal, as indicated by the arrows in the top left corner. In the graphs under the space-time plots, the field $\psi(x, t)$ is projected onto the symmetric mode $g_+(x)$ (dashed line) and the antisymmetric mode $g_-(x)$ (solid line), plotted over time t (in sec). The first two rows describe the pretransition situation, with $\gamma_0(\Omega) = 0$. Two possible states are shown: in the top row, the antisymmetric mode $g_-(x)$ dominates; in the second row, the symmetric mode $g_+(x)$ dominates. Increasing the damping to $\gamma_0(\Omega) = 0.3$, the antisymmetric mode becomes unstable and performs a transition to the symmetric state. This transition is shown in the third row. The symmetric mode remains stable in the posttransition regime, as can be seen in the bottom row. Hence, for $\epsilon > 0$, bistability is present in the pretransition regime and monostability in the posttransition regime. For $\epsilon = 0$, no transition is observed, and bistability is preserved for the entire control parameter regime.

Figure 3: The overlap of the localization functions $\beta_{ij}(x)$ is shown in the bottom left corner. The activity of the field $\psi(x, t)$ is plotted over the time t (horizontal) and the space x (vertical), together with the time series of the symmetric mode $g_+(x)$ (dashed line) and the antisymmetric mode $g_-(x)$ (solid line). The top two rows correspond to the pretransition region, $\gamma_0(\Omega) = 0$; the bottom two rows to the posttransition region, $\gamma_0(\Omega) = 0.3$.

3.4 Experimental Predictions. From our results in the previous sections, we can predict some of the gross features to be observed in the macroscopic spatiotemporal brain dynamics measured by EEG or MEG in a bimanual coordination experiment:

• The main brain dynamics should be dominated by one spatial pattern in the in-phase situation, as well as in the antiphase situation (see equation 3.15).

• In the in-phase situation, the spatial pattern will be symmetric with respect to reflection at the plane between the two hemispheres; in the antiphase situation, it will be antisymmetric. Note that here the terms symmetric and antisymmetric apply to current distributions and hence to patterns to be observed in the EEG. Current flowing in apical dendrites of pyramidal cells generates a magnetic field in a plane orthogonal to the currents. Thus, spatially antisymmetric (symmetric) patterns in the MEG correspond to symmetric (antisymmetric) current distributions.

• By increasing the cycling frequency of the finger movements, the antiphase pattern should remain stable up to the critical frequency, where the transition in the finger movements is observed. Here a transition from the antisymmetric to the symmetric pattern should be observed. Beyond the critical frequency, the symmetric spatial pattern is stable.

• Below and beyond the critical frequency, the temporal dynamics of the symmetric and the antisymmetric pattern should be dominated by an oscillation with the cycling frequency of the finger movements. Higher odd harmonics of the cycling frequency should be visible in
the frequency spectra of the brain signals. Even harmonics might also be observed if the Taylor expansion of the sigmoid function in equation 3.17 yields even-order terms.

• The parameters $f_{01}$, $f_{10}$, $f_{30}$, and $f_{03}$ defined in the appendix have to satisfy the conditions $f_{01} < f_{10}$ and $f_{03} < f_{30}$. Note that these conditions are inverted for the MEG.

3.5 Preliminary Experimental Test of Theoretical Predictions. A brain-behavior experiment has been performed by Kelso, Fuchs, Holroyd, Cheyne, and Weinberg (1994) in which subjects moved their left and right index fingers in time with an auditory metronome presented to both ears in ascending frequency plateaus of 10 cycles each. The initial metronome frequency was 2.0 Hz and increased by 0.2 Hz for each plateau, with a total number of eight plateaus. The subject was instructed to move the right finger antiphase and the left finger in-phase with the metronome, and switched spontaneously to a movement pattern with both fingers in-phase as the metronome frequency increased to the fifth plateau (2.8 Hz). During the experiment, magnetic field measurements of brain activity were obtained with a whole-head, 64-channel SQUID magnetometer at a sampling frequency of 250 Hz with a 40 Hz low-pass filter. Each subject performed 5 blocks of 10 runs each. In the following, we check these experimental data against the theoretical predictions. Note that the experimental setup (auditory metronome) and the subject's instructions differ somewhat from the experimental conditions of this article. Thus, the following experimental results have to be considered a preliminary test of the presented theory. In order to test the predictions in section 3.4, we perform a Karhunen-Loève (KL) decomposition (see, e.g., Fuchs et al., 1992) of the MEG data on each frequency plateau separately. A KL decomposition decomposes a spatiotemporal signal ψ(x, t) into orthogonal spatial modes and corresponding time-dependent amplitudes such that a least-square error E is minimized and the KL modes have maximum variance (see Makeig, Bell, Jung, & Sejnowski, 1996, for higher-order approaches—independent component analysis—using mutual information). The normalized KL eigenvalue λ = 1 − E is a measure of the contribution of a KL mode to the entire spatiotemporal signal. Figure 4a shows the first KL mode (top row) and the second KL mode (bottom row), together with the frequency spectra (below the modes) of their corresponding time series for each plateau. The plateau number n increases from left to right. The orientation of the modes is such that the nose is at the top, and their color coding (after normalization) is given on the right of Figure 4. The vertical dotted lines in the frequency spectra denote the movement frequency and its higher harmonics. It turns out that in almost all these spectra (except plateau 1), the temporal dynamics is strongly dominated by the movement frequency. Further, except for the first plateau, we observe a spatial structure in the two KL modes, which
Figure 4: (a) The first two KL modes of the entire spatiotemporal brain signal over the eight plateaus, shown together with their frequency spectra (below, in arbitrary units). (b) The first KL modes of the symmetric and antisymmetric decomposition of the brain signals, shown over all plateaus together with their frequency spectra. (c) Top: The first KL modes from Figure 4b on plateaus 8 and 2. Bottom: Sensorimotor-motor modes for self-paced left and right finger movement.
is constant before the transition (plateau 5). After the behavioral transition (plateaus 6 to 8), a different structure (increased activity on the right-hand side of the mode contributing to a more antisymmetric shape) is observed in the first KL mode, which is similar over the posttransition plateaus. In the pretransition region, the first KL mode is about 70% antisymmetric and 30% symmetric, whereas after the transition, the antisymmetric contribution increases to 90% and the symmetric contribution decreases to 10%. The contributions λ1 of the two KL modes to the entire spatiotemporal signal are plotted over the eight plateaus in Figure 5a (top left). Here the eigenvalue of the first KL mode is represented by a dotted line, the one of the
second KL mode by a solid line, and the contribution of the sum of the two KL modes by a bold line. Approximately 70% of the entire spatiotemporal signal can be represented by the sum of the first two KL modes, both before and after the transition. Before the transition, this representation is given as a superposition of the two modes; after the transition, the first KL mode carries most of the variance (about 60%) and dominates the dynamics. The spatial structure in the posttransition regime seems to show a more symmetric current distribution than in the pretransition regime. In order to test this in more detail, we decompose the entire spatiotemporal signal into symmetric contributions $\psi^+(x, t)$ and antisymmetric contributions $\psi^-(x, t)$ according to equation 3.14. We decompose these two signals into their KL modes and plot the first KL mode of each signal over the eight plateaus (see Figure 4b), where the top row shows the first KL mode of $\psi^-(x, t)$ and the bottom row that of $\psi^+(x, t)$, together with their frequency spectra. These KL modes of $\psi^-(x, t)$ and $\psi^+(x, t)$ are entirely dominated by the movement frequency. Note that the temporal activity of the symmetric contributions $\psi^+(x, t)$ decreases to almost zero, whereas the temporal activity of $\psi^-(x, t)$ shows an increase after the transition. In Figure 5a (bottom left) their KL eigenvalues λ3 are plotted over the plateaus (dotted: antisymmetric; solid: symmetric). Here we see that both KL eigenvalues are around 50% before the transition. After the transition, the KL eigenvalue of the first symmetric KL mode drops to 30%, and that of the antisymmetric KL mode increases to 75%. A similar behavior can be seen in the graph next to it, where λ4 describes the contribution of these KL modes to the entire spatiotemporal signal ψ(x, t) (dotted: antisymmetric; solid: symmetric; bold: antisymmetric + symmetric). In order to test how well ψ(x, t) can be reconstructed by two spatial modes that are fixed over all plateaus (see equation 3.15), we choose two representatives: the first KL mode $g_-(x)$ of $\psi^-(x, t)$ on plateau 8 and the first KL mode $g_+(x)$ of $\psi^+(x, t)$ on plateau 2. These two modes are plotted again in Figure 4c (top). The contribution λ2 of these two fixed modes to the entire spatiotemporal signal ψ(x, t) is plotted in Figure 5a (top right; dotted: antisymmetric; solid: symmetric; bold: both). The representation of ψ(x, t) by these two fixed modes is about 30% in the pretransition region, where both modes have similar contributions. In the posttransition region, the mode $g_-(x)$ dominates and contributes about 50% to ψ(x, t). The degree of symmetry s of the first KL mode (in Figure 4a) is quantified and plotted over the plateau number n (bold: antisymmetric; solid: symmetric) in Figure 5b. In Figure 5c the time series of the first KL mode (bold line) of ψ(x, t) are plotted for plateau 2 (top row) and plateau 8 (bottom row). The auditory metronome (dotted line) is also plotted over time t. Both time series oscillate in phase with the movement frequency before and after the transition, which is consistent with our theoretical predictions, since the transition phenomena are captured by the transition of the spatial modes. Additionally, the finger movements (solid line) are plotted in the same graph, where the left finger movement is always in phase with the metronome according to the
Figure 5: (a) The contributions λ of the single modes to the brain signals as a function of the plateau number n. λ1: KL eigenvalues from Figure 4a (dotted: first KL mode; solid: second KL mode; bold: both). λ2: contributions of the two fixed spatial modes in Figure 4c (dotted: antisymmetric; solid: symmetric; bold: both). λ3: KL eigenvalues from Figure 4b (dotted: antisymmetric; solid: symmetric). λ4: contribution of the KL modes from Figure 4b to the entire brain signals (dotted: antisymmetric; solid: symmetric; bold: both). (b) The degree of symmetry s of the first KL mode, quantified and plotted over the plateau number n (bold: antisymmetric; solid: symmetric). (c) Time series on plateau 2 (top) and plateau 8 (bottom) (bold: first KL mode from Figure 4a; dotted: auditory stimulus; solid: finger movements). The amplitudes are scaled in arbitrary units; the time units are seconds.
subject’s instructions. The right finger movement is antiphase before and in-phase after the transition. From these results, we conclude that our first prediction is partially confirmed: the spatiotemporal dynamics of the brain signals in the pretransition region is equally dominated by the two first KL modes and not by one spatial mode only; after the transition, the first KL mode is dominant as ex-
pected. The second, third, and fourth predictions are confirmed. We observe a transition-like behavior in the brain signals, where the contribution of a symmetric current distribution in the posttransition regime is significantly larger. Further, the brain signals are strongly dominated by the movement frequency of the fingers. In a control condition, the subject performed self-paced rhythmic finger movements with each finger separately at a frequency of 1.5 Hz. During these movements, the magnetic brain activity was also recorded. The spatiotemporal brain dynamics is such that with movement onset, a spatial dipolar structure arises contralaterally and decays after an oscillation with a time period of about 200 msec. Here we do not have the possibility of distinguishing between sensorimotor and motor units, since both are involved in this task, and further controls were not available. Due to the proximity of the sensorimotor and motor units and the large spatial scale (of centimeters) under consideration, we approximate both units by the same spatial structure for this first preliminary analysis. We identify this spatial structure with the dipolar mode apparent during the self-paced movement. Figure 4c (bottom) shows these modes represented by their first KL mode, which contributes 70% to the entire brain signal. Here the left mode shows a strong dipolar structure on the right-hand side, corresponding to a left finger movement, and, by analogy, the right mode on the left-hand side, corresponding to a right finger movement. We identify these modes with the functional units involved (left mode: $\beta_{rs}(x)$, $\beta_{rm}(x)$; right mode: $\beta_{ls}(x)$, $\beta_{lm}(x)$). With the above identifications of $g_+(x)$, $g_-(x)$, we can determine the parameters defined in the appendix: $f_{01} = 0.57$, $f_{10} = 0.20$, $f_{30} = 3.5\times10^{-4}$, and $f_{03} = 0.02$ for $c_0 = 1$, $\rho = 0.2$. These parameters satisfy the required condition $f_{01} > f_{10}$ and $f_{03} > f_{30}$ for the MEG. This condition appears to be robust against small changes of the spatial modes and functional units; however, it is violated by larger changes, such as rotation of $g_-(x)$ by π/2.

4 Summary

The main point of this article is to show how it is possible to derive the phenomenological nonlinear laws at the behavioral level from models describing brain activity. For the paradigmatic case of bimanual coordination, we briefly reviewed the collective and component levels of description of the behavioral dynamics. We proceeded by transforming the behavioral model on the component level onto a model describing the dynamics of the behavioral modes. The behavioral level was connected to the brain level by deriving the behavioral model on the mode level from a recently developed field-theoretical description of brain activity. For the derivation, the crucial points were the assumption of bimodal brain dynamics and the interplay between functional input and output units in the neural sheet. Here the comparison of the behavioral and brain levels serves as a guide to the consistency of both descriptions. We made theoretical predictions about the global brain
dynamics and presented a preliminary experimental test that gives strong indications that the predicted spatiotemporal dynamics is present during bimanual coordination tasks.

Appendix: Explicit Forms of Coupling Integrals $f_{ij}$

We define

$f_+(x) = \rho\, g_+(x) + c_0\left(\beta_{ls}(x) + \beta_{rs}(x)\right) \int_{-L/2}^{L/2} \beta_{lm}(x)\, g_+(x)\, dx$  (A.1)

$f_-(x) = \rho\, g_-(x) + c_0\left(\beta_{ls}(x) - \beta_{rs}(x)\right) \int_{-L/2}^{L/2} \beta_{lm}(x)\, g_-(x)\, dx$  (A.2)
and give the explicit forms of the coupling integrals $f_{ij}$ in equations 3.18 and 3.19 as

$f_{10} = \int_{-L/2}^{L/2} g_+(x)\, f_+(x)\, dx, \qquad f_{01} = \int_{-L/2}^{L/2} g_-(x)\, f_-(x)\, dx$  (A.3)

$f_{30} = \int_{-L/2}^{L/2} g_+(x)\, f_+(x)^3\, dx, \qquad f_{03} = \int_{-L/2}^{L/2} g_-(x)\, f_-(x)^3\, dx$  (A.4)

$f_{12} = \int_{-L/2}^{L/2} g_+(x)\, f_+(x)\, f_-(x)^2\, dx, \qquad f_{21} = \int_{-L/2}^{L/2} g_-(x)\, f_+(x)^2\, f_-(x)\, dx.$  (A.5)
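To make these definitions concrete, the sketch below evaluates the coupling integrals numerically for toy mode shapes (g+ even, g− odd on the strip) and the box-shaped units of equations 3.29 and 3.30, with motor and sensorimotor profiles taken identical as in section 3.3. The resulting numbers are purely illustrative; the values quoted in section 3.5 were computed from measured MEG modes.

```python
import numpy as np

L, N, rho, c0, eps = 1.0, 4096, 0.2, 1.0, 0.025      # c0, rho as in section 3.5
x = (np.arange(N) + 0.5)/N * L - L/2
dx = L / N
I = lambda arr: float(np.sum(arr) * dx)               # quadrature on the strip

g_p = np.sqrt(2/L) * np.cos(2*np.pi*x/L)              # toy symmetric mode
g_m = np.sqrt(2/L) * np.sin(2*np.pi*x/L)              # toy antisymmetric mode
beta_r = np.where((x >= -eps) & (x <= L/4), 2*np.pi/L, 0.0)  # eq. 3.29
beta_l = beta_r[::-1]                                         # eq. 3.30

f_p = rho*g_p + c0*(beta_l + beta_r)*I(beta_l*g_p)    # eq. A.1 (beta_lm = beta_ls)
f_m = rho*g_m + c0*(beta_l - beta_r)*I(beta_l*g_m)    # eq. A.2

f10, f01 = I(g_p*f_p),    I(g_m*f_m)                  # eq. A.3
f30, f03 = I(g_p*f_p**3), I(g_m*f_m**3)               # eq. A.4
f12, f21 = I(g_p*f_p*f_m**2), I(g_m*f_p**2*f_m)       # eq. A.5
print(f"f10={f10:.3f} f01={f01:.3f} f30={f30:.3f} f03={f03:.3f}")
print("ordering f01<f10 and f03<f30:", f01 < f10 and f03 < f30)
```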
Acknowledgments

This research was supported by NIMH (Neurosciences Research Branch) grants MH42900 and KO5 MH01386 and by the Human Frontiers Science Program. Further, we wish to thank Hermann Haken for many interesting discussions. V. K. J. gratefully acknowledges a fellowship from the Deutsche Forschungsgemeinschaft.

References

Abeles, M. (1991). Corticonics. New York: Cambridge University Press.
Braitenberg, V., & Schüz, A. (1991). Anatomy of the cortex: Statistics and geometry. Berlin: Springer-Verlag.
Buchanan, J. J., Kelso, J. A. S., & de Guzman, G. C. (1997). The self-organization of trajectory formation: I. Experimental evidence. Biol. Cybern., 76, 257–273.
Carson, R., Byblow, W., & Goodman, D. (1994). The dynamical substructure of bimanual coordination. In S. Swinnen, H. Heuer, J. Massion, & P. Casaer (Eds.), Interlimb coordination: Neural, dynamical and cognitive constraints (pp. 319–337). San Diego: Academic Press.
Collins, J. J., & Stewart, I. N. (1993). Coupled nonlinear oscillators and the symmetries of animal gaits. J. Nonlinear Sci., 3, 349–392.
Cross, M. C., & Hohenberg, P. C. (1993). Pattern formation outside of equilibrium. Rev. Mod. Phys., 65, 851.
Felleman, D. J., & Van Essen, D. C. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1, 1–47.
Freeman, W. J. (1992). Tutorial on neurobiology: From single neurons to brain chaos. Inter. J. Bif. Chaos, 2, 451–482.
Friedrich, R., Fuchs, A., & Haken, H. (1991). Spatiotemporal EEG patterns. In H. Haken & H. P. Koepchen (Eds.), Rhythms in physiological systems. Berlin: Springer-Verlag.
Fuchs, A., Kelso, J. A. S., & Haken, H. (1992). Phase transitions in the human brain: Spatial mode dynamics. Inter. J. Bif. Chaos, 2, 917–939.
Fuchs, A., & Haken, H. (1988). Pattern recognition and associative memory as dynamical processes in a synergetic system I + II, erratum. Biol. Cybern., 60, 17–22, 107–109, 476.
Haken, H. (1983). Synergetics: An introduction (3rd ed.). Berlin: Springer-Verlag.
Haken, H. (1987). Advanced synergetics (2nd ed.). Berlin: Springer-Verlag.
Haken, H. (1991). Synergetic computers and cognition: A top-down approach to neural nets. Berlin: Springer-Verlag.
Haken, H. (1996). Principles of brain functioning. Berlin: Springer-Verlag.
Haken, H., Kelso, J. A. S., & Bunz, H. (1985). A theoretical model of phase transitions in human hand movements. Biol. Cybern., 51, 347–356.
Jirsa, V. K., Friedrich, R., & Haken, H. (1995). Reconstruction of the spatiotemporal dynamics of a human magnetoencephalogram. Physica D, 89, 100–122.
Jirsa, V. K., & Haken, H. (1996a). Field theory of electromagnetic brain activity. Phys. Rev. Lett., 77, 960.
Jirsa, V. K., & Haken, H. (1996b). Derivation of a field equation of brain activity. J. Biol. Phys., 22, 101–112.
Jirsa, V. K., & Haken, H. (1997). A derivation of a macroscopic field theory of the brain from the quasi-microscopic neural dynamics. Physica D, 99, 503–526.
Jirsa, V. K., Friedrich, R., Haken, H., & Kelso, J. A. S. (1994). A theoretical model of phase transitions in the human brain. Biol. Cybern., 71, 27–35.
Kelso, J. A. S. (1981). On the oscillatory basis of movement. Bull. Psychon. Soc., 18, 63.
Kelso, J. A. S. (1984). Phase transitions and critical behavior in human bimanual coordination. Am. J. Physiol., 15, R1000–R1004.
Kelso, J. A. S. (1995). Dynamic patterns: The self-organization of brain and behavior. Cambridge, MA: MIT Press.
Kelso, J. A. S., Bressler, S. L., Buchanan, S., DeGuzman, G. C., Ding, M., Fuchs, A., & Holroyd, T. (1992). A phase transition in human brain and behavior. Phys. Lett. A, 169, 134–144.
Kelso, J. A. S., Buchanan, J. J., & Wallace, S. A. (1991). Order parameters for the
neural organization of single, multijoint limb movement patterns. Exp. Brain Res., 85, 432–444.
Kelso, J. A. S., Fuchs, A., Holroyd, T., Cheyne, D., & Weinberg, H. (1994). Bifurcations in human brain and behavior. Society for Neuroscience, 20, 444.
Kelso, J. A. S., Scholz, J. P., & Schöner, G. (1986). Nonequilibrium phase transitions in coordinated biological motion: Critical fluctuations. Phys. Lett. A, 118, 279–284.
Kristeva, R., Cheyne, D., & Deecke, L. (1991). Neuromagnetic fields accompanying unilateral and bilateral voluntary movements: Topography and analysis of cortical sources. Electroenceph. Clin. Neurophys., 81, 284–298.
Makeig, S., Bell, A. J., Jung, T. P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiology, 15, 241–247.
Nunez, P. L. (1974). The brain wave equation: A model for the EEG. Mathematical Biosciences, 21, 279–297.
Nunez, P. L. (1995). Neocortical dynamics and human EEG rhythms. New York: Oxford University Press.
Scholz, J. P., Kelso, J. A. S., & Schöner, G. (1987). Nonequilibrium phase transitions in coordinated biological motion: Critical slowing down and switching time. Phys. Lett. A, 123, 390–394.
Schöner, G., Jiang, W. Y., & Kelso, J. A. S. (1990). A synergetic theory of quadrupedal gaits and gait transitions. J. Theor. Biol., 142, 359–391.
Tuller, B., & Kelso, J. A. S. (1989). Environmentally-specified patterns of movement coordination in normal and split-brain subjects. Exp. Brain Res., 75, 306–316.
Wallenstein, G. V., Kelso, J. A. S., & Bressler, S. L. (1995). Phase transitions in spatiotemporal patterns of brain activity and behavior. Physica D, 84, 626–634.
Wiesendanger, M., Wicki, U., & Rouiller, E. (1994). Are there unifying structures in the brain responsible for interlimb coordination? In S. Swinnen, H. Heuer, J. Massion, & P. Casaer (Eds.), Interlimb coordination: Neural, dynamical and cognitive constraints (pp. 179–207). San Diego: Academic Press.
Williamson, S. J., & Kaufman, L. (1987). Analysis of neuromagnetic signals. In A. S. Gevins & A. Remond (Eds.), Methods of analysis of brain electrical and magnetic signals. EEG handbook. Amsterdam: Elsevier Science.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.

Received March 5, 1997; accepted February 4, 1998.
LETTER
Communicated by Alexandre Pouget
Constructive Incremental Learning from Only Local Information

Stefan Schaal
Department of Computer Science, University of Southern California, Los Angeles, CA 90089-2520, U.S.A., and Kawato Dynamic Brain Project (ERATO/JST), 619-02 Kyoto, Japan
Christopher G. Atkeson College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, U.S.A., and ATR Human Information Processing Laboratories, 619-02 Kyoto, Japan
Neural Computation 10, 2047–2084 (1998) © 1998 Massachusetts Institute of Technology

We introduce a constructive, incremental learning system for regression problems that models data by means of spatially localized linear models. In contrast to other approaches, the size and shape of the receptive field of each locally linear model, as well as the parameters of the locally linear model itself, are learned independently, that is, without the need for competition or any other kind of communication. Independent learning is accomplished by incrementally minimizing a weighted local cross-validation error. As a result, we obtain a learning system that can allocate resources as needed while dealing with the bias-variance dilemma in a principled way. The spatial localization of the linear models increases robustness toward negative interference. Our learning system can be interpreted as a nonparametric adaptive bandwidth smoother, as a mixture of experts where the experts are trained in isolation, and as a learning system that profits from combining independent expert knowledge on the same problem. This article illustrates the potential learning capabilities of purely local learning and offers an interesting and powerful approach to learning with receptive fields. 1 Introduction Learning with spatially localized basis functions has become a popular paradigm in machine learning and neurobiological modeling. In the context of radial basis function networks (Moody & Darken, 1988; Poggio & Girosi, 1990), it was demonstrated that such local learning offers an alternative to learning with global basis functions, such as sigmoidal neural networks, and that its theoretical foundation can be solidly grounded in approximation theory (Powell, 1987). In neurophysiological studies, the concept of localized information processing in the form of receptive fields has been known since at least the work of Mountcastle (1957) and Hubel and Wiesel
(1959). Since then, a wealth of experimental evidence has accumulated that suggests that information processing based on local receptive fields is a ubiquitous organizational principle in neurobiology that offers interesting computational opportunities (Lee, Rohrer, & Sparks, 1988; Georgopoulos, 1991; Field, 1994; Olshausen & Field, 1996; Daugman & Downing, 1995). In this article we explore the computational power of local, receptive field–based incremental learning with the goal of approximating unknown functional relationships between incoming streams of input and output data. By incremental learning we do not mean just that the parameters of the learning system are updated incrementally. We want to address a learning scenario in which limited memory is available such that after a new data point is incorporated in the learning system, it is discarded and cannot be reused, in which input and output distributions of the data are unknown, and in which these distributions may change over time. This situation resembles the learning of sensory and sensorimotor transformations in biology, and it also applies to a variety of artificial domains, ranging from autonomous robotic systems to process control. Given these constraints on incremental learning, two major problems need to be addressed. The first is how to allocate the appropriate number of resources, such as receptive fields, in order to deal with the trade-off between overfitting and oversmoothing, called the bias-variance dilemma (Geman, Bienenstock, & Doursat, 1992). The second problem of incremental learning comes from negative interference: the forgetting of useful knowledge while learning from new data. Methods to prevent negative interference require validation data sets, memorizing of all training data, or strong prior knowledge about the learning problem. However, none of these alternatives is available in the setting we have described as we want to avoid storing data and do not have much knowledge about the structure of the learning task. In order to address the problems of incremental learning, we resort to techniques from nonparametric statistics (Scott, 1992; Hastie & Tibshirani, 1990). Nearest-neighbor algorithms for pattern recognition and Parzen windows for density estimation are among the best-known methods out of this field (Duda & Hart, 1973). It is interesting to note that many nonparametric methods are essentially receptive field based: predictions are made using data from a restricted local neighborhood around the query point. The size of the neighborhood can be irregular, as typically is the case in nearest-neighbor approaches, or it can be a symmetric smooth weighting function as in Parzen windows. Receptive fields in nonparametric regression are mostly built on the fly and are discarded right after the prediction, a paradigm that has been termed lazy learning (Aha, 1997). Necessarily, such nonparametric methods need to store training data. Another characteristic is that predictions are usually based on a single receptive field. This property inspired the field of nonparametric regression to pursue more complex models in a receptive field, for instance, low-order polynomials (Cleveland, 1979; Cleveland & Loader, 1995). In contrast, many neural network algo-
rithms, such as radial basis function systems, focused on combining the activation strengths of many receptive fields to optimize predictions. In this article, we demonstrate how a nonparametric regression approach can be used to build a receptive field–based learning system for incremental function approximation without the need to store the training data and without discarding receptive fields after using them. A locally linear model will be fitted incrementally within each receptive field such that local function approximation is accomplished in the spirit of a Taylor series expansion. A new property of this learning approach is that each receptive field is trained independent of all other receptive fields, thereby adjusting the parameters of its locally linear model, the size and shape of its receptive field, and the bias on the relevance on its individual input dimensions. New receptive fields are allocated as needed. The resulting algorithm, receptive field–weighted regression (RFWR), achieves robust incremental learning. It also has some interesting relations to previously suggested learning methods. It can be interpreted as a mixture of experts system (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994) where the experts are trained in isolation. It can also be interpreted as a system where a set of experts is trained independently on the same problem and that profits from combining these experts for making predictions (Perrone & Cooper, 1993). Finally, RFWR can be interpreted as a nonparametric memory-based learner (Atkeson, Moore, & Schaal, 1997a) which stores only data that are surprising. In the next section, we give some motivation for our approach to incremental learning. Section 3 describes the details of our nonparametric incremental learning system and outlines some of its statistical characteristics. Section 4 discusses a variety of empirical evaluations. Section 5 outlines related work, and Section 6 concludes this article. 2 Incremental Learning 2.1 Statistical Assumptions. The assumed statistical model of our problems is the standard regression model, y = f (x) + ε,
(2.1)
where x ∈ R^n denotes the n-dimensional vector of input variables, y ∈ R^m the m-dimensional vector of output variables, and f(·) a deterministic vector-valued function mapping the input x to the output y. The additive random noise ε is assumed to be independently distributed, E{ε_i ε_j} = 0 for i ≠ j, and mean zero, E{ε | x} = 0, but otherwise of unknown distribution (E{·} denotes the expectation operator). The input data are distributed according to the density p(x). 2.2 Localizing Interference. Interference in learning is a natural side effect of the ability to generalize, that is, to interpolate or extrapolate an
output for an unseen input from previously learned data. Generalization is accomplished by allowing changes to the parameters of the learning system to have nonlocal effects. If these effects reduce the overall correctness of predictions to a larger extent than they improve them, interference is called negative or even catastrophic. Incremental learning is particularly endangered by negative interference because there is no direct way to balance the amount of positive interference (that is, generalization) with the amount of negative interference. Any parameter update is usually greedy; its only concern is with the reduction of the error of the current piece of training data. To see the statistical causes of interference, consider using the mean squared error criterion J to select a model f̂(·) to approximate the true function f(·):

J = E\left\{ \| y - \hat{f}(x) \|^2 \right\} = \int_{-\infty}^{+\infty} \| y - \hat{f}(x) \|^2 \, p(x, y) \, dx \, dy = \int_{-\infty}^{+\infty} \| y - \hat{f}(x) \|^2 \, p(y \mid x) \, p(x) \, dx \, dy.  (2.2)
This equation states that, in general, the approximation result for f̂(·) depends on both the conditional distribution p(y | x) and the input distribution p(x) of the data (Fan & Gijbels, 1996). Only for an infinite amount of training data, f̂(·) will asymptotically depend solely on p(y | x) (Papoulis, 1991):

\hat{f}(x) = E\{ y \mid x \} = \int_{-\infty}^{+\infty} y \, p(y \mid x) \, dy.  (2.3)
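This dependence is easy to demonstrate numerically. The following sketch (our example, not code from the article; it assumes NumPy and an arbitrary nonlinear test function) fits a single global linear model to data with an identical conditional distribution p(y | x) but two different input densities p(x); the two fits differ markedly even though the underlying function never changed.

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # An arbitrary nonlinear target; any nonlinear f shows the effect.
    return np.sin(2 * x) + 2 * np.exp(-16 * x ** 2)

def linear_fit(x, y):
    # Global least-squares line y ~ a*x + b, i.e., f_hat under finite data.
    X = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Identical p(y | x), two different input densities p(x).
x1 = rng.uniform(-2.0, 0.5, 200)
x2 = rng.uniform(0.5, 2.0, 200)
a1, b1 = linear_fit(x1, f(x1) + 0.1 * rng.standard_normal(x1.size))
a2, b2 = linear_fit(x2, f(x2) + 0.1 * rng.standard_normal(x2.size))
print((a1, b1), (a2, b2))  # two quite different global models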
Thus, for a finite amount of data, a stable model f̂(·) can be obtained only if neither of these distributions changes during learning. These considerations point toward the two major causes for negative interference. If p(y | x) changes, that is, the functional relationship between x and y is nonstationary, the parameters in a learning system may have to change. Analogously, if the data for learning are not sampled from a fixed input distribution p(x), the parameters of the learning system may also change. It is particularly a change of the input distribution p(x) that is likely to happen in incremental learning. Imagine a robot learning an inverse dynamics model of its arm, a model that maps joint positions, joint velocities, and joint accelerations to corresponding joint torques. Whenever the robot moves, it receives valid data about this functional relationship. However, since the robot is fulfilling different tasks at different times, the sampled data will come from quite different input distributions; for example, consider the difference between movements for cooking and movements for playing tennis. One of the interesting properties of learning with localized receptive fields lies in their potential robustness toward interference. If learning is spatially localized—training data at one location have a negligible effect
Figure 1: (a) Results of function approximation of the function y = sin(2x) + 2 exp(−16x^2) + N(0, 0.16) with a sigmoidal neural network. (b) Results of function approximation by a local receptive field–based algorithm, fitting locally linear models in each receptive field. Note that the data traces "true y," "predicted y," and "predicted y after new training data" largely coincide. (c) The organization of the (gaussian) receptive fields of (b) after training.
on the parameters of distant receptive fields—interference will be spatially localized as well. Figure 1 gives an illustration of this effect. Using a synthetic data set suggested by Fan and Gijbels (1995), we trained a three-layer sigmoidal feedforward neural network (six hidden units, using backpropagation with momentum) on 130 noisy data points uniformly distributed in x ∈ [−2.0, 0.5]. The excellent function fit obtained is shown by the “predicted y” trace in Figure 1a. Then we continued training the network on 70 new data points drawn from the same function but with a changed input distribution x ∈ [0.5, 2.0]. The network learned to accommodate these new data points, but in doing so, it also significantly changed its predictions for the previously learned data, although these data are largely separate from the new training data. This effect is due to the nonlocal nature of sigmoidal basis functions and is prone to lead to catastrophic interference, as shown in Figure 1a. We repeated the same experiment with our receptive field–based learning system, RFWR, which generates locally linear models in each receptive field and blends them for predictions (see Figures 1b and 1c). On the original training data, RFWR achieves comparable results to that of the sigmoidal neural network. After training on the new data, however, no interference
is apparent. The original fit in the left part of the graph was not visibly altered, in contrast to the neural network. Robustness toward negative interference is accomplished by localizing interference, the best we can do since interference cannot be eliminated for finite data samples. 2.3 Avoiding the Problem of Resource Allocation. Due to the bias-variance trade-off (Geman et al., 1992), learning algorithms have to include a model selection phase in order to find an appropriate compromise between oversmoothing and overfitting. Usually this is accomplished by setting certain meta parameters, for instance, the number of hidden units in a neural network, according to some model selection criterion, such as cross-validation (Stone, 1974). A question frequently asked in model selection (Bishop, 1996) thus becomes: "How many free parameters should be allocated in order to achieve (in expectation) a good bias-variance trade-off?" However, another approach can be pursued: "Given a fixed number of free parameters, how should a given data set be spatially limited in order to achieve (in expectation) a good bias-variance trade-off for the remaining data?" Instead of adapting the complexity of the learning system, one can also adapt the complexity of the region the data are drawn from. For general nonlinear function approximators, it is unclear how to answer this question. For spatially localized function fitting, however, this question translates into: "How should the extent of a receptive field be changed in order to make its associated parametric model fit the data appropriately?" Such an approach transforms the global bias-variance problem into a local one. The advantage of having each receptive field deal with the bias-variance trade-off individually lies in avoiding the resource allocation problem. In the spirit of a Taylor series expansion, let us assume that we know how to adjust the region of validity, that is, the size and shape of a receptive field, of each locally linear model such that its approximation error at the center—its bias θk—is determined by an optimal bias-variance trade-off (see Figure 2a). In order to approximate the entire nonlinear function, we have to cover the input space with sufficiently many locally linear models such that every data point is handled by at least one of them. Importantly, it does not matter whether we allocate too many local models. If we restrict extrapolation of the linear models to the θk bound (which actually corresponds to a minimal activation strength of a receptive field), an average of the outputs of all k linear models at a query point xq cannot have a larger error than max(θk), as illustrated in Figure 2b. Indeed, allocating too many local models actually has a positive effect. Due to averaging, more overlapping linear models will tend to improve function estimates in the spirit of and with the same limitations as in ensemble methods (Perrone & Cooper, 1993). Although deriving an optimal local bias-variance trade-off remains hard (Friedman, 1984; Fan & Gijbels, 1996), the local nature of the problem allows new ways to find at least satisfying and computationally affordable solutions. Section 3 will demonstrate how a stochastic approximation of
Figure 2: (a) Region of validity of a linear model and its approximation bias θk. (b) Function approximation with piecewise linear models.
local leave-one-out cross-validation in conjunction with a regularization approach can be used to realize the local bias-variance trade-off, and even to control approximately the expected bias θk in each local model. 2.4 Summary. Given the discussion of the last two sections, a promising route to robust incremental learning seems to be a local receptive field–based system that can also adjust the extent of its receptive fields. However, care must be taken how one goes about accomplishing this goal. Learning methods based on competitive learning usually do not achieve the properties described in the previous section. In competitive learning, the size of a receptive field results from a global competition process of all local models to account for the training data. Therefore, changing the number of local models causes a change of the extent of all receptive fields such that the number of local models becomes a critical choice for the bias-variance trade-off—exactly what we would wish to avoid. The next section explains how an alternative approach based on nonparametric statistics offers a route to achieve our goals without resorting to competitive learning. 3 Receptive Field–Weighted Regression RFWR constructs a system of receptive fields for incremental function approximation. A prediction ŷ for a query point x is built from the normalized weighted sum of the individual predictions ŷk of all receptive fields:

\hat{y} = \frac{\sum_{k=1}^{K} w_k \hat{y}_k}{\sum_{k=1}^{K} w_k}.  (3.1)
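In code, the blending of equation 3.1 is a single normalized weighted average. A minimal sketch under our naming (assuming NumPy; the activation strengths w_k are defined next in equation 3.2):

import numpy as np

def blend_predictions(w, y_hat):
    # Equation 3.1: w is a (K,) vector of activation strengths w_k,
    # y_hat a (K, m) array of local predictions; returns the (m,) output.
    w = np.asarray(w, dtype=float)
    return w @ np.asarray(y_hat) / w.sum()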
The weights wk correspond to the activation strengths of the corresponding receptive fields. They are determined from the size and shape of each receptive field, characterized by a kernel function. A variety of possible kernels have been suggested (Atkeson, Moore, and Schaal, 1997a). For analytical convenience, we use a gaussian kernel,

w_k = \exp\left( -\frac{1}{2} (x - c_k)^T D_k (x - c_k) \right), \quad \text{where } D_k = M_k^T M_k,  (3.2)
which parameterizes the receptive field by its location in input space, ck ∈ R^n, and a positive definite distance metric Dk, determining the size and shape of the receptive field. For algorithmic reasons, it is convenient to generate Dk from an upper triangular matrix Mk in order to ensure that Dk is positive definite. Within each receptive field, a simple parametric function models the relationship between input and output data. Local polynomials of low order have found widespread use in nonparametric statistics (Nadaraya, 1964; Watson, 1964; Wahba & Wold, 1975; Cleveland, 1979; Cleveland & Devlin, 1988). We will focus on locally linear models because they accomplish a favorable compromise between computational complexity and quality of result (Hastie & Loader, 1993):

\hat{y}_k = (x - c_k)^T b_k + b_{0,k} = \tilde{x}^T \beta_k, \quad \tilde{x} = ((x - c_k)^T, 1)^T,  (3.3)
where βk denotes the parameters of the locally linear model and x̃ is a compact form of the center-subtracted, augmented input vector to simplify the notation. To clarify the elements and parameters of RFWR, Figure 3 gives a networklike illustration for a single output system. The inputs are routed to all receptive fields, each of which consists of a linear and a gaussian unit. The learning algorithm of RFWR determines the parameters ck, Mk, and βk for each receptive field independently—without any information about the other receptive fields, in contrast to competitive learning. RFWR adds and prunes receptive fields as needed, such that the number of receptive fields, K, will automatically adjust to the learning problem at hand. A one-dimensional example of function fitting with RFWR was shown in Figures 1b and 1c. It should be noted that the size of each receptive field adapted according to the local curvature of the function, that there is a certain amount of overlap between the receptive fields, and that the center locations have not been chosen with respect to any explicit optimization criterion. 3.1 Learning with RFWR. Three ingredients of the algorithm need to be discussed: the update of the linear model parameters βk, the decomposed distance metric Mk, and when and where to add and prune receptive fields. The centers ck are not changed after they are allocated.
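A sketch of the two units making up one receptive field, following equations 3.2 and 3.3 (our code and naming, not the article's implementation; assuming NumPy):

import numpy as np

def rf_activation(x, c, M):
    # Equation 3.2: w = exp(-(1/2) (x - c)^T D (x - c)) with D = M^T M,
    # which is positive (semi)definite by construction.
    D = M.T @ M
    d = x - c
    return float(np.exp(-0.5 * d @ D @ d))

def rf_prediction(x, c, beta):
    # Equation 3.3: locally linear model on the center-subtracted,
    # augmented input x_tilde = ((x - c)^T, 1)^T.
    x_tilde = np.append(x - c, 1.0)
    return x_tilde @ beta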
Figure 3: A network illustration of receptive field–weighted regression.

For the sake of clarity,
we will drop the subscript k whenever we deal with one receptive field at a time from now on since each receptive field is updated in the same way. 3.1.1 Learning the Linear Model. Learning of β is straightforward since the problem is linear in β. It will be useful to leave the incremental learning framework for a moment and think in terms of a batch update. If we summarize the input part of all p training data points in the rows of the matrix X = (x̃_1, x̃_2, . . . , x̃_p)^T, the corresponding output part in the rows of the matrix Y = (y_1, y_2, . . . , y_p)^T, and the corresponding weights in the diagonal matrix W = diag(w_1, w_2, . . . , w_p), the parameter vector β can be calculated from a weighted regression:

\beta = (X^T W X)^{-1} X^T W Y = P X^T W Y.  (3.4)
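In batch form, equation 3.4 is one weighted least-squares solve per receptive field (a sketch under our naming, assuming NumPy):

import numpy as np

def batch_lwr(X, Y, w):
    # Equation 3.4: beta = (X^T W X)^{-1} X^T W Y.
    # X: (p, n+1) rows of augmented inputs x_tilde_i^T
    # Y: (p, m)   rows of outputs y_i^T
    # w: (p,)     receptive-field activations of the training points
    XTW = X.T * w                # X^T W with W = diag(w)
    P = np.linalg.inv(XTW @ X)   # inverted weighted covariance matrix
    return P @ XTW @ Y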
This kind of locally weighted regression has found extensive application in nonparametric statistics (Cleveland, 1979; Cleveland & Loader, 1995), in time-series prediction (Farmer & Sidorowich, 1987, 1988), and in regression
learning problems (Atkeson, 1989b; Moore, 1991; Schaal & Atkeson, 1994a; Atkeson et al., 1997a). The result for β in Equation 3.4 is exactly the same when β is calculated by recursive least squares from one sequential sweep through the training data (Ljung & Söderström, 1986). Given a training point (x, y), the incremental update of β yields:

\beta^{n+1} = \beta^{n} + w\, P^{n+1} \tilde{x}\, e_{cv}^T, \quad \text{where } P^{n+1} = \frac{1}{\lambda} \left( P^{n} - \frac{P^{n} \tilde{x} \tilde{x}^T P^{n}}{\frac{\lambda}{w} + \tilde{x}^T P^{n} \tilde{x}} \right) \text{ and } e_{cv} = \left( y - {\beta^{n}}^T \tilde{x} \right).  (3.5)
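A direct transcription of the recursive update in equation 3.5 (our sketch; beta is (n+1) × m, P is (n+1) × (n+1), and lam = 1 corresponds to no forgetting):

import numpy as np

def rls_update(beta, P, x_tilde, y, w, lam=1.0):
    # Equation 3.5: weighted recursive least squares with forgetting factor lam.
    e_cv = y - beta.T @ x_tilde                  # error before the update
    Px = P @ x_tilde
    P_new = (P - np.outer(Px, Px) / (lam / w + x_tilde @ Px)) / lam
    beta_new = beta + w * np.outer(P_new @ x_tilde, e_cv)
    return beta_new, P_new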
This update is employed by RFWR. It is useful to note that recursive least squares corresponds to a Newton training method with guaranteed convergence to the global minimum of, in our case, a weighted squared error criterion (Atkeson et al., 1997a). Furthermore, the recursive update avoids an explicit matrix inversion. Differing from the batch update in equation 3.4, equation 3.5 also includes a forgetting factor λ. Changes to the decomposed distance metric M during learning (see below) will change the weights w. For this reason, it is necessary to include λ in equation 3.5 in order to cancel the contributions gradually from previous data points where M was not yet learned properly (Ljung & Söderström, 1986). 3.1.2 Learning the Shape and Size of the Receptive Field. Adjusting the shape and size of the receptive field is accomplished by adjusting the decomposed distance metric M. At first glance, one might hope that this could be done by gradient descent in the weighted mean squared error criterion,

J = \frac{1}{W} \sum_{i=1}^{p} w_i \left\| y_i - \hat{y}_i \right\|^2, \quad \text{where } W = \sum_{i=1}^{p} w_i,  (3.6)
which is the basis of the solution of locally weighted regression in equation 3.4 (Atkeson et al., 1997a). Unfortunately, minimizing equation 3.6 may result in a quite inappropriate solution. If for each training point one receptive field is centered right on this point, and the corresponding M is chosen such that the receptive field is so narrow that it is activated only by this data point, the corresponding linear model can fit this one data point with zero error. The function approximation result would strongly tend toward overfitting. It is this property that has made learning algorithms resort to competitive learning with a fixed number of local receptive fields. The global competitive process will prevent receptive fields from modeling just one data point (assuming there are more data points than receptive fields) (Moody & Darken, 1988; Jordan & Jacobs, 1994). But allowing for such a global competitive process takes away the property of being a local learner, even if the receptive fields are actually spatially localized.
An alternative way to address this overfitting effect is to use leave-one-out cross-validation. The cost function to be minimized changes from equation 3.6 to

J = \frac{1}{W} \sum_{i=1}^{p} w_i \left\| y_i - \hat{y}_{i,-i} \right\|^2.  (3.7)
The notation ŷ_{i,−i} denotes that the prediction of the ith data point is calculated from training the learning system with the ith data point excluded from the training set. Thus, it becomes inappropriate for a receptive field to focus on just one training point since the error measure is calculated from data that did not exist in the training set. Leave-one-out cross-validation is usually computationally very expensive since a p-fold training of the learning system is required, for p data points in the training set. Furthermore, for example, for a sigmoidal neural network, it might be unclear how to combine the resultant p different learned parameters into a single solution. However, for linear regression problems, there is a result rendering these concerns irrelevant. Due to the Sherman-Morrison-Woodbury theorem (Belsley, Kuh, & Welsh, 1980), equation 3.7 can be rewritten as:

J = \frac{1}{W} \sum_{i=1}^{p} w_i \left\| y_i - \hat{y}_{i,-i} \right\|^2 = \frac{1}{W} \sum_{i=1}^{p} \frac{w_i \left\| y_i - \hat{y}_i \right\|^2}{(1 - w_i \tilde{x}_i^T P \tilde{x}_i)^2}.  (3.8)
This equation states that the leave-one-out cross-validation error can be obtained without p-fold training of the learning system, instead by an adjustment of the weighted mean squared error with the help of the inverted covariance matrix P (cf. equation 3.4). Equation 3.8 corresponds to a weighted version of the PRESS residual error in standard linear regression techniques (Myers, 1990). Neglecting for a moment how this cost function can be minimized incrementally, we have obtained a criterion that can be used to adjust M (Schaal & Atkeson, 1994b). Unfortunately, there is still a point of concern with equation 3.8. Minimizing the locally weighted leave-one-out cross-validation error results in a consistent learning system; with an increasing number of training data, the receptive fields will shrink to a very small size. The advantage of this behavior is that function approximation becomes asymptotically unbiased (i.e., consistent), but as a disadvantage, an ever-increasing number of receptive fields will be required to represent the approximated function. This property can be avoided by introducing a penalty term in equation 3.8:

J = \frac{1}{W} \sum_{i=1}^{p} \frac{w_i \left\| y_i - \hat{y}_i \right\|^2}{(1 - w_i \tilde{x}_i^T P \tilde{x}_i)^2} + \gamma \sum_{i,j=1}^{n} D_{ij}^2,  (3.9)
where the scalar γ determines the strength of the penalty. By penalizing the sum of squared coefficients of the distance metric D, we are essentially pe-
nalizing the second derivatives of the function at the site of a receptive field. This is similar to approaches taken in spline fitting (deBoor, 1978; Wahba, 1990) and acts like a low-pass filter: the higher the second derivatives, the more smoothing (and thus bias) will be introduced locally. Another positive effect of the penalty term is that the introduction of bias reduces the variance of the function estimate, a problem usually associated with local function fitting methods (Friedman, 1984). Section 3.3 will outline the properties of equation 3.9 in more detail. What remains is how to minimize equation 3.9 incrementally by adjusting M by gradient descent with learning rate α:

M^{n+1} = M^{n} - \alpha \frac{\partial J}{\partial M}.  (3.10)
Applying the chain rule, the derivative of equation 3.10 can be written as:

\frac{\partial J}{\partial M} = \frac{\partial}{\partial M} \left( \sum_{i=1}^{p} \frac{w_i \left\| y_i - \hat{y}_i \right\|^2}{W (1 - w_i \tilde{x}_i^T P \tilde{x}_i)^2} + \gamma \sum_{i,j=1}^{n} D_{ij}^2 \right) = \frac{\partial}{\partial M} \left( \sum_{i=1}^{p} J_{1,i} + J_2 \right) = \sum_{j=1}^{p} \left( \sum_{i=1}^{p} \frac{\partial J_{1,i}}{\partial w_j} \frac{\partial w_j}{\partial M} \right) + \frac{\partial J_2}{\partial M}.  (3.11)
Without storing data in incremental learning, we cannot use cross-validation and thus cannot obtain the true gradient in equation 3.11. The usual approach to deriving a stochastic gradient would be to drop the two sums in equation 3.11. However, this approximate gradient would be quite inaccurate since the first term of that equation would always be positive: shrinking the receptive field reduces the weight of a data point and thus its contribution to the weighted error. It turns out that we are able to derive a much better stochastic approximation. Given one training point (x, y) and its associated weight w from equation 3.2, the derivative for this point can be approximated as:

\frac{\partial J}{\partial M} \approx \sum_{i=1}^{p} \frac{\partial J_{1,i}}{\partial w} \frac{\partial w}{\partial M} + \frac{w}{W} \frac{\partial J_2}{\partial M} = \frac{\partial w}{\partial M} \sum_{i=1}^{p} \frac{\partial J_{1,i}}{\partial w} + \frac{w}{W} \frac{\partial J_2}{\partial M}.  (3.12)
Summing equation 3.12 over all data points and recalling that W stands for the sum of weights (cf. equation 3.6), equation 3.12 can be verified to result in equation 3.11. Despite the term J_{1,i}, it is now possible to obtain an incremental version of the stochastic derivative in equation 3.12 by introducing the "memory traces" W, E, H, and R (see the notation in equation 3.15):

W^{n+1} = \lambda W^{n} + w
E^{n+1} = \lambda E^{n} + w\, e_{cv}^T e_{cv}
H^{n+1} = \lambda H^{n} + \frac{w\, \tilde{x}\, e_{cv}^T}{1 - h}, \quad \text{where } h = w\, \tilde{x}^T P^{n+1} \tilde{x}
R^{n+1} = \lambda R^{n} + \frac{w^2\, e_{cv}^T e_{cv}\, \tilde{x} \tilde{x}^T}{1 - h}.  (3.13)
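In code, the traces of equation 3.13 are exponentially forgotten sufficient statistics (our sketch, reusing e_cv and the updated P from the update in equation 3.5):

import numpy as np

def update_traces(W, E, H, R, w, x_tilde, e_cv, P_new, lam=1.0):
    # Equation 3.13: memory traces that let the stochastic gradient of J
    # account for previously encountered data without storing it.
    h = w * (x_tilde @ P_new @ x_tilde)
    W = lam * W + w
    E = lam * E + w * (e_cv @ e_cv)
    H = lam * H + w * np.outer(x_tilde, e_cv) / (1.0 - h)
    R = lam * R + w ** 2 * (e_cv @ e_cv) * np.outer(x_tilde, x_tilde) / (1.0 - h)
    return W, E, H, R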
The resulting incremental version of the derivative (see equation 3.12) becomes:

\frac{\partial J}{\partial M} \approx \frac{\partial w}{\partial M} \sum_{i=1}^{p} \frac{\partial J_{1,i}}{\partial w} + \frac{w}{W^{n+1}} \frac{\partial J_2}{\partial M},  (3.14)

where:

\frac{\partial w}{\partial M_{rl}} = -\frac{1}{2} w (x - c)^T \frac{\partial D}{\partial M_{rl}} (x - c),
\frac{\partial D_{ij}}{\partial M_{rl}} = \delta_{rj} M_{il} + \delta_{ir} M_{jl} \quad (\delta \text{ is the Kronecker operator}),
\frac{\partial J_2}{\partial M_{rl}} = 2 \gamma \sum_{i,j=1}^{n} D_{ij} \frac{\partial D_{ij}}{\partial M_{rl}},
\sum_{i=1}^{p} \frac{\partial J_{1,i}}{\partial w} \approx -\frac{E^{n+1}}{(W^{n+1})^2} + \frac{1}{W^{n+1}} \left( e_{cv}^T e_{cv} - \left( 2 P^{n+1} \tilde{x}\, (y - \tilde{x}^T \beta^{n+1})^T \right) \otimes H^{n} - \left( 2 P^{n+1} \tilde{x} \tilde{x}^T P^{n+1} \right) \otimes R^{n} \right).
Deriving this derivative is possible due to the fact that an application of the Sherman-Morrison-Woodbury theorem allows us to take derivatives through the inverted covariance matrix P (Belsley et al., 1980; Atkeson & Schaal, 1995), and that a sum of the form \sum v_i^T Q v_i can be written as \sum v_i^T Q v_i = Q \otimes \sum v_i v_i^T, where the operator ⊗ denotes an element-wise multiplication of two homomorphic matrices or vectors with a subsequent summation of all coefficients, Q \otimes V = \sum_{i,j} Q_{ij} V_{ij}. It is interesting to note that the stochastic derivative (see equation 3.14) is not just concerned with reducing the error of the current training point, as in many other learning algorithms, but rather that it takes into account the previously encountered training data, too, through the memory traces (see equation 3.13). Thus, both the β and M update in RFWR are not greedy with respect to the current training sample, a characteristic that will contribute favorably to speed and robustness of incremental learning. 3.1.3 Adding Receptive Fields and Automatic Bias Adjustment. A new receptive field is created if a training sample (x, y) does not activate any existing receptive field by more than a threshold wgen. The center of the new
receptive field becomes c = x, M is set to a manually chosen default value, Mdef, and all other parameters are initialized to zero, except the matrix P. P corresponds to an inverted covariance matrix of the weighted inputs (treating the constant input 1 as the (n + 1)th input). A suitable initialization of P is as a diagonal matrix, the diagonal elements set to P_{ii} = 1/r_i^2, where the coefficients r_i are usually small quantities, such as 0.001 (Ljung & Söderström, 1986). We summarize all r_i in the (n + 1)-dimensional vector r = (r_1, r_2, . . . , r_{n+1})^T. The parameters r have an interesting statistical interpretation: they introduce bias in the regression coefficients β and correspond to one of the common forms of biased regression, ridge regression (Belsley et al., 1980). From a probabilistic point of view, they are Bayesian priors that the coefficients of β are zero. From an algorithmic perspective, they are fake data points of the form [x_r = (0, . . . , r_i^2, 0, . . .)^T, y_r = 0] (Atkeson et al., 1997a). Under normal circumstances, the sizes of the coefficients of r are too small to introduce noticeable bias. However, ridge regression parameters are important if the input data are locally rank deficient, that is, the matrix inversion in equation 3.4 is close to singular. For high-dimensional input spaces, it is quite common to have locally rank deficient input data. Although RFWR does not explicitly require matrix inversions, the rank deficiency affects the incremental update in equation 3.5 by generating estimates of β with very large variances, causing unreliable predictions. Nonzero ridge regression parameters reduce this variance, though at the cost of introducing bias. An appropriate compromise can be found by including the ridge parameters as adjustable terms in RFWR using gradient descent in the cost (see equation 3.9):

r^{n+1} = r^{n} - \alpha_r \frac{\partial J}{\partial r}.  (3.15)
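For instance, the initialization described above amounts to the following (our sketch; the gradient update of r in equation 3.15 and the add-back of forgotten bias, detailed in section A.1, are omitted):

import numpy as np

def init_P(n, r=None):
    # P = diag(1 / r_i^2) over the n+1 augmented input dimensions;
    # small r_i (e.g., 0.001) yield a large initial P and hence little bias.
    r = np.full(n + 1, 0.001) if r is None else np.asarray(r)
    return np.diag(1.0 / r ** 2)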
After each update of P, the change in r is added to P. Additionally, it is necessary to add back the fraction of r that was lost due to the forgetting factor λ; bias should not be forgotten over time. These two computations can be performed together and are surprisingly simple. Section A.1 details this update and the stochastic approximation of ∂J/∂r, which is analogous to the derivation of equation 3.14. 3.1.4 Pruning Receptive Fields. The last element in RFWR is a pruning facility. A receptive field is pruned if it overlaps too much with another receptive field. This effect is detected by a training sample activating two receptive fields simultaneously more than wprune. The receptive field with the larger determinant of the distance metric D is pruned. For computational convenience, det(D) can be approximated by \sum_i D_{ii}^2 (Deco & Obradovic, 1996). It should be noted that pruning due to overlap aims primarily at computational efficiency, since, as discussed in section 2.3, overlap does not
degrade the approximation quality. The second cause for pruning is if the bias-adjusted weighted mean squared error,

wMSE = \frac{E^{n}}{W^{n}} - \gamma \sum_{i,j=1}^{n} D_{ij}^2,  (3.16)
of the linear model of a unit is excessively large in comparison to other units. The bias adjustment term can be derived from the asymptotic behavior of RFWR, outlined in section 3.3 and detailed in Schaal and Atkeson (1997). Empirically, there are usually two ways to adjust M in order to minimize equation 3.9. The one we normally want to avoid is M = 0, the zero matrix. It indicates that the receptive field performs global regression instead of locally weighted regression. Global linear regression for a nonlinear function has a large wMSE. A simple outlier detection test among the wMSE of all receptive fields suffices to deal with such behavior. The receptive field is then reinitialized with randomized values. Normally pruning takes place rarely, and if it happens, it is mostly due to an inappropriate initialization of RFWR. 3.1.5 Summary of RFWR. In summary, each receptive field in RFWR has three sets of adjustable parameters: β for the locally linear model, M for the size and shape of the receptive field, and r for the bias. The linear model parameters are updated by a Newton method, and the other parameters are updated by gradient descent. A compact pseudo-code overview of RFWR is shown below.

Initialize the RFWR with no receptive field (RF);
For every new training sample (x, y):
  a) For k = 1 to #RF:
       Calculate the activation from equation 3.2;
       Update the receptive field parameters according to equations 3.10 and 3.15;
     end;
  b) If no subnet was activated by more than wgen:
       Create a new RF with c = x, M = Mdef;
     end;
  c) If two RFs are activated more than wprune:
       Erase the RF with the larger det(D);
     end;
  d) Calculate the m = E{wMSE} and std = E{(wMSE − m)^2}^{0.5} of all RFs;
  e) For k = 1 to #RF:
       If |wMSE − m| > ϕ std:
         Reinitialize receptive field with M = ε Mdef;
       end;
     end;
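Assembled into code, one training step might look as follows (a schematic sketch under our naming, reusing rf_activation and rls_update from the sketches above; the gradient updates of M and r, equations 3.10 and 3.15, and the outlier test are abbreviated to keep the sketch short):

import numpy as np

def make_rf(x, n_out, M_def):
    n = x.size
    return {"c": x.copy(), "M": M_def.copy(),
            "beta": np.zeros((n + 1, n_out)),
            "P": np.diag(np.full(n + 1, 1.0 / 0.001 ** 2))}

def rfwr_step(rfs, x, y, M_def, w_gen=0.1, w_prune=0.9, lam=0.999):
    # One RFWR training step on sample (x, y); only the beta update
    # (equation 3.5) is spelled out, the M and r updates are omitted.
    y = np.atleast_1d(y)
    acts = [rf_activation(x, rf["c"], rf["M"]) for rf in rfs]
    for rf, w in zip(rfs, acts):                      # step a)
        if w < 1e-3:
            continue                                  # negligible activation
        x_t = np.append(x - rf["c"], 1.0)
        rf["beta"], rf["P"] = rls_update(rf["beta"], rf["P"], x_t, y, w, lam)
    if not rfs or max(acts) < w_gen:                  # step b): allocate
        rfs.append(make_rf(x, y.size, M_def))
    hot = [k for k, w in enumerate(acts) if w > w_prune]
    if len(hot) > 1:                                  # step c): prune overlap
        size = lambda k: np.sum(np.diag(rfs[k]["M"].T @ rfs[k]["M"]) ** 2)
        del rfs[max(hot, key=size)]                   # larger det(D), approximated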
The scalar ϕ is a (positive) outlier removal threshold, such as ϕ = 2.576 or ϕ = 3.291 (corresponding to a 99.0% or 99.9% confidence value with respect to a normal distribution), and the scalar ε is a random value ε = 1 + |N(0, 1)|. This choice of ε ensures that the new distance metric will result in a smaller receptive field, which is less likely to converge to an M = 0 solution. It is useful to note that the parameters wprune, wgen, and ϕ can be chosen independently of a particular learning problem and should thus be considered more like constants of the algorithm rather than open parameters. 3.2 Second-Order Gradient Descent. With a little extra computation, it is possible to replace the gradient descent update of M in equation 3.10 by second-order gradient descent to gain learning speed. For this purpose, we adopted Sutton's (1992a, 1992b) incremental delta-bar-delta (IDBD) algorithm. The derivation of the algorithm remains as demonstrated in Sutton (1992a, 1992b), only his standard least squares criterion is replaced by our cost function (see equation 3.9), and we apply IDBD to updating a distance metric. Section A.2 provides the details of the algorithm. It is also possible to apply second-order learning to the ridge regression update (see equation 3.15). Empirically, however, we did not find any significant improvements of doing so and, hence, incorporated second-order updates only for the distance metric in RFWR. 3.3 Asymptotic Properties of RFWR. In Schaal and Atkeson (1996, 1997) we derived the asymptotic properties of RFWR's cost function (see equation 3.9). Here we will just mention some of these results that are directly relevant to this article. Assuming that (1) the number of training data points p goes to infinity, (2) within the range of a receptive field a second-order Taylor series expansion fits the training function sufficiently accurately, (3) the variance of the noise σ^2 is locally constant, and (4) the input distribution is locally uniform, the following statements can be made:

• The penalty term in the cost function (see equation 3.9) introduces nonvanishing bias like a low-pass filter: the higher the second derivatives (Hessian) of the function, the more bias is incurred.

• The estimated locally linear model b is asymptotically unbiased.

• The distance metric D will be a scaled approximation of the Hessian.

• An appropriate penalty term γ for a learning problem can be computed from an estimate of the maximal eigenvalues of the Hessian. This corresponds to a smoothness bias.

• A bias-adjusted weighted mean squared error, wMSE, can be formulated in order to compare the approximation quality of receptive fields. This measure was employed in equation 3.16.
These asymptotic results confirm that the penalty term in the cost function (see equation 3.9) has the desired characteristics as mentioned in sections 2.3 and 3.1.2: receptive fields cannot shrink to zero size, and a controlled amount of bias was introduced. It is interesting that the estimated locally linear model b tends to become unbiased (under the assumption that O(2) errors of the Taylor series are negligible). This implies that applications requiring a gradient estimate from the function approximator can expect reliable results. The calculation of the gradient estimate is a natural by-product of every lookup in RFWR. 4 Simulation Results 4.1 Basic Function Approximation with RFWR. First, we will establish that RFWR is capable of competing with state-of-the-art supervised learning techniques on a fixed training set. A sufficiently complex learning task that can still be illustrated nicely is to approximate the function

z = \max \left\{ e^{-10 x^2},\; e^{-50 y^2},\; 1.25\, e^{-5 (x^2 + y^2)} \right\} + N(0, 0.01),  (4.1)
from a sample of 500 points, drawn uniformly from the unit square. This function consists of a narrow and a wide ridge that are perpendicular to each other and a gaussian bump at the origin (see Figure 4a). Training data are drawn uniformly from the training set without replacement; training time is measured in epochs—multiples of 500 training samples. The test set consists of 1681 data points corresponding to the vertices of a 41 × 41 grid over the unit square; the corresponding output values are the exact function values. The approximation error is measured as a normalized mean squared error, nMSE—the MSE on the test set normalized by the variance of the outputs of the test set. RFWR's initial parameters are set to Mdef = 5 I (I is the identity matrix), γ = 10^{-7}, wgen = 0.1, and wprune = 0.9. The pruning and generation thresholds are of minor importance, determining the overlap of the receptive fields. The choice for the penalty term was calculated to tolerate a maximal bias of 0.1 (Schaal & Atkeson, 1997). The default value for the decomposed distance metric was determined manually such that an initial receptive field covered a significant portion of the input space. Ridge regression parameters did not play any role in this example and were omitted. A first qualitative evaluation of Figure 4 confirms that RFWR fulfills our expectations. The initially large receptive fields (see Figure 4c) adjust during learning according to the local curvature of the function. They become narrow and elongated in the region of the ridges, and they remain large in the flat parts of the function (see Figure 4d). The number of the receptive fields increased from 16 after one training epoch to 48, and the final approximation result was nMSE = 0.02.
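The data generation and the error measure for this experiment are easy to reproduce (our sketch, reading the "unit square" as [−1, 1]², consistent with Figure 4, and N(0, 0.01) as mean and variance; the learner being tested supplies the predictions passed to nmse):

import numpy as np

def target(x, y):
    # Equation 4.1 without the noise term.
    return np.maximum.reduce([np.exp(-10 * x ** 2),
                              np.exp(-50 * y ** 2),
                              1.25 * np.exp(-5 * (x ** 2 + y ** 2))])

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, (500, 2))
z_train = target(*X_train.T) + rng.normal(0.0, np.sqrt(0.01), 500)

g = np.linspace(-1.0, 1.0, 41)                   # 41 x 41 noiseless test grid
X_test = np.array(np.meshgrid(g, g)).reshape(2, -1).T
z_test = target(*X_test.T)

def nmse(z_pred):
    # MSE on the test set normalized by the output variance of the test set.
    return np.mean((z_pred - z_test) ** 2) / np.var(z_test)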
Figure 4: (a) Target function to be approximated. (b) Approximated function after 50 epochs of training. (c) Receptive fields in input space after one epoch, given by contour lines of 0.1 isoactivation and a ⊕ mark for the centers (the training data are displayed by small dots). (d) Receptive fields after 50 epochs of training.
We compared the learning results of RFWR with three other algorithms: standard global linear regression and a sigmoidal three-layer backpropagation neural network as baseline comparisons, and the mixture of experts algorithm as a state-of-the-art comparison (Jacobs et al., 1991; Jordan & Jacobs, 1994; Xu, Jordan, & Hinton, 1995). Standard linear regression cannot accomplish a better result than nMSE = 1.0 on this example; the function has no linear trend in the chosen region of input space. The sigmoidal network was trained by backpropagation with momentum in a variety of configurations using 20 to 100 units in the hidden layer (the output layer had one linear unit). These networks did not accomplish results better than nMSE = 0.1
within 20,000 training epochs. Doubling the number of training samples and reducing the noise level to N(0, 0.0001) finally resulted in nMSE = 0.02 for a 100-hidden-unit net after about 15,000 epochs. By using the cascade correlation algorithm (Fahlman & Lebiere, 1990) to fit our original 500-datapoint training set, we confirmed that the function in equation 4.1 seems to be a difficult learning task for sigmoidal networks: cascade correlation did not converge when confined to using only sigmoidal hidden units, while it achieved good function fitting (nMSE = 0.02) when it was allowed to use gaussian hidden units. A more natural and interesting comparison is with the mixture of experts (ME) system, particularly as suggested in Xu et al. (1995). In Xu et al. (1995), in contrast to the softmax gating network of Jordan and Jacobs (1994), the experts use a mixture of gaussians as the gating network, and both the gating net and the locally linear models in each leaf of the gating net can be updated by an analytical version of the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). Thus, the basic elements of this form of ME are the same as in RFWR—locally linear models and gaussian receptive fields—while the training methods of the two systems differ significantly—competitive parametric likelihood maximization versus local nonparametric learning. Because ME does not add resources, the performance-determining parameters are how many experts are allocated and how the system is initialized. The algorithm was tested with 25, 50, 75, and 100 experts. Initially, the experts were uniformly distributed in the input space with an initial covariance matrix of the gaussians comparable to the initialization of RFWR's distance metric. We conducted a similar test with RFWR, setting its determining parameter, the penalty γ, to 10^{-6}, 10^{-7}, 10^{-8}, and 10^{-10}. Figure 5 summarizes the results. Each learning curve is the average of 10 learning trials for each condition of the corresponding algorithm; the training data were randomly generated for each trial. Both algorithms achieve an nMSE = 0.12 after only one training epoch—a typical signature of the fast recursive least squares updating of the linear models employed by both algorithms—which is about what the sigmoidal neural network had achieved after 10,000 to 20,000 epochs. Both algorithms converge after about 100 epochs. By adding more experts, the mixture of experts improves its performance to a best average value of nMSE = 0.04 with a slight trend to overfitting for 75 experts. RFWR consistently accomplishes a result of nMSE = 0.02 for all but the γ = 10^{-6} runs, with a slight tendency to overfitting for γ = 10^{-10}. One standard deviation error bars are indicated by the black bars at the beginning and end of each learning curve. It was surprising that ME did not achieve the same ultimate fit accuracy as RFWR. This behavior was due to the relatively small training set, the relatively low signal-to-noise ratio of the training data, and the way the gating network assigns training samples to each expert. By significantly increasing the amount of training data and/or lowering the noise, the results
Figure 5: Average learning curves (solid lines) for (a) ME and (b) RFWR. The black bars indicate one standard deviation error bars at the beginning and end of learning; for overlapping traces having approximately the same standard deviation, only one bar is shown. For RFWR (b), the increase of the number of receptive fields over time (dashed lines) is indicated as well.
of both algorithms become indistinguishable. It seems to be the method of credit assignment that makes a significant difference. The expectation step in ME uses normalized weights (i.e., posterior probabilities) to assign training data to the experts. Normalized weights create much sharper decision boundaries between the experts than unnormalized weights as in RFWR. Thus, in the case of noise and not too many training data, the ME algorithm tends to establish decision boundaries between the experts that are too sharp and starts fitting noise. Given the underlying assumption of ME that the world was generated by a mixture of linear models, this behavior may be expected. Since in our test cases the world is actually a continuous function and not a mixture of linear models, the assumptions of ME are only an approximation, which explains why the algorithm does not perform entirely appropriately. The assumptions of RFWR are quite different: each receptive field tries to find a region of validity that allows it to approximate the tangent plane in this region with some remaining bias. In the spirit of a low-order Taylor series expansion, this is a reasonable way to proceed. Thus, RFWR achieves consistent results with low variance (see Figure 5b). It is also interesting to see how the number of receptive fields of RFWR grows as a function of the penalty factor (see Figure 5b). As expected from the derivation of the cost function (see equation 3.9), a very small penalty parameter causes the receptive fields to keep on shrinking and entails a continuous growth of the number of receptive fields. Nevertheless, the tendency toward overfitting remained low, as can be seen in the γ = 10^{-10} traces in Figure 5b. When continuing learning until 10,000 epochs, the nMSE saturated close to the
current values for all penalty factors. The local cross-validation term (equation 3.9) is responsible for this desirable behavior. When cross-validation was not used, overfitting was significantly more pronounced and the nMSE continued increasing for very small penalty factors. 4.2 Dealing with Irrelevant Inputs. In order to establish the usefulness of the ridge regression parameters, we conducted a further comparison with the mixture of experts. In sensorimotor control, it is unlikely that all variables given to the learning system are equally relevant to the task. Possible kinds of extraneous inputs include constant inputs, changing inputs that are meaningless, and copies and linear combinations of other inputs. Ideally, one would like an autonomous learning system to be robust toward such signals. To explore the behavior of ME and RFWR in such cases, three additional inputs were added to the function (see equation 4.1): one almost constant input of N(0.1, 0.001), one input with a Brownian walk in the interval [−0.1, 0.1], and one input that was a copy of x with added gaussian noise N(0, 0.0025). Otherwise training data were generated uniformly by the function (in equation 4.1), but with reduced additive noise of N(0, 0.0025) to improve the signal-to-noise ratio. For these tests, the ridge regression coefficients were initialized to 0.25 for each input. Figure 6 summarizes the average results of 10 trials for each algorithm. In Figure 6a, we show the mean nMSE and its standard deviation on two test sets. In Test1, the predictions were generated by using only the regression coefficients of the relevant inputs, β0, β1, β2, on the same 1681-point test set as in the experiment of section 4.1. This was to establish whether these coefficients adjusted correctly to model the target function. Both algorithms achieved good learning results on this test (see Figure 6a). In Test2, we probed the robustness of the learned model toward the irrelevant inputs. We added the noisy constant, the Brownian, and the noisy x-copy input to the test set, but we also added an offset of 0.1 to each of these signals. If the algorithm learned that these inputs were irrelevant, this change should not matter. If the irrelevant inputs were mistakenly employed as signals to improve the nMSE on the training data, the predictions should deteriorate. Figure 6a demonstrates that the results of RFWR remained virtually unaltered by this test, while those of ME became significantly worse. This outcome can be explained by looking at the standard deviations of the regression coefficients of all the locally linear models (see Figure 6b). In contrast to ME, RFWR set the regression coefficients of the irrelevant inputs (β3, β4, β5) very close to zero, thus achieving the desired robustness. Such behavior was due to an adjustment of the corresponding ridge regression parameters: they increased for the irrelevant inputs and decreased to zero for the relevant inputs. We should point out that ME was not designed to deal with learning problems with irrelevant inputs and that there are ways to improve its performance in such cases. Nevertheless, this experiment clearly illustrates that it is necessary to deal with the problem of irrelevant
Figure 6: (a) Average nMSE of ME and RFWR after 1000 training epochs (see the text for explanation). (b) Mean and standard deviation of the regression coefficients of the irrelevant inputs.
inputs and that local bias adjustment by means of ridge regression is one possible way to do so. 4.3 Shifting Input Distributions. It is easy to conceive of learning tasks where the input distribution of the training data changes over time. To test RFWR's performance on such problems, we designed the following experiment. In three sequential episodes, training data for learning (see equation 4.1) were uniformly drawn from three slightly overlapping input regions in the unit cube: T1 = {(x, y, z) | −1.0 < x < −0.2}, T2 = {(x, y, z) | −0.4 < x < 0.4}, and T3 = {(x, y, z) | 0.2 < x < 1.0}. First, the algorithm was trained on T1 for 50,000 points and tested on T1, then trained on T2 for 50,000 points and tested on T1 and T2, and finally trained on T3 for 50,000 points and tested on test data from all regions. Figure 7 gives an example of how learning proceeded. This test probes how much of the previously learned competence is forgotten when the input distribution shifts. All parameters of RFWR were chosen as in section 4.1. Because the ME algorithm is not constructive and thus not well suited for learning with strongly shifting input distributions, we chose the resource allocating network (RAN) of Platt (1991) for a comparison, a learning algorithm that is constructive, has no competitive learning component, and has inspired a variety of other algorithms. RAN is a radial basis function (RBF) network that adds RBFs at the site of a training sample according to two criteria: when the approximation error on the training sample is too large, and when no RBF is activated by the training sample by more than a threshold ξ. Both criteria have to be fulfilled simultaneously to create a new RBF. The spherical width of the RBF is chosen according to its distance to the nearest
Constructive Incremental Learning
2069
Figure 7: RFWR reconstructed function after training on (a) T1 , (b) then T2 , (c) and finally T3 .
neighboring RBF. By using gradient descent with momentum, the RBF centers are adjusted to reduce the mean squared approximation error, as are the weights of the linear regression network in the second layer of the RBF net. The strategy of RAN is to start initially with very wide RBFs and increase the threshold ξ over time until a prechosen upper limit is reached, causing the creation of ever-smaller RBFs at sites with large error. As in RFWR, we used gaussians (see equation 3.2) as the parametric structure for the RBF. Figure 8 summarizes the average of 10 learning trials for each algorithm. RFWR shows large robustness toward the shift of input distribution: there is only a minor increase of nMSE due to interference in the overlapping parts of the training data (see Figure 7). In contrast, as can be seen in the “original RAN” trace of Figure 8a, RAN significantly increases the nMSE during the second and third training episodes. Since RAN starts out with initial RBFs that cover the entire input space, interference is not properly localized, which explains the observed behavior. Note that we already have excluded the constant term in the linear regression layer of RAN (Platt, 1991), a term that is globally active and would decrease the performance in Figure 8 significantly. From the experience with RFWR, three possible improvements of RAN come to mind. First, instead of starting with very large RBFs initially, we can limit the maximal initial size as in RFWR to Mdef . Second, we can employ the hyper RBF technique of Poggio and Girosi (1990) to adjust the shape M of the RBFs by gradient descent as in RFWR (Furlanello, Giuliani, & Trentin, 1995). And third, instead of having the time-varying threshold ξ a global variable, we can define it as an individual variable for each RBF, thus removing the explicit dependency on global training time. By initial-
Stefan Schaal and Christopher G. Atkeson
150 0.5
0.5
150
Original RAN
nMSE
0.4
0.4
Modified RAN 100
0.3 0.2
50
0.1
100
0.3 0.2
50
0.1
0 0
a)
RFWR
50 100 #Training Iterations (*1000)
0 150
0 0
50
100
#Receptive Fields
2070
0 150
b)
Figure 8: Average learning curves (solid lines) and average number of receptive field/radial basis functions (dashed lines) for (a) RAN and (b) RFWR. The black bars give the one standard deviation at the beginning and the end of learning.
izing RAN with Mdef = 5 I as in RFWR, these modifications resulted in a significant improvement of robustness of RAN, as shown in Figure 8a. This version of RAN requires only half as many RBFs, converges more quickly, and achieves very low final approximation errors. As in RFWR, localizing the learning parameters leads to a clear improvement of robustness of incremental learning. 4.4 Sensorimotor Learning. As a last evaluation, we use a traditional example of sensorimotor learning, the approximation of the inverse dynamics of a two-joint arm (Atkeson, 1989a). The configuration of the arm is given by two joint angles, θ1 and θ2 (see Figure 9a). The inverse dynamics model is the map from the two joint angles, two joint velocities, and two joint accelerations to the corresponding torques (we assume that the arm controller makes use of a low-gain-feedback proportional integral derivative (PID) controller whose performance is enhanced by feedforward commands from the learned inverse dynamics; An, Atkeson, & Hollerbach, 1988). The torques for the shoulder and elbow joints are learned by separate networks because there is no reason to believe that a receptive field for the elbow torque should have the same shape as for the shoulder torque. For RFWR this would mean that both outputs have the same Hessian, which is definitely not the case. The task goal is to draw a figure 8 in two parts of the work space. Figure 9a shows the desired and the initial performance without the learned commands. Training proceeded in two steps. First, the arm performed sinusoidal movements with varying frequency content in the area of the upper 8. A total of 45,000 training points, sampled at 100 Hz, was used for training. Each training sample was used only once in the sequential order in which it was generated. The learning results are shown in
Constructive Incremental Learning
2071
0.5 Before Learning
RAN
RFWR
0.4 Gravity
0.3 0.2
Desired
0.1
After Learning 0
Desired θ1
-0.1 -0.2
After Learning Interference Test
θ2
-0.3 -0.4
(a)
-0.5 0
(b)
(c)
0.1 0.2 0.3 0.4 0.5
Figure 9: (a) Initial performance of the two-joint arm when drawing the figure 8 without feedforward control signals. (b) Performance of RFWR after learning. (c) Performance of RAN after learning.
the top part of Figure 9b for RFWR and Figure 9c for the modified RAN. Both algorithms were able to track the figure 8 properly. Next, the algorithms were trained in an analogous fashion on 45,000 samples around the lower figure 8. The bottom parts of Figures 9b and 9c show the corresponding good learning results. However, when returning to performing the upper figure 8, RAN showed significant interference (the dashed line in Figure 9c), although both algorithms were initiated with the same Mdef = 6.0 I. (Note that position, velocity, and acceleration inputs were normalized prior to learning to compensate for the differences in units.) This interference effect highlights the difference between the learning strategy of RBF networks in comparison to the nonparametric statistics approach to modeling with locally linear models. RBF networks need a sufficient overlap of the RBFs to achieve good learning results; one RBF by itself has only limited function approximation capabilities, an effect discussed in the context of hyperacuity (Churchland & Sejnowski, 1992). Gradient descent on the shape parameter M of the gaussian RBFs quickly decreased M in our example to achieve an appropriately large overlap. This overlap, however, encourages negative interference, as is evident in Figure 9c. The six-dimensional input space of this example emphasized the need for large overlap, while the two-dimensional example of the previous section did not. Experiments that used a fixed M as in the original RAN algorithm did not achieve better learning results within a reasonable training time. To avoid interference, there is always the unattractive solution of adding thousands
2072
Stefan Schaal and Christopher G. Atkeson
of quite narrow overlapping RBFs. In the results of Figure 9, both algorithms allocated fewer than 100 receptive fields. 5 Related Work The field that contributes the most to the development of RFWR is nonparametric statistics. Cleveland (1979) introduced the idea of employing locally linear models for memory-based function approximation, called locally weighted regression (LWR). In a series of articles, he and his colleagues extended the statistical framework of LWR to include multidimensional function approximation and local approximation techniques with higherorder polynomials (Cleveland, Devlin, & Grosse, 1988; Cleveland & Devlin, 1988). Cleveland and Loader (1995) suggested local Cp -tests and local PRESS for choosing the degree of local mixing of different order polynomials as well as local bandwidth adjustment and reviewed a large body of literature on the history of LWR. Hastie and Tibshirani (1990, 1994) give related overviews of nonparametric regression methods. Hastie and Loader (1993) discuss the usefulness of local polynomial regression and show that locally linear and locally quadratic function fitting have appealing properties in terms of the bias-variance trade-off. Friedman (1984) proposed a variable bandwidth smoother for one-dimensional regression problems. Using different statistical techniques, Fan and Gijbels (1992, 1995) suggested several adaptive bandwidth smoothers for LWR and provided detailed analyses of the asymptotic properties of their algorithms. For the purpose of time-series prediction, LWR was first used by Farmer and Sidorowich (1987, 1988). Atkeson (1989b) introduced the LWR framework for supervised learning in robot control. Moore (1991) employed LWR for learning control based on learning forward models. In the context of learning complex manipulation tasks with a robot, Schaal and Atkeson (1994a, 1994b) demonstrated how LWR can be extended to allow for local bandwidth adaptation by employing local cross-validation and local confidence criteria. Schaal and Atkeson (1996) introduced the first non-memorybased version of LWR. Schaal (1997) applied RFWR for value function approximation in reinforcement learning. Locally weighted learning for classification problems can be found, for example, in Lowe (1995). Aha (1997) compiled a series of articles on nonparametric local classification and regression learning, among which Atkeson, Moore, and Schaal (1997a, 1997b) give an extended survey on locally weighted learning and locally weighted learning applied to control. Besides nonparametric statistics, RFWR is related to work on constructive learning algorithms, local function approximation based on RBFs, and Kohonen-like self-organizing maps (SOM). An RBF function approximator with a locally linear model in each RBF was suggested by Millington (1991) for reinforcement learning. Platt (1991) suggested a constructive RBF-based learning system. Furlanello et al. (1995) and Furlanello and Giuliani (1995)
Constructive Incremental Learning
2073
extended Platt’s method by using Poggio and Girosi’s (1990) hyper-RBFs and local principal component analysis. For learning control, Cannon and Slotine (1995) derived a constructive RBF network that used wavelet-like RBFs to adapt to spatial frequency; this is similar to local bandwidth adaptation in nonparametric statistics and the adjustable receptive fields in RFWR. Orr (1995) discussed recursive least squares methods and ridge regression for learning with RBF networks. He also suggests several other methods, including generalized cross-validation, for regularizing ill-conditioned regression. One of the most established constructive learning systems is cascade correlation (Fahlman & Lebiere, 1990), a system sharing ideas with projection pursuit regression (Friedman & Stutzle, ¨ 1981). Related to this line of research is the upstart algorithm of Frean (1990), the SOM-based cascading system of Littmann and Ritter (1993), and the work of Jutten and Chentouf (1995). The first usage of locally linear models for regression problems in the context of SOMs was by Ritter and Schulten (1986), who extended Kohonen maps to fit locally linear models (LLM) within each of the units of the SOM. Related to this work is van der Smagt and Groen’s (1995) algorithm, which extended LLM to a hierarchical approximation in which each Kohonen unit itself can contain another LLM network. Fritzke (1994, 1995) demonstrated how SOMs can constructively add units, in the context of both RBF and LLM regression problems. Bruske and Sommer (1995) combined Fritzke’s ideas with Martinetz and Schulten’s (1994) neural gas algorithm to accomplish a more flexible topographic representation as in the original SOM work. A large body of literature on constructive learning stems from fitting highorder global polynomials to data, for instance, as given in Sanger (1991), Sanger, Sutton, and Matheus (1992), and Shin and Ghosh (1995). Due to the global character of these learning methods, the danger of negative interference is quite large. Additional references on constructive learning for regression can be found in the survey by Kwok and Yeung (1995). The idea of the mixture of experts in Jacobs et al. (1991) and hierarchical mixtures of experts in Jordan and Jacobs (1994) is related to RFWR as the mixture of experts approach looks for similar partitions of the input space, particularly in the version of Xu et al. (1995). Ormoneit and Tresp (1995) suggested methods to improve the generalization of mixture models when fit with the EM algorithm (Dempster et al., 1977) by introducing Bayesian priors. Closely related to the hierarchical mixture of experts are nonparametric decision tree techniques, in which the seminal work of Breiman, Friedman, Olshen, and Stone (1984) introduced classification and regression trees (CART), and Friedman (1991) proposed the MARS algorithm, a CART derivative particularly targeted at smooth function approximation for regression problems. Finally, adaptive receptive fields and the way receptive fields are created in RFWR resemble in part the classification algorithms of Reilly, Cooper, and Elbaum (1982) and Carpenter and Grossberg (1987).
2074
Stefan Schaal and Christopher G. Atkeson
6 Discussion This article emphasizes two major points. First, truly local learning—learning without competition, without gating nets, without global regression on top of the local receptive fields—is a feasible approach to learning; moreover, it can compete with state-of-the-art learning systems. Second, truly incremental learning—learning without knowledge about the input and conditional distributions, learning that must cope with continuously incoming data with many partially redundant and/or partially irrelevant inputs—needs to have a variety of mechanisms to make sure that incremental learning is robust. A carefully designed local learning system can accomplish this robustness. RFWR borrowed in particular from work in nonparametric statistics. Following the definition of H´ajek (1969), the term nonparametric indicates that the function to be modeled potentially consists of very large families of distributions that cannot be indexed by a finite-dimensional parameter vector in a natural way. This view summarizes the basic assumptions of our learning system, with the addition of prior knowledge about smoothness incorporated in a penalty term. If more prior knowledge is available for a particular problem, it should be incorporated in the learning system. It is unlikely that a nonparametric learner outperforms problem-tailored parametric learning (e.g., fitting sinusoidal data with a sinusoid is the best one can do). The examples given throughout this article highlight when local nonparametric learning can be advantageous, but there is no claim that it is generally superior to other learning systems. On the other hand, when it comes to learning without having strong prior knowledge about the problem, nonparametric methods can be quite beneficial. For instance, Quartz and Sejnowski (1997) claim that constructive nonparametric learning might be one of the key issues in understanding the development of the organization of brains. RFWR makes use of several new algorithmic features. We introduced a stochastic approximation to leave-one-out local cross-validation—crossvalidation that does not need a validation set anymore. This technique can potentially be useful for many other domains because it requires only that the (local) parameters to be estimated are linear in the inputs. By employing a novel penalized local cross-validation criterion, we were able to derive locally adaptive multidimensional distance metrics. These distance metrics can be interpreted as local approximations of the Hessians of the function to be modeled. In order to speed up learning of the distance metric, we derived a second-order gradient descent method. Finally, the penalized local crossvalidation criterion could also be employed to achieve automatic local bias adjustment of the relevance of input dimensions, obtained by local ridge regression. Using all these features, the constructive process of RFWR only needs to monitor the activation strength of all receptive fields in order to decide when to create a new receptive field; most constructive learning
Constructive Incremental Learning
2075
systems need to monitor an approximation error criterion as well, which can easily lead to an unfavorable bias-variance trade-off. Several issues have not been addressed in this article and are left to future research. RFWR makes use of gradient-based learning, which requires a proper choice of learning rates. Although we incorporated second-order learning derived from Sutton (1992a, 1992b), it may still be useful to do some experimentation with the choice of the learning rates in order to achieve close to optimal learning speed without entering unstable domains. It is also necessary to choose a roughly appropriate initial distance metric D (cf. equation 3.2), characterizing the initial size of a receptive field. An initial receptive field that is way too large has the danger that the receptive field grows to span the entire input domain: the initial receptive field has to be such that structure in the data cannot be mistaken for high variance noise. As a positive side effect of local learning, however, these open parameters can be explored by allowing just a small number of receptive fields on an initial data set and monitoring their learning behavior; each receptive field learns independently, and there is no need to do parameter exploration with a large number of receptive fields. A last algorithmic point concerns computational complexity. Recursive least squares is an O(n2 ) process (i.e., quadratic in the number of inputs), and the update of a full distance metric is worse than O(n2 ). If the dimensionality of the inputs goes beyond about 10, a learning task with many receptive fields will run fairly slowly on a serial computer. Fitting only diagonal distance metrics alleviates this effect and might be necessary anyway since the number of open parameters in the learning system might become too large compared to the number of training data points. This discussion naturally leads to the longstanding question of how local learning methods can deal with high-dimensional input spaces at all. As nicely described in Scott (1992), the curse of dimensionality has adverse effects on all systems that make use of neighboring points in the Euclidean sense, since the concept of “neighborhood” becomes gradually more counterintuitive when growing beyond 10 input dimensions, and it pretty much vanishes beyond 20 dimensions: every point is about the same distance from every other point. In such domains, the parametric model chosen for learning, be it local or global, becomes the key to success, essentially meaning that any learning system requires strong bias in high-dimensional worlds. However, it remains unclear whether high-dimensional input spaces have locally high-dimensional distributions. Our experience in sensorimotor learning is that this may not be true for many interesting problems, as physical systems do not realize arbitrary distributions. For instance, a sevendegree-of-freedom anthropomorphic robot arm, whose inverse dynamics model requires learning in a 21-dimensional input space, seems to realize locally not more than 4- to 8-dimensional input distributions. In Vijayakumar and Schaal (1997) we incorporated local dimensionality reduction as
2076
Stefan Schaal and Christopher G. Atkeson
a preprocessing step in every receptive field, allowing us to approximate high-dimensional data successfully. As a last point, one might wonder in how far a local learning system like RFWR could have any parallels with neurobiological information processing. Particularly inspired by work on the visual cortex, one of the mainstream assumptions about receptive field–based learning in the brain is that receptive fields are broadly tuned and widely overlapping and that the size of the receptive fields does not seem to be a free parameter in normal learning (as opposed to developmental and reorganizational processes after lesions, e.g., Merzenich, Kaas, Nelson, Sur, & Felleman, 1983). This view emphasizes that accuracy of encoding must be achieved by subsequent postprocessing steps. In contrast, RFWR suggest overlapping but much more finely tuned receptive fields, such that accuracy can be achieved directly by one or several overlapping units. Fine tuning can be achieved not only by a change of the size of the receptive field, but also by “plug-in” approaches, where several receptive fields tuned for different spatial frequencies contribute to learning (Cannon & Slotine, 1995). To distinguish between those two principles, experiments that test for interference and generalization during learning can provide valuable insights into the macroscopic organization of learning. Shadmehr and Mussa-Ivalidi (1994), Imamizu, Uno, and Kawato (1995), and Shadmehr, Brashers-Krug, and Mussa-Ivaldi (1995) provide examples of such investigations in motor control. Whether the learning principles of RFWR are biologically relevant remains speculative. What we have demonstrated, however, is that there are alternative and powerful methods to accomplish incremental constructive learning based on local receptive fields, and it might be interesting to seek cases where such learning systems might be applied. Receptive field–based local learning is an interesting research topic for neural computation, and truly local learning methods are just starting to demonstrate their potential. Appendix A.1 Ridge Regression Derivatives. Each ridge regression parameter can be conceived of as a weighted data point of the form [xr = r2i (0, . . . , 1, 0 . . .)T , yr = 0], which was incorporated in the regression by the recursive least squares update (see equation 3.5). Thus, the derivative of the cost function (see equation 3.9) is a simplified version of the derivative in equation 3.14: 2ri ∂J = n+1 ∂ri W
µ µ ³ ´T ¶ − 2Pn+1 xr yr − xTr β n+1 ⊗ Hn+1 ¶ ³ ´ n+1 T n+1 n+1 . − 2P xr xr P ⊗R
(A.1)
Constructive Incremental Learning
2077
By taking advantage of the many zero elements of the ridge “data points,” the actual computation of this derivative is greatly speeded up. There are several ways to incorporate the update of the ridge regression parameters in the matrix P, and it should be noted that we also need to add back the fraction of the ridge parameters that was forgotten due to the forgetting factor λ in each update of P (see equation 3.5). It turns out that there is a quite efficient way to perform this update. At every update of a receptive field, the forgetting factor effectively reduces the contribution of each ridge parameter by: 1λ,i = (1 − λ) r2i .
(A.2)
The update due to gradient descent is 1grad,i = (ri + 1ri )2 − r2i ,
(A.3)
and the total increment becomes 1i = 1λ,i + 1grad,i = (1 − λ) r2i + (ri + 1ri )2 − r2i = −λr2i + (ri + 1ri )2 .
(A.4)
Due to the fact that the ridge vectors are all unit vectors, it is possible to update P by executing a recursive least squares update for the increment—that is, add a ridge data point of the form [xr = 1i (0, . . . , 1, 0, . . .)T , yr = 0] for every ridge parameter by using equation 3.5. This update can be accelerated by taking into account the zeros in the ridge points. An additional speed-up can be obtained by not updating P at every iteration but accumulating the increments until they exceed a manually chosen threshold. A.2 Second-Order Learning of the Distance Metric. The idea of the incremental delta-bar-delta (IDBD) algorithm (Sutton, 1992a, 1992b) is to replace the learning rate in the gradient descent update (see equation 3.10) by an individual learning rate for each coefficient of M of the following form: ∂J , ∂Mij ³ ´ = exp βijn+1
= Mnij − αijn+1 Mn+1 ij where αijn+1
and
βijn+1 = βijn − θ
∂J n h . ∂Mij ij
(A.5)
Thus, the learning rates αij are changed in geometric steps by gradient descent in the meta parameter βij with meta learning rate θ . The term hij is updated as " #+ 2 ∂J n+1 n+1 ∂ J n − αijn+1 , hij = hij 1 − αij 2 Mij ∂Mij
2078
Stefan Schaal and Christopher G. Atkeson
where we define
[z]+ =
½
z 0
if z > 0 otherwise.
(A.6)
hij is initialized to zero when a receptive field is created. It corresponds to a memory term that stores a decaying trace of the cumulative sum of recent changes to Mij . For more details, see Sutton (1992a, 1992b). In order to apply this second-order update, it is necessary to store the parameters αij , βij , and hij and to compute the second derivative in equation A.6. This second derivative of the cost function (see equation 3.9) with respect to the coefficients of the decomposed distance metric becomes: ∂w X ∂ J1,i w ∂ J2 ∂J ≈ + n+1 ∂M ∂M i=1 ∂w W ∂M p
(A.7)
p p ∂w X ∂ 2 J1,i ∂w w ∂ 2 J2 ∂ 2 w X ∂ J1,i ∂2J + + ≈ ∂M2 ∂M2 i=1 ∂w ∂M i=1 ∂w2 ∂M W n+1 ∂M2
where:
µ ¶ 1 ∂w 2 ∂ 2w = − w(xl − cl )2 , w ∂M ∂M2rl
p X ∂ 2 J1,i i=1
∂w2
¶ n µ X ∂Dij 2 ∂ 2 J2 = 2γ 2Dll + ∂Mrl ∂M2rl i,j=1
eTcv ecv (W n+1 )2 µµ ¶ ¶ I 2 − n+1 − n+1 − 2Pn+1 x˜ x˜ T Pn+1 x˜ (y − x˜ T β n+1 )T ⊗Hn W W ¡ ¢T ¡ ¢ y − x˜ T β n+1 y − x˜ T β n+1 h 2 + n+1 W wµ µ ¶ ³ ´T ¶ 1 T n+1 T n+1 n ˜ ˜ e − 2 P β x y − x e ⊗ H − cv cv (W n+1 )2
≈−
+2
En+1 (W n+1 )3
Equation A.7 makes use of notation and results derived in equations 3.13 and 3.14. Acknowledgments Sethu Vijayakumar’s helpful comments contributed significantly to improve this article. This research was partly funded by the ATR Human Information Processing Research Laboratories. Additional support for S. S.
Constructive Incremental Learning
2079
was provided by the German Research Association, the German Scholarship Foundation, and the Alexander von Humboldt Foundation. Support for C. A. was provided under Air Force Office of Scientific Research grant F49-6209410362 and by a National Science Foundation Presidential Young Investigator Award. References Aha, D. (1997). Lazy learning. Artificial Intelligence Review, 11, 1–5. An, C. H., Atkeson, C. G., & Hollerbach, J. M. (1988). Model-based control of a robot manipulator. Cambridge, MA: MIT Press. Atkeson, C. G., Moore, A. W., & Schaal, S. (1997a). Locally weighted learning. Artificial Intelligence Review, 11, 11–73. Atkeson, C. G., Moore, A. W., & Schaal, S. (1997b). Locally weighted learning for control. Artificial Intelligence Review, 11, 75–113. Atkeson, C. G. (1989a). Learning arm kinematics and dynamics. Annual Review Neuroscience, 12, 157–183. Atkeson, C. G. (1989b). Using local models to control movement. In D. Touretzky, (Ed.), Advances in neural information processing systems, 1 (79–86). San Mateo, CA: Morgan Kaufmann. Atkeson, C. G., & Schaal, S. (1995). Memory-based neural networks for robot learning. Neurocomputing, 9, 243–269. Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley. Bishop, C. M. (1996). Neural networks for pattern recognition. New York: Oxford University Press. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Belmont, CA: Wadsworth International Group. Bruske, J., & Sommer, G. (1995). Dynamic cell structure learns perfectly topology preserving map. Neural Computation, 7, 845–865. Cannon, M., & Slotine, J. E. (1995). Space-frequency localized basis function networks for nonlinear system estimation and control. Neurocomputing, 9, 3, 293–342. Carpenter, G. A., & Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54–115. Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press. Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association, 74, 829–836. Cleveland, W. S., Devlin, S. J., & Grosse, E. (1988). Regression by local fitting: Methods, properties, and computational algorithms. Journal of Econometrics, 37, 87–114. Cleveland, W. S., & Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596–610.
2080
Stefan Schaal and Christopher G. Atkeson
Cleveland, W. S., & Loader, C. (1995). Smoothing by local regression: Principles and methods (Tech. Rep.). Murray Hill, NJ: AT&T Bell Laboratories. Daugman, J., & Downing, C. (1995). Gabor wavelets for statistical pattern recognition. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 414–420). Cambridge, MA: MIT Press. de Boor, C. (1978). A practical guide to splines. New York: Springer-Verlag. Deco, G., & Obradovic, D. (1996). An information-theoretic approach to neural computation. New York: Springer-Verlag. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley. Fahlman, S. E., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2. San Mateo, CA: Morgan Kaufmann. Fan, J., & Gijbels, I. (1992). Variable band-width and local linear regression smoothers. Annals of Statistics, 20, 2008–2036. Fan, J., & Gijbels, I. (1995). Data-driven bandwidth selection in local polynomial fitting: Variable bandwidth and spatial adaptation. Journal of the Royal Statistical Society B, 57, 371–395. Fan, J., & Gijbels, I. (1996). Local polynomical modelling and its applications. London: Chapman & Hall. Farmer, J. D., & Sidorowich, J. J. (1987). Predicting chaotic time series. Phys. Rev. Lett., 59 (8), 845–848. Farmer, J. D., & Sidorowich, J. J. (1988). Exploiting chaos to predict the future and reduce noise. In Y. C. Lee (Ed.), Evolution, learning, and cognition (p. 27). Singapore: World Scientific. Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601. Frean, M. (1990). The upstart algorithm: A method for constructing and training feedforward neural networks. Neural Computation, 2, 198–209. Friedman, J. H. (1984). A variable span smoother (Tech. Rep. No. 5). Stanford: Department of Statistics, Stanford University. Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141. Friedman, J. H., & Stutzle, ¨ W. (1981). Projection pursuit regression. Journal of the American Statistical Association, Theory and Models, 76, 817–823. Fritzke, B. (1994). Growing cell structures—A self-organizing network of unsupervised and supervised learning. Neural Networks, 7, 1441–1460. Fritzke, B. (1995). Incremental learning of locally linear mappings. In Proceedings of the International Conference on Artificial Neural Networks, Paris, October 9–13. Furlanello, C., & Giuliani, D. (1995). Combining local PCA and radial basis function networks for speaker normalization. In F. Girosi, J. Makhoul, E. Manolakas, & E. Wilson (Eds.), Proceedings of the 1995 IEEE Workshop on Neural Networks for Signal Processing V (pp. 233–242). New York: IEEE.
Constructive Incremental Learning
2081
Furlanello, C., Giuliani, D., & Trentin, E. (1995). Connectionist speaker normalization with generalized resource allocating networks. In D. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 867–874). Cambridge, MA: MIT Press. Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58. Georgopoulos, A. P. (1991). Higher order motor control. Annual Review of Neuroscience, 14, 361–377. H´ajek, J. (1969). A course in nonparametric statistics. San Francisco: Holden-Day. Hastie, T., & Loader, C. (1993). Local regression: Automatic kernel carpentry. Statistical Science, 8, 120–143. Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall. Hastie, T. J., & Tibshirani, R. J. (1994). Nonparametric regression and classification: Part I: Nonparametric regression. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications. ASI Proceedings, subseries F, Computer and Systems Sciences. Berlin: Springer-Verlag. Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of of single neurons in the cat’s striate cortex. Journal of Neurophysiology, 148, 574–591. Imamizu, H., Uno, Y., & Kawato, M. (1995). Internal representations of the motor apparatus: Implications from generalization in visuomotor learning. Journal of Experimental Psychology, 21, 1174–1198. Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87. Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214. Jutten, C., & Chentouf, R. (1995). A new scheme for incremental learning. Neural Processing Letters, 2, 1–4. Kwok, T.-Y., & Yeung, D.-Y. (1995). Constructive feedforward neural networks for regression problems: A survey (Tech. Rep. No. HKUST-CS95-43). Clear Water Bay, Kowloon, Hong Kong: Department of Computer Science, Hong Kong University of Science and Technology. Lee, C., Rohrer, W. R., & Sparks, D. L. (1988). Population coding of saccadic eye movement by neurons in the superior colliculus. Nature, 332, 357–360. Littmann, E., & Ritter, H. (1993). Generalization abilities of cascade network architectures. In S. J. Hanson, J. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 188–195). San Mateo, CA: Morgan Kaufmann. Ljung, L., & Soderstr ¨ om, ¨ T. (1986). Theory and practice of recursive identification. Cambridge, MA: MIT Press. Lowe, D. G. (1995). Similarity metric learning for a variable-kernel classifier. Neural Computation, 7, 72–85. Martinetz, T., & Schulten, K. (1994). Topology representing networks. Neural Networks, 7, 507–522.
2082
Stefan Schaal and Christopher G. Atkeson
Merzenich, M. M., Kaas, J. H., Nelson, R. J., Sur, M., & Felleman, D. (1983). Topographic reorganization of somatosensory cortical areas 3b and 1 in adult monkeys following restricted deafferentation. Neuroscience, 8, 33–55. Millington, P. J. (1991). Associative reinforcement learning for optimal control. Master’s thesis, Massachusetts Institute of Technology. Moody, J., & Darken, C. (1988). Learning with localized receptive fields. In D. Touretzky, , G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Summer School (pp. 133–143). San Mateo, CA: Morgan Kaufmann. Moore, A. (1991). Fast, robust adaptive control by learning only forward models. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4. San Mateo, CA: Morgan Kaufmann. Mountcastle, V. B. (1957). Modality and topographic properties of single neurons of cat’s somatic sensory cortex. Journal of Neurophysiology, 20, 408–434. Myers, R. H. (1990). Classical and modern regression with applications. Boston: PWSKENT. Nadaraya, E. A. (1964). On estimating regression. Theor. Prob. Appl., 9, 141–142. Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. Ormoneit, D., & Tresp, V. (1995). Improved Gaussian mixture density estimates using Bayesian penalty terms and network averaging (Tech. Rep. No. FKI-205-95. Munich: Theoretical Computer Science and Foundations of Artificial Intelligence, Technische Universit¨at Munchen. ¨ Orr, M. J. L. (1995). Regularization in the selection of radial basis function centers. Neural Computation, 7, 606–623. Papoulis, A. (1991). Probability, random variables, and stochastic processes. New York: McGraw-Hill. Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone (Ed.), Neural networks for speech and image processing. London: Chapman-Hall. Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213–225. Poggio, R., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982. Powell, M. J. D. (1987). Radial basis functions for multivariate interpolation: A review. In J. C. Mason & M. G. Cox (Eds.), Algorithms for approximation (pp. 143–167). Oxford: Clarendon Press. Quartz, S. R., & Sejnowski, T. J. (1997). The neural basis of cognitive development: A constructivist manifesto. Behavioral and Brain Sciences 20, 537–596. Reilly, D. L., Cooper, L. N., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35–41. Ritter, H., & Schulten, K. (1986). Topology conserving mappings for learning motor tasks. In J. S. Denker (Ed.), Neural networks for computing (pp. 376–380). AIP Conference Proceedings, Snowbird, Utah. Sanger, T. D. (1991). A tree-structured adaptive network for function approximation in high-dimensional spaces. IEEE Transactions on Neural Networks, 2, 285–293.
Constructive Incremental Learning
2083
Sanger, T. D., Sutton, R. S., & Matheus, C. J. (1992). Iterative construction of sparse polynomial approximations. In S. J. Hanson, J. E. Moody, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 1064–1071). San Mateo, CA: Morgan-Kaufmann. Schaal, S. (1997). Learning from demonstration. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 1040– 1046). Cambridge, MA: MIT Press. Schaal, S., & Atkeson, C. G. (1994a). Robot juggling: An implementation of memory-based learning. Control Systems Magazine, 14, 57–71. Schaal, S., & Atkeson, C. G. (1994b). Assessing the quality of learned local models. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 160–167). San Mateo, CA: Morgan Kaufmann. Schaal, S., & Atkeson, C. G. (1996). From isolation to cooperation: An alternative of a system of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 605–611). Cambridge, MA: MIT Press. Schaal, S., & Atkeson, C. G. (1997). Receptive field weighted regression (Tech. Rep. No. TR-H-209). Kyoto: ATR Human Information Processing Laboratories. Scott, D. W. (1992). Multivariate density estimation. New York: Wiley. Shadmehr, R., Brashers-Krug, T., & Mussa-Ivaldi, F. A. (1995). Interference in learning internal models of inverse dynamics in humans. In G. Tesauro, D. S. Touretzky, & K. T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1117–1124). San Mateo, CA: Morgan Kaufmann. Shadmehr, R., & Mussa-Ivaldi, F. A. (1994). Adaptive representation of dynamics during learning of a motor task. Journal of Neuroscience, 14, 3208–3224. Shin, Y., & Ghosh, J. (1995). Ridge polynomial networks. IEEE Transactions on Neural Networks, 6, 610–622. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictors. Journal of the Royal Statistical Society, B36, 111–147. Sutton, R. S. (1992a). Gain adaptation beats least squares. In Proceedings of Seventh Yale Workshop on Adaptive and Learning Systems (pp. 161–166). New Haven, CT. Sutton, R. S. (1992b). Adapting bias by gradient descent: An incremental version of Delta-Bar-Delta. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 171–176). Cambridge, MA: MIT Press. van der Smagt, P., & Groen, F. (1995). Approximation with neural networks: Between local and global approximation. In Proceedings of the 1995 International Conference on Neural Networks, 2 (pp. 1060–1064). Perth, Australia. Vijayakumar, S., & Schaal, S. (1997). Local dimensionality reduction for locally weighted learning. In IEEE International Symposium on Computational Intelligence in Robotics and Automation (pp. 220–225). Monterey, CA, July 10–11. Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics. Wahba, G., & Wold, S. (1975). A completely automatic french curve: Fitting spline functions by cross-validation. Communications in Statistics, 4 (1), 1–17. Watson, G. S. (1964). Smooth regression analysis. Sankhaya: The Indian Journal of Statistics A, 26, 359–372.
2084
Stefan Schaal and Christopher G. Atkeson
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixture of experts. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems (pp. 633–640). Cambridge, MA: MIT Press. Received May 5, 1997; accepted February 13, 1998.
LETTER
Communicated by Anthony Bell
Information Maximization and Independent Component Analysis: Is There a Difference? D. Obradovic G. Deco Siemens AG, Central Technology Department, Information and Communications, 81739 Munich, Germany
This article provides a detailed and rigorous analysis of the two commonly used methods for redundancy reduction: linear independent component analysis (ICA) posed as a direct minimization of a suitably chosen redundancy measure and information maximization (InfoMax) of a continuous stochastic signal transmitted through an appropriate nonlinear network. The article shows analytically that ICA based on the KullbackLeibler information as a redundancy measure and InfoMax lead to the same solution if the parameterization of the output nonlinear functions in the latter method is sufficiently rich. Furthermore, this work discusses the alternative redundancy measures not based on the Kullback-Leibler information distance. The practical issues of applying ICA and InfoMax are also discussed and illustrated on the problem of extracting statistically independent factors from a linear, pixel-by-pixel mixture of images. 1 Introduction The pioneering work of Zipf (1949) and the ideas of Attneave (1954) about information processing in visual perception have led to the idea that the nervous system and the brain may be regulated by an economy principle. In the neural network society, these ideas were introduced by Barlow (1989). In this important work, the author presented the connectionist model of unsupervised learning under the perspective of redundancy reduction. The minimum entropy coding method was introduced for the generation of factorial codes (Barlow, Kaushal, & Mitchison, 1989). Atick and Redlich (1990) demonstrated that statistically salient input features can be optimally extracted from a noisy input by maximizing mutual information. Simultaneously, Atick and Redlich (1992) and especially Redlich (1993a,b) concentrate on the original idea of feature extraction by redundancy reduction. Several neural network learning algorithms for principal component analysis (PCA) are presented, among others, in Deco and Obradovic (1994b) and Obradovic and Deco (1994). The problem of linear independent component analysis (ICA) as linear feature extraction was introduced by Comon (1994) and further extended in the linear and defined in the nonlinear case by Deco and c 1998 Massachusetts Institute of Technology Neural Computation 10, 2085–2101 (1998) °
2086
D. Obradovic and G. Deco
Obradovic (1994b, 1995b), Deco and Brauer (1994a, 1995a), Parra, Deco, and Miesbach (1995), and Obradovic and Deco (1994, 1995, 1996). In parallel, Bell and Sejnowski (1995) have demonstrated that their InfoMax method can also achieve linear feature extraction. This article provides a detailed and rigorous analysis of the two methods and derives conditions under which these methods lead to an identical solution. In addition, we address the cumulant-based criteria for ICA as well as nonlinear ICA. The different methods are illustrated on extracting independent components from a linear mixture of images. 2 Linear Independent Component Analysis and Information Maximization *
Let x be a random vector of dimension n with the joint probability den*
sity function p( x ) whose covariance matrix is nonsingular. Furthermore, let *
*
M be a linear square map that maps x into the random vector y whose *
probability density function is p( y ). *
Definition: ICA. Linear ICA is an input-output linear transformation M from x *
to y such that the output components with joint probability: *
p( y ) = p(y1 , . . . , yn );
*
*
y= M x
(2.1)
are as independent as possible according to the appropriate measure, distance D: Ã
*
D p( y ),
n Y
! p(yi ) .
(2.2)
1
In the special case where the complete independence of the output components is achieved, the following holds: p(y1 , . . . , yn ) = p(y1 ), . . . , p(yn )
(2.3)
and, consequently, D = 0. *
If the input vector x is jointly gaussian, ICA is equivalent to the problem of diagonalizing the output covariance matrix Qy which is the standard PCA problem when the map is restricted to a rotation. In order to guarantee the existence of the solution for the ICA problem, we assume that the input *
signal x was originally obtained by the invertible linear mixture of the statistically independent signals z1 , . . . , zn .
Information Maximization and Independent Component Analysis
2087
Definition: Information maximization. Let the above-defined random vector *
x be transmitted through a combination of a matrix M and n nonlinear functions *
fi ; i = 1: n such that the resulting components of the output vector w are defined as: *
*
y= M x
wi = fi (yi )
(2.4)
Under the assumption that each nonlinear function fi is differentiable and that its derivative fi0 satisfies Z
∞ −∞
fi0 dyi = 1
(2.5)
the information-maximization problem is defined as maximization of the entropy Z
*
H(w) = −
*
*
*
d w p(w) log(p(w))
(2.6)
over the elements of matrix M and, possibly, the free parameters in the parameterization of fi . Typical choices for fi are single or normalized sums of sigmoid functions. At first glance, the ICA and InfoMax problems seem to be substantially different. Nevertheless, it is known that the information maximization leads to the statistical factorization of the output components wi —it essentially performs the same task as ICA (Bell & Sejnowski, 1995). In the remaining part of this section we give a rigorous proof that these two problems are identical when the Kullback-Leibler information is used as a measure D of the statistical independence in ICA and when the derivatives fi0 are capable of approximating output marginal distributions with infinite precision. The Kullback-Leibler distance DKL between the joint and the marginal probabilities is defined as: (
*
DKL = K p( y ),
Y
) p(yi ) =
i
Z
* y) p( ≥0 d y p( y ) log Q i p(yi )
*
*
(2.7)
or equivalently: (
*
K p( y ),
Y i
) p(yi ) =
n X
*
H(yi ) − H( y ).
(2.8)
i=1
Equation 2.8 indicates that the Kullback-Leibler distance is the mutual information between the output components yi .
2088
D. Obradovic and G. Deco
The relationship between the input and output joint probabilities of an invertible differentiable map g is equal to *
*
p(in) , p(out) = | det(J)|
(2.9)
where J is the Jacobian matrix of g. Consequently, equations 2.7 through 2.9 can be combined in the following way: Ã ! ) Z ( * * Y * * p( x ) ≥0 p(yi ) = d x p( x ) log Q K p( y ), i p(yi ) · | det(M)| i *
= − H(p( x )) ! Ã Z Y * * p(yi ) · | det(M)| ≥ 0 (2.10) − d x p( x ) log i *
Since the input entropy H(p( x )) is independent of the input-output transfor* Q mation, the minimization of K{p( y ), i p(yi )} is equivalent to maximization R * * Q of d x p( x ) log( i p(yQ i ) · | det(M)|), that is, to the maximum likelihood expectation (MLE) of log( i p(yi ) · | det(M)|). In general, the analytical expression for the marginal probabilities p(yi ) are not known, and their estimates ˆ i ) have to be obtained from the data for every change of the matrix M. p(y Similarly, in the information-maximization problem, the output joint en*
tropy H(w) is equal to:
( ) * p( y ) * Y 0 = −K p( y ), fi (yi ) H(w) = − d y p( y ) log Q 0 i fi (yi ) i à ! Z Y * * * 0 d x p( x ) log fi (yi ) · | det(M)| ≥ 0 (2.11) = H(p( x )) + *
Z
*
*
i
Q
or, equivalently, to the MLE of log( i fi0 (yi ) · | det(M)|). The equivalent result was obtained by Nadal and Parga (1994). (The same authors have an interesting article discussing the cumulant-based ICA approaches: Nadal & Parga, 1997.) Hence, the ICA with the Kullback-Leibler information measure and the maximum information transfer as defined in this article are posed as: ) ( * Y p(yi ) ICA ⇒ min K p( y ), ( ⇒ MLE log
Ã
i
Y i
!)
ˆ i ) · | det(M)| p(y
Information Maximization and Independent Component Analysis
( INFOMAX ⇒ min
*
K p( y ), (
⇒ MLE log
Ã
Y
) fi0 (yi )
i
Y
2089
!)
fi0 (yi )
· | det(M)|
(2.12)
i
The initial parameterization of the derivatives fi0 (yi ) in equation 2.12 has a possible interpretation as the prior on the estimation of the actual ˆ i) marginal densities p(yi ). Both methods require parameterization of p(y and fi0 (yi ). Hence, the problem statements in equation 2.12 can be used to derive conditions for the equivalence of the solutions of ICA and InfoMax. *
Lemma. For a given input distribution p( x ), the ICA and InfoMax problems achieve the same degree of statistical independence if the derivatives fi0 (yi ) can be ˆ i ). parameterized in the form of the marginal distribution estimates p(y The proof is straightforward since it requires that the parameterization of *
p( x ) and fi0 (yi ) be identical. This can be illustrated by an example. Example. The marginal probabilities p(yi ) have to be estimated from the data. A typical way of doing that is to estimate elements of a probability density function expansion up to the desired order. Let us use the first ˆ i) element of the Edgeworth expansion (Deco & Obradovic, 1996); let p(y have the form of a gaussian whose mean and standard deviation σi is equal to those of the actual marginal distribution p(yi ). Without a loss of generality, *
let us assume that the input distribution p( x ) is zero mean. In addition, let us parameterize the derivatives fi0 (yi ) as zero-mean gaussian distributions whose standard deviations ri are optimization parameters. Hence, it is easy to see that the MLE problems in equation 2.12 become o n X ICA ⇒ MLE − [log(σi )] + log(| det(M)|) σi = hy2i i ¸ ¾ ½ X· y2i INFOMAX ⇒ MLE − log(ri ) + 2 +log(| det(M)|) . 2ri
(2.13)
The resulting ICA problem is nothing more than the covariance matrix diagonalization (Deco & Obradovic, 1996), where the optimization is performed over the elements of the matrix M. In the case of InfoMax, the unknown parameters are not only the elements of M but also the gaussian parameters ri . It is easy to see that the optimal value of ri for every fixed matrix M is the actual standard deviation σi and therefore that the solution of the InfoMax problem will also result in the covariance matrix diagonalization.
2090
D. Obradovic and G. Deco
In practice, it is required that the solutions of both methods are unique modulo transformations that preserve statistical independence such as the component order permutation and diagonal scaling. The uniqueness is *
achieved if the number of gaussian components of p( x ) does not exceed one. In the case of multiple gaussian distributions, it is well known that there is an infinite number of matrix transformations that diagonalize the covariance matrix. Hence, the ICA and InfoMax algorithms will have unique *
solutions only if the original signal z did not have more than one gaussian component. In addition, there can be problems concerning the scaling of the elements of the matrix M. Hence, it is our experience that imposing the condition det(M) = 1
(2.14)
makes the optimization numerically stable and avoids possible scaling problems. Different parameterizations of M such that the condition in equation 2.14 holds can be found in Deco and Obradovic (1996). 3 Alternative Redundancy Measures In the previous section, we demonstrated that ICA and InfoMax are identical when the redundancy measure in ICA is the Kullback-Leibler information distance and when sufficient freedom is given to the marginal output probability modeling and estimation. Nevertheless, there are other measures that are easy to implement, especially in the case of a linear mixing with a matrix M. This section briefly reviews ICA based on the properties of *
cumulant expansion of the joint probability density function p( y ). The detailed derivation and analysis of the cumulant-based ICA can be found in Obradovic and Deco (1995) and Deco and Obradovic (1996). The cumulant-based criterion for ICA is derived by comparison of the *
cumulant expansion of the joint probability density p( y ) and of the product of the marginal output probabilities p(yi ). The complete factorization is achieved if both expansions are the same, i.e. if the nondiagonal coefficients *
in the higher-order cumulants of p( y ) take desired values (usually zero) imposed by the statistical independence of p(yi ). Hence, the cumulant ICA can be defined as distance Dcum between the elements of the cumulant expansions of the joint probability and the product of the univariate probabilities. The distance between the cumulant elements is usually defined as a squared difference. Since the cumulant expansion of an arbitrary distribution has an infinite number of elements, for practical purposes only cumulants up to the order four are considered. Hence, the resulting ICA cumulants-based
Information Maximization and Independent Component Analysis
2091
criterion has the following form:
Dcum ≈ J(M) =
4 X X i=1 nondiag
(i) 2 [C(i) nondiag − Cnondiag − desired ] ,
(3.1)
(i) where i defines the cumulant order, and where C(i) nondiag and Cnondiag − desired are the nondiagonal cumulant coefficients and their desired values for a given *
cumulant order i of the joint probability density function p( y ). In general, the desired coefficients C(i) nondiag − desired are equal to zero. For every change of the matrix M, the nondiagonal coefficients are estimated and the cost J(M) further minimized. The cumulant-based ICA criterion can be simplified further by using the properties of cumulant expansion of the data subjected to rotation matrix transformation (Obradovic & Deco, 1995). It is also possible to design alternative cost functions to the one in equation 3.1 by weighting the contributions from the different cumulant orders differently or by using some other norm instead of the two-norm used here. Nevertheless, since our goal is statistical independence at all cumulant levels of interest, we have weighted contributions from different cumulant orders equally. In addition, the two-norm is chosen since it is continuously differentiable and enables derivation of efficient algorithms, as we will show in equation 3.6. All linear input-output transformations M that result in statistical inde*
*
pendence of output components of y = M · x must have the following structure (Obradovic & Deco, 1995): · N, M = P · D · R · D−0.5 1
(3.2)
where P is a permutation matrix, D is an invertible diagonal scaling matrix, and R is a rotation matrix. In addition, N is the invertible matrix, usually orthogonal, which diagonalizes the covariance matrix of the input variable *
x: N · Qx · NT = D1 ,
(3.3)
where D1 is a nonsingular diagonal matrix. The parameterization of the matrix M in equation 3.2 is based on the observation that the statistical independence implies diagonal structure of the covariance matrix Qx . Hence, the search for the matrix M that achieves statistical independence can be posed as a search for a matrix R that preserves the diagonal structure of Qx modulo permutation P and invertible diagonal scaling D. It is not difficult to show that R is a rotation matrix (see equation 3.2). . Hence, when the orig*
· N, the statistical inal input x is preprocessed, that is, multiplied by D−0.5 1 independence of the resulting signals is achieved through a further rotation transformation R.
2092
D. Obradovic and G. Deco
The problem transformation from a search for an arbitrary invertible matrix M to a search for a rotation matrix R brings additional advantages since there are cumulant properties that are invariant to a rotation transformation. Finally, the cumulants of multidimensional zero-mean distribution are tensors, and therefore they satisfy the tensor property stating that the sum of squared diagonal elements remains constant with respect to the orthogonal transformation (Comon, 1994). In the case of a second-order cumulant, the resulting property corresponds to the invariance of the Frobenius matrix norm with respect to orthogonal transformations. Consequently, it is obvious that maximization of the sum of squared diagonal elements of the multidimensional cumulants results in the minimization of their offdiagonal elements, that is, the reduction of statistical dependence of the output components. Now, the cost function for ICA can be posed as the optimization problem with regard to the matrix R (Comon, 1994):
min Jr (R) = min * y1
n X i=1
=R·
*
−[Kir (y1 )]2 ; D−0.5 1
RT = R−1
(3.4)
*
· N· x
where $K_i^r$ stands for the cumulant of order r of the ith component of the vector output variable $\vec{y}_1$. Hence, the sum of squared cumulants of individual output components $y_1(i)$ is equal to the sum of squared diagonal elements of the tensor cumulant of the multidimensional output variable $\vec{y}$. The only remaining problem is the parameterization of a rotation matrix R. A suitable parameterization of rotation matrices is obtained by the Cayley transform (Koecher, 1983),

$$R = (I + A)^{-1}(I - A); \qquad A^T = -A, \qquad (3.5)$$
where A is a skew-symmetric matrix whose number of independent parameters is equal to 0.5n(n − 1). This parameterization covers all rotation matrices R with the property that (I + R) is nonsingular, that is, no eigenvalue of R is equal to −1. The latter condition represents no restriction in our case since there always exists a diagonal unitary matrix that can make the eigenvalues different from −1. This novel parameterization enables relatively simple use of the cumulant criteria defined in equation 3.4 and makes them easily applicable when the number of sources is greater than two, which was not the case for the Comon (1994) approach. Finally, linear ICA can be posed as the following search for a skew-symmetric matrix A that implies statistical independence to a cumulant
order r:

$$\min \sum_{r>2} J_r(R) = \min \sum_{r>2} \sum_{i=1}^{n} -[K_i^r(\vec{y})]^2; \qquad R^T = R^{-1} = (I - A)^{-1}(I + A), \qquad \vec{y} = R \cdot D_1^{-0.5} \cdot N \cdot \vec{x}. \qquad (3.6)$$
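As an illustration of equations 3.5 and 3.6, the following sketch (ours; the function names are hypothetical) builds R from the 0.5n(n − 1) free parameters via the Cayley transform and evaluates the summed third- and fourth-order cumulant cost on whitened signals:

```python
import numpy as np

def cayley(params, n):
    """Rotation matrix R = (I + A)^{-1}(I - A) built from the
    0.5*n*(n-1) free parameters of a skew-symmetric matrix A."""
    A = np.zeros((n, n))
    iu = np.triu_indices(n, k=1)
    A[iu] = params
    A -= A.T                                 # enforce A^T = -A
    I = np.eye(n)
    return np.linalg.solve(I + A, I - A)     # (I + A)^{-1}(I - A)

def cumulant_cost(params, z):
    """Cost of equation 3.6 on whitened signals z (n x samples):
    sum over r in {3, 4} of -[K_i^r(y_i)]^2 for each output i."""
    n = z.shape[0]
    y = cayley(params, n) @ z
    yc = y - y.mean(axis=1, keepdims=True)
    k3 = (yc ** 3).mean(axis=1)                                     # third-order cumulants
    k4 = (yc ** 4).mean(axis=1) - 3 * (yc ** 2).mean(axis=1) ** 2   # fourth-order cumulants
    return -(k3 ** 2).sum() - (k4 ** 2).sum()
```

The returned cost can be handed to any gradient-based or quasi-Newton optimizer, which matches the procedure used in the next section.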
Since the cost function $J_r(R)$ is negative for every r, we can simultaneously minimize the redundancy at different cumulant orders by summing the cost functions for the desired values of r. Hence, not only is the cost function simplified by concentrating only on the diagonal cumulant elements, but the number of unknown matrix elements is also reduced from n² to 0.5n(n − 1). The 0.5n(n − 1) free parameters of the matrix A can be determined by any gradient-based method such as backpropagation. The cumulant-based algorithms in equations 3.1 and 3.6 should not be confused with the ICA algorithms based on the Kullback-Leibler distance, where the coefficients of the Edgeworth expansion (Comon, 1994) used to parameterize the marginal probabilities are defined as functions of cumulants. The cumulant algorithms presented here do not minimize the Kullback-Leibler distance but the magnitude of the nondiagonal elements of the estimated cumulant tensors. Minimization of the squared magnitude of the nondiagonal elements comes from comparing the cumulant expansion of the joint probability (to be precise, of its Fourier transform, that is, its characteristic function) with the product of the cumulant expansions of the marginal probabilities (again of their Fourier transforms). By inspection, one sees that the nondiagonal elements in the cumulant tensors should be zero for independence to hold. There are many possible ways to penalize the deviation from zero, and we have chosen the quadratic norm. The connection to the Kullback-Leibler distance (KLD) is that when all the marginals are independent, both the KLD and the cumulant cost function are equal to zero. Furthermore, the algorithm defined in equation 3.6 has the advantage that after a suitable transformation (input covariance matrix diagonalization and scaling), it acts only on the diagonal cumulant elements while minimizing the magnitude of the nondiagonal, that is, cross-cumulant terms. It is our experience that the cumulant-based ICA criteria are numerically superior to Kullback-Leibler distance–based ICA when the form of the independent univariate distributions is not known a priori, since in that case the optimization must concentrate not only on the linear map M but also on modeling the probability density functions. The ICA problem can also be formulated in the case where the input-output map is not a matrix but an invertible nonlinear function F. A parameterization of such functions with the so-called triangular volume-preserving network is presented in Deco and Brauer (1994a) and Deco and Obradovic (1996). Deco and Obradovic (1996) present several applications of nonlinear ICA.
Unfortunately, the cumulant-based criterion in this case requires minimization of all nondiagonal cumulant elements up to the desired order, and the extracted features are nonunique not only with respect to linear scaling and permutation, as in the linear case, but also with respect to individual functional transformations of the independent features. The next section presents the reconstruction of images carried out by the cumulant-based linear ICA as well as by the InfoMax method.

4 Image Reconstruction with the Cumulant-Based Linear ICA and InfoMax

The application of cumulant-based linear ICA defined in equation 3.6 and of InfoMax is illustrated with the example of a linear, pixel-by-pixel mixture of images and noise. The images used in this example can be found in the Matlab Demo directory as the files mandrill.mat and detail.mat. Both images are reduced to the same pixel size (359 × 371) and are represented by matrices whose entries are elements of the gray scale (between 0 and 1). In addition, a third image of the same size was produced whose matrix elements (pixel gray-scale values) were uniformly distributed between 0 and 1. The matrices corresponding to the three images were linearly combined pixel by pixel. This is equivalent to stacking all the rows of an image together to form a vector of length 359 · 371 and then linearly mixing (multiplying from the right) the resulting three vectors with an invertible matrix S. The left part of Figure 1 depicts the three mixed images, where the matrix S was
$$S = \begin{pmatrix} 1.5 & -2 & -3 \\ 1.8 & 1.3 & -2 \\ 1.2 & -1.5 & 1.4 \end{pmatrix}. \qquad (4.1)$$
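The mixing step itself is a one-liner once the images are flattened; a small sketch (ours, with random arrays standing in for the actual Matlab demo images, and the mixing written as a left-multiplication of the stacked source rows, which is equivalent up to transposition):

```python
import numpy as np

# Three equally sized gray-scale images, flattened row by row into
# vectors of length 359 * 371; the rows of 'imgs' play the role of
# the three source signals.
rng = np.random.default_rng(0)
imgs = np.vstack([
    rng.random(359 * 371),    # stand-in for mandrill.mat
    rng.random(359 * 371),    # stand-in for detail.mat
    rng.random(359 * 371),    # uniformly distributed noise image
])

S = np.array([[1.5, -2.0, -3.0],
              [1.8,  1.3, -2.0],
              [1.2, -1.5,  1.4]])   # mixing matrix of equation 4.1

mixed = S @ imgs                    # pixel-by-pixel linear mixture
```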
Figure 1: Linear mixtures of the original images are depicted on the left; the extracted independent images are depicted on the right.

The cumulant-based linear ICA criterion defined in equation 3.6 was applied first. The covariance matrix of the vector representation of the three mixed images is calculated, and its singular value decomposition (SVD) is performed, resulting in the matrices $D_1$ and N. The algorithm used to perform the SVD is a Matlab-provided function, but the same task can be achieved by any of the existing learning PCA algorithms (for references, see Deco & Obradovic, 1996). The scaled image vectors are then subjected to a rotation transformation R parameterized through a matrix A with, in this case, three unknown parameters. The learning of the unknown parameters of A, that is, of R, is carried out using a quasi-Newton algorithm with a line search for determining the optimal step size. It took only 15 iterations (cost function evaluations) to find a minimum of the cost function in equation 3.6 with r = 4. In other words, diagonal cumulant elements of orders 3 and 4 were used, where the third- and fourth-order cumulants of the components $y_{1i} = y_1(i)$, i = 1, …, 3, are

$$K_i^3(y_{1i}) = \langle (y_{1i} - \bar{y}_{1i})^3 \rangle, \qquad K_i^4(y_{1i}) = \langle (y_{1i} - \bar{y}_{1i})^4 \rangle - 3\,\langle (y_{1i} - \bar{y}_{1i})^2 \rangle^2, \qquad (4.2)$$

where $\bar{y}_{1i}$ is the mean value and $\langle \cdot \rangle$ stands for the expectation operator. From equation 4.2, it is clear that the cumulants used are invariant with respect to the bias.
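The bias invariance noted above is easy to confirm numerically; a minimal sketch (ours):

```python
import numpy as np

def k3(y):  # third-order cumulant of equation 4.2
    yc = y - y.mean()
    return (yc ** 3).mean()

def k4(y):  # fourth-order cumulant of equation 4.2
    yc = y - y.mean()
    return (yc ** 4).mean() - 3 * (yc ** 2).mean() ** 2

y = np.random.default_rng(1).random(100_000)
print(np.isclose(k3(y), k3(y + 5.0)))   # True: an added bias does not change K^3
print(np.isclose(k4(y), k4(y + 5.0)))   # True: an added bias does not change K^4
```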
The resulting images are depicted in the right part of Figure 1. Hence, the cumulant-based linear ICA criterion in equation 3.6 has succeeded in recovering the original images from their linear mixture without using any prior information about the probability distribution of the original images. The quality of the reconstruction in Figure 1 can be judged by multiplying the resulting unmixing matrix M with the original mixing matrix S:
$$M \cdot S = \begin{pmatrix} 0.0084 & 0.0066 & -3.4679 \\ -3.0166 & 0.0487 & -0.0087 \\ 0.0971 & -3.2656 & 0.01404 \end{pmatrix}. \qquad (4.3)$$
The resulting product is very close to a unitary matrix modulo diagonal scaling and permutation. The signal-to-noise ratio of the unmixed signals can be easily calculated from the product of the mixing and unmixing matrices presented in equation 4.3. If the unmixing had been perfect, the resulting matrix would have been diagonal (modulo permutation). Hence, the nondiagonal elements in every row provide the signal-to-noise measure. If all the original signals are of unit variance, the resulting signal-to-noise ratio for the first image is equal to $[(-3.4679)^2]/[(0.0084)^2 + (0.0066)^2] = 1.05 \times 10^5$. The same calculation can be carried out for the other two images. The cost function evolution and the change of all three signal-to-noise ratios during training are depicted in Figure 2. The number of steps on the x-axis corresponds to the number of function calls (i.e., cost function evaluations); the line search algorithm is based on a cubic approximation that requires three cost function evaluations before the optimal step size for the weight update is determined. Hence, the number of actual weight updates is approximately three times smaller than the number of steps depicted in the figure. Similar results are obtained by using the MLE interpretation of InfoMax in equation 2.12, where the derivatives $f_i'(y_i)$, that is, the output probabilities, were parameterized directly as normalized sums of gaussians (Pearlmutter & Parra, 1996). The product of the mixing and unmixing matrices after 360 iterations (cost function evaluations) in this case is:
$$M \cdot S = \begin{pmatrix} 0.0250 & -0.0423 & 2.6447 \\ -0.0335 & 2.4854 & 0.0513 \\ 2.2213 & -0.4 & -0.0351 \end{pmatrix}. \qquad (4.4)$$
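The row-wise signal-to-noise computation described after equation 4.3 applies equally to this product; a small sketch (ours, assuming unit-variance sources):

```python
import numpy as np

def rowwise_snr(P):
    """For each row of a mixing-unmixing product P, the ratio of the
    squared dominant element to the sum of squares of the remaining
    elements in that row."""
    P2 = P ** 2
    dom = P2.max(axis=1)
    return dom / (P2.sum(axis=1) - dom)

MS = np.array([[ 0.0084,  0.0066, -3.4679],
               [-3.0166,  0.0487, -0.0087],
               [ 0.0971, -3.2656,  0.01404]])  # product of equation 4.3
print(rowwise_snr(MS))   # first entry is about 1.05e5, as quoted in the text
```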
The normalization was carried out in order to guarantee that the sum of gaussians is suitable for modeling probabilities, that is, that the corresponding integral is equal to one. We tried to learn the unmixing matrix with the number of gaussians in each sum varying between one and five. Successful learning of the unmixing matrix occurred only when each $f_i'(y_i)$ was modeled as
Figure 2: Signal-to-noise ratios and the cost evolution in cumulant-based training.
a normalized sum of five gaussian functions. Hence, in order to recover the original images, a total of 51 parameters was needed (9 elements of the linear transformation, 5 gaussian centers per output, 5 gaussian standard deviations per output, and 4 normalized scaling factors per output). The optimization algorithm was once again the quasi-Newton batch algorithm with a line search. More than 300 iterations were required to converge to the minimum. The convergence of the cost function during learning, as well as the evolution of all three signal-to-noise ratios, is given in Figure 3. From this figure we can see that the individual signal-to-noise ratios are still changing, although not significantly, even when the cost function has almost settled. Therefore, approximately 350 iterations (i.e., cost function evaluations) were needed to obtain an unmixing matrix with a signal-to-noise ratio greater than 1000 in each channel. On the other hand, in realistic situations, the original mixing matrix is unknown; the only way to check the quality of the training is either to inspect the resulting signals (images) with the naked eye, when that is possible, or somehow to measure their statistical dependence, which would very likely lead to evaluation of the higher-order cumulants. It is important to note that we did not use the natural gradient approach of Amari, Cichocki, and Yang (1996), since the exact gradient was needed for approximating the inverse of the Hessian matrix in the quasi-Newton method. A separate simulation was carried out using the natural gradient as in Pearlmutter and Parra (1996), with the exception that a line search for determining the optimal step size was used. The training took considerably
Figure 3: Result of the InfoMax training with five gaussians in every mixture using the quasi-Newton method.
longer, which can be explained by the fact that gradient methods are in general slower than second-order algorithms such as the quasi-Newton algorithm. The result of the simulation is depicted in Figure 4. A possible explanation of why the InfoMax algorithm with a single squashing function (e.g., the integral of a single gaussian or a single fixed hyperbolic tangent) does not converge in this case is that the original signals are subgaussian. Their kurtoses are −1.3219, −1.3284, and −1.1965, respectively. Hence, learning of the probability density functions was necessary in this case. On the other hand, this makes the cumulant approach very attractive, since it is not sensitive to the presence of subgaussian signals in the mixtures. The examples of the cumulant-based algorithm and the InfoMax approach with a sum of gaussians lead to the following conclusions:

• The cumulant-based algorithm requires preprocessing (diagonalization and normalization of the input covariance matrix) before starting the optimization, while the InfoMax algorithm does not. Nevertheless, this is done easily by performing an SVD of the input covariance matrix.

• The optimization in the cumulant case is performed over a set of rotation matrices whose parameterization, presented in this article, is straightforward. Due to the structure of rotation matrices, the number
Figure 4: Result of the InfoMax training with five gaussians in every mixture and using natural gradient.
of free parameters is small, while the number of variables in the gaussian mixture case grows with a factor of mn, where n is the number of gaussians and m is the input dimension. In our examples, there were only 3 unknown parameters in the cumulant-based approach; there were 51 in the InfoMax approach with the gaussian mixture model. The number of parameters in the InfoMax algorithm is reduced only if the shape of the nonlinearities (densities) is determined a priori, as in the cocktail party example in Bell and Sejnowski (1995). While this article was in the reviewing process, Lee, Girolami, Bell, and Sejnowski (1998) introduced a method based on a modification of the standard InfoMax learning algorithm that estimates the sign of the kurtosis of the unmixed signals during training and chooses appropriate generic models for the probability density functions. In this case, it seems that there is no need for additional parameters beyond those of the unmixing matrix. Nevertheless, even in that case, the number of unknown parameters is n², which is still greater than the 0.5n(n − 1) needed in the cumulant approach. The convergence properties of that algorithm are outside the scope of this article.

• The value of the cost function in the cumulant algorithm is bounded from below by zero. Hence, it is an absolute measure of statistical dependence with respect to the given cumulant order r. Unfortunately,
this is not the case for the MLE-based InfoMax in equation 2.12. In the latter approach, we know only that the optimization leads toward statistical independence, but there is no absolute measure of how far from it we really are.

• Our experience has shown that the optimization of the InfoMax MLE criterion becomes more stable if the determinant of M is constrained, for example, det(M) = 1. This is especially important when the number of gaussians is not sufficient to model the input probability. In this case, the optimization will continue to maximize det(M), although this does not improve the statistical independence of the components.

5 Conclusions

This article has provided a detailed and rigorous analysis of two commonly used methods for redundancy reduction: linear ICA posed as a direct minimization of a suitably chosen redundancy measure, and InfoMax of a continuous stochastic signal transmitted through an appropriate nonlinear network. We showed analytically that ICA based on the Kullback-Leibler information as a redundancy measure and InfoMax lead to the same solution if the parameterization of the output nonlinear functions in the latter method is sufficiently rich. We discussed alternative redundancy measures not based on the Kullback-Leibler information distance. The practical issues of applying the cumulant-based ICA and InfoMax were also discussed and illustrated by the problem of extracting statistically independent factors from a linear, pixel-by-pixel mixture of images.

References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind source separation. In G. Tesauro, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Atick, J., & Redlich, A. (1990). Towards a theory of early visual processing. Neural Computation, 2, 308–320.
Atick, J., & Redlich, A. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–210.
Attneave, F. (1954). Informational aspects of visual perception. Psychological Review, 61, 183–193.
Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Barlow, H., Kaushal, T., & Mitchison, G. (1989). Finding minimum entropy codes. Neural Computation, 1, 412–423.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1160.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Deco, G., & Brauer, W. (1994a). Higher order statistics with neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing, 7 (pp. 247–254). Cambridge, MA: MIT Press.
Deco, G., & Brauer, W. (1995a). Nonlinear higher order statistical decorrelation by volume-conserving neural networks. Neural Networks, 8, 525–535.
Deco, G., & Obradovic, D. (1994b). Principal component analysis: A factorial learning approach. In International Conference on Artificial Neural Networks (Vol. 2, pp. 1059–1062). Berlin: Springer-Verlag.
Deco, G., & Obradovic, D. (1995b). Rotation based redundancy reduction learning. Neural Networks, 8, 751–755.
Deco, G., & Obradovic, D. (1996). An information-theoretic approach to neural computing. New York: Springer-Verlag.
Koecher, M. (1983). Lineare Algebra und analytische Geometrie. New York: Springer-Verlag.
Lee, T., Girolami, M., Bell, A. J., & Sejnowski, T. J. (1998). A unifying information-theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modeling. In press.
Nadal, J.-P., & Parga, N. (1994). Non-linear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Nadal, J.-P., & Parga, N. (1997). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Neural Computation, 9, 1421–1456.
Obradovic, D., & Deco, G. (1994). Linear feature extraction in networks with lateral connections. In IEEE World Congress on Computational Intelligence (Vol. 2, pp. 686–691).
Obradovic, D., & Deco, G. (1995). Linear feature extraction in non-gaussian networks. In World Congress on Neural Networks (Vol. 2, pp. 523–527).
Obradovic, D., & Deco, G. (1996). An information theory based learning paradigm for linear feature extraction. Neurocomputing, 12, 203–221.
Parra, L., Deco, G., & Miesbach, S. (1995). Redundancy reduction with information preserving nonlinear maps. Network: Computation in Neural Systems, 6, 61–72.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In Proc. ICONIP'96, Japan.
Redlich, A. N. (1993a). Redundancy reduction as a strategy for unsupervised learning. Neural Computation, 5, 289–304.
Redlich, A. N. (1993b). Supervised factorial learning. Neural Computation, 5, 750–766.
Zipf, G. (1949). Human behavior and the principle of least effort. Reading, MA: Addison-Wesley.

Received December 17, 1996; accepted April 14, 1998.
LETTER
Communicated by Erkki Oja
An Alternative Perspective on Adaptive Independent Component Analysis Algorithms Mark Girolami Department of Computing and Information Systems, University of Paisley, Paisley, PA1 2BE Scotland
This article develops an extended independent component analysis algorithm for mixtures of arbitrary subgaussian and supergaussian sources. The gaussian mixture model of Pearson is employed in deriving a closed-form generic score function for strictly subgaussian sources. This is combined with the score function for a unimodal supergaussian density to provide a computationally simple yet powerful algorithm for performing independent component analysis on arbitrary mixtures of nongaussian sources.

Neural Computation 10, 2103–2114 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

The independent component analysis (ICA) problem has been studied extensively within the artificial neural network (ANN) research community recently (Jutten & Hérault, 1991; Cichocki, Unbehauen, & Rummert, 1994; Karhunen & Joutsensalo, 1994; Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996; Pearlmutter & Parra, 1996; Oja, 1997; Karhunen & Pajunen, 1997; Xu, Cheung, Yang, & Amari, 1997; Hyvarinen & Oja, 1996; Malouche & Macchi, 1997; Girolami & Fyfe, 1997a; Lee, Girolami, Bell, & Sejnowski, in press). Two quite distinct approaches to the standard memoryless and noiseless ICA transformation have emerged. One approach is based on the information-theoretic method of Bell and Sejnowski (1995), and the other finds its roots in the nonlinear variants of Oja's principal component analysis (PCA) subspace algorithm (Oja, 1982, 1989). These differing ICA methods have been shown to have a common information-theoretic basis (Girolami, 1997; Lee et al., in press). This article focuses on the development of a general information-theoretic equivariant style of ICA algorithm. The original infomax ICA algorithm (Bell & Sejnowski, 1995) was further generalized in an important article by Pearlmutter and Parra (1996). These authors considered the ICA problem from the perspective of maximum likelihood estimation (MLE) and demonstrated that the nonlinear term found in the infomax algorithm is in fact the score function of the hypothesized distribution of the unobserved sources. The online parameterization of the score function is developed based on a mixture model of logistic densities. This particular approach has been explored further by Xu et al. (1997).
Amari et al. (1996) considered the ICA problem by explicitly developing an adaptive weight update rule to minimize the mutual information of a network output. This work was of great importance in that the natural gradient (Amari, 1998) was introduced for the noiseless ICA model. Independent of this work, Cardoso and Laheld (1996) introduced what they termed the relative gradient. The natural or relative gradient has the effect of introducing the equivariant property into the performance of the associated algorithm. A number of recent articles have explored the natural gradient and its effect on algorithm performance (Amari, 1998; Amari & Cardoso, 1997). It should be noted that Cichocki, Unbehauen, and Rummert (1994) were, arguably, the first to publish a general form of ICA algorithm that exhibited equivariant properties, although the algorithm derivation was not motivated by information-theoretic or MLE criteria. The shortcoming of these algorithms has been their restriction to mixtures of solely supergaussian sources (Bell & Sejnowski, 1995) or mixtures of subgaussian sources (Amari et al., 1996). Cichocki, Kasprzak, and Amari (1996) successfully overcame this problem by switching the order of operation of the nonlinearities in their equivariant ICA algorithm, thus ensuring asymptotically stable separating solutions. This article develops an extended ICA algorithm that allows arbitrary mixtures of supergaussian or subgaussian sources to be separated by utilizing score functions derived from sub- and supergaussian density models. This also provides an alternative perspective on the adaptive ICA algorithms by giving a probabilistic interpretation of the hyperbolic tangent term, which commonly appears within these learning rules. The remainder of the article is organized as follows: Section 2 presents the proposed extended ICA algorithm for the noiseless mixture model. Section 3 closes with a discussion and conclusions.

2 The ICA Signal Model and Adaptive Algorithms

Assume that at time t, an observation of an N-dimensional data vector

$$x(t) = (x_1(t), \ldots, x_N(t))^T \in \mathbb{R}^N \qquad (2.1)$$

is made.
The statistical data model that generates the vector observation x(t) is defined as

$$x(t) = A s(t) + n(t). \qquad (2.2)$$

In the strictly linear, memoryless, and real case, the matrix A ∈ ℝ^{N×M} is the unknown mixing matrix, and the source vector

$$s(t) = (s_1(t), \ldots, s_M(t))^T \qquad (2.3)$$

will be considered here as having zero mean and finite variance. The observation or sensor noise term is assumed to be normally distributed (Kay, 1993). The aim of ICA is to recover the unobserved sources s(t) from the observations when there is no knowledge of the mixing matrix A (Comon, 1994). The classical adaptive approaches to ICA, such as infomax (Bell & Sejnowski, 1995), MLE (Pearlmutter & Parra, 1996), and statistical loss function minimization (Amari, 1998), consider the noise-free transformation

$$x(t) = A s(t). \qquad (2.4)$$
The fundamental assumption made regarding the source vector is that its individual marginal components are statistically mutually independent, with at most one of them gaussian (Comon, 1994). This is the basis of all research into blind signal processing and ICA. A subsequent linear transformation y(t) = Wx(t), where W ∈ ℝ^{M×N}, seeks to minimize the mutual information between the components of the transformed variable y(t) and so make them as independent as possible. An excellent tutorial and review of the artificial neural network–based adaptive approaches to ICA can be found in Karhunen (1996). The more general statistical loss function (Amari, 1998; Douglas, Cichocki, & Amari, 1997) or projection pursuit approaches to ICA (Girolami, Cichocki, & Amari, 1997) provide discrete weight adaptation algorithms of the form

$$\Delta W_n = \eta_n \left( f(y_n) x_n^T + (W_n R_{xx} W_n^T)^{-1} W_n R_{xx} \right), \qquad (2.5)$$
where $\eta_n$ is the decaying learning rate or step size, and the sample covariance matrix of the observation vector is $R_{xx} = E\{xx^T\}$. The function $f(y_n)$ operates element-wise on the vector $y_n$, such that $f(y_n) = [f_1(y_1) \ldots f_M(y_M)]^T$, where each element is the score function of the associated output variable,

$$f_i(y_i) = \frac{p'_{\hat{s}_i}(y_i)}{p_{\hat{s}_i}(y_i)},$$
parameterized as the hypothesized form of the underlying sources (Amari, 1998). For the case where the mixing matrix is square, such that M = N, the weight adaptation (see equation 2.5) reduces to the classical infomax (Bell & Sejnowski, 1995) and MLE (Pearlmutter & Parra, 1996) approaches to ICA:

$$\Delta W_n = \eta_n \left( f(y_n) x_n^T + (W_n^T)^{-1} \right). \qquad (2.6)$$

The natural gradient (Amari et al., 1996; Amari, 1998) equivalent of equation 2.5 yields the now-familiar form

$$\Delta W_n = \eta_n \left( f(y_n) y_n^T + I \right) W_n, \qquad (2.7)$$
where in this case W ∈ ℝ^{M×N}. The parameterization of the nonlinear term in the algorithms (see equations 2.5 through 2.7) is dependent on the probability density function (pdf) of the underlying sources. The original infomax algorithm used a uniform nonlinear term $f_i(y_i) = -2\tanh(y_i)$, which, viewed as a score function, indicates that the hypothesized distribution of the sources was hyperbolic-Cauchy (Stuart & Ord, 1987), $p_i(y_i) \propto \mathrm{sech}^2(y_i)$. This symmetric distribution is characterized by positive kurtosis and is generally referred to as supergaussian. Many naturally occurring signals, such as human speech and music, have densities similar to the hyperbolic-Cauchy (Bell & Sejnowski, 1995). The problem occurs when the underlying sources are characterized by densities that may be bimodal and have negative kurtosis. These densities are often referred to as subgaussian. Natural images possess such densities (Bell & Sejnowski, 1997), as do communications data (Haykin, 1994), physiological artifacts that contaminate electroencephalograph (EEG) data, and functional magnetic resonance imaging (fMRI) data (Jung et al., 1997; McKeown et al., 1997). The underlying density of such signals can be characterized by implementing a mixture model (Everitt, 1993) for each $p_{\hat{s}_i}(y_i)$ and adapting the model parameters based on the observed data. This approach has been proposed for ICA by Pearlmutter and Parra (1996), Moulines, Cardoso, and Gassiat (1997), and Xu et al. (1997), where mixtures of gaussian or logistic density functions are utilized. In practical situations where observation data are scarce (e.g., in EEG monitoring), there are often insufficient data to parameterize the mixture model fully (Lee, Girolami, & Sejnowski, 1998). In such cases it is desirable to use a single nonlinear term that will allow the stable separation of either subgaussian or supergaussian sources. The following section proposes a simple yet powerful mixture model, which yields a nonlinear term for the natural gradient form of ICA algorithms.

2.1 The Score Function for Subgaussian Sources Based on the Pearson Mixture Model. MLE approaches to data clustering (Everitt, 1993) employ mixture of gaussians (MOG) models to seek transition regions of high and low density and so identify potential data clusters. An early MOG model for data clustering was proposed by Pearson (1894), which took the form

$$p(s) = (1 - a) f_1(s) + a f_2(s), \qquad (2.8)$$
where

$$f_1(s) = \frac{1}{\sigma_1\sqrt{2\pi}} \exp\!\left(\frac{-(s-\mu_1)^2}{2\sigma_1^2}\right) \quad \text{and} \quad f_2(s) = \frac{1}{\sigma_2\sqrt{2\pi}} \exp\!\left(\frac{-(s-\mu_2)^2}{2\sigma_2^2}\right)$$
are gaussians with mean values and variances defined as $\mu_1, \mu_2$ and $\sigma_1^2, \sigma_2^2$, respectively. The weighting coefficient a serves to create levels of skewness
Figure 1: Range of densities achieved by the Pearson mixture model.
in the distribution. Figure 1 indicates the range of densities that can be formed by this model for values of a within the valid range [0, 1] and the parameters $\mu_1 = 2$, $\mu_2 = -2$, $\sigma_1^2 = \sigma_2^2 = 1$. It is clear that the distribution is symmetric, possessing two distinct modes, when a = 1/2. The cumulant generating function (CGF) for equation 2.8 and the related cumulants are computed in the appendix. Figure 2 highlights the associated levels of kurtosis for varying values of µ and a. For small values of skewness, that is, when a = 0.2 ∼ 0.8, the kurtosis is negative; interestingly, high levels of skewness drive the kurtosis of the distribution to positive values. However, for the symmetric case, the kurtosis of the distribution is strictly negative (see equation A.5). This mixture model may then serve as a suitable density for computing the score function of symmetric subgaussian sources. Computing the score function for equation 2.8 in the general case yields

$$\frac{p'(s)}{p(s)} = \frac{s - \mu_1 - (\zeta/\sigma_2^2)(s - \mu_2)\,\Gamma}{\zeta\Gamma - \sigma_1^2}, \qquad (2.9)$$

where $\Gamma = \exp(-(\alpha s^2 - \beta s - \gamma))$; $\zeta = \varphi(\sigma_1^2)^2/\sigma_2^2$ and $\varphi = a/(a-1)$; $\alpha = (\sigma_2^2 - \sigma_1^2)/2\sigma_1^2\sigma_2^2$; $\beta = (\mu_2\sigma_1^2 - \mu_1\sigma_2^2)/\sigma_1^2\sigma_2^2$; $\gamma = (\mu_1^2\sigma_2^2 - \mu_2^2\sigma_1^2)/2\sigma_1^2\sigma_2^2$.
Figure 2: Range of kurtosis values for varying values of mixing coefficient.
When $|\mu_1| = |\mu_2| = |\mu|$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$, then after some manipulation,

$$\frac{p'(s)}{p(s)} = \frac{-s}{\sigma^2} + \frac{c\,(1 + \varphi\exp(-2cs))}{1 - \varphi\exp(-2cs)} = \frac{-s}{\sigma^2} + c f(cs), \qquad (2.10)$$
where $c = \mu/\sigma^2$. The shape of the nonlinear term f(cs) appearing in equation 2.10 for varying values of the mixing coefficient a is given in Figure 3. This family of curves can be considered as a series of shifted hyperbolic tangents; indeed, for a fully symmetric distribution where a = 1/2, the score function takes on a very simple closed form:

$$\frac{p'(s)}{p(s)} = \frac{-s}{\sigma^2} + \frac{c\,(1 - \exp(-2cs))}{1 + \exp(-2cs)} = \frac{-s}{\sigma^2} + c\tanh(cs). \qquad (2.11)$$
Finally, for the case where $\sigma^2 = \mu = 1$, the kurtosis equals −1/2, and the score function reduces to the following closed form:

$$\frac{p'(s)}{p(s)} = -s + \tanh(s). \qquad (2.12)$$
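Equation 2.12 can be checked numerically; the sketch below (ours) compares a finite-difference estimate of the log-density derivative of the symmetric Pearson mixture with µ = σ² = 1 against −s + tanh(s):

```python
import numpy as np

def pearson_pdf(s, mu=1.0, var=1.0):
    """Symmetric Pearson mixture: 0.5*N(mu, var) + 0.5*N(-mu, var)."""
    g = lambda m: np.exp(-(s - m) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return 0.5 * g(mu) + 0.5 * g(-mu)

s = np.linspace(-4, 4, 801)
eps = 1e-5
# central finite difference of log p(s), i.e., the score p'(s)/p(s)
score = (np.log(pearson_pdf(s + eps)) - np.log(pearson_pdf(s - eps))) / (2 * eps)
print(np.allclose(score, -s + np.tanh(s), atol=1e-6))   # True
```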
Figure 3: Plot of the nonlinear term appearing in the score function for the Pearson density model for varying values of mixing coefficient.
This score function can then be used as the nonlinear term in the ICA algorithm (2.7), and for strictly negatively kurtotic, symmetric source densities yields the following weight update equation:

$$\Delta W_n = \eta_n \left( I + \tanh(y_n) y_n^T - y_n y_n^T \right) W_n. \qquad (2.13)$$
2.2 The Extended Infomax Algorithm. For sources with supergaussian densities, in addition to the hyperbolic-Cauchy distribution, the following density may also be employed: $p(s) \propto p_G(s)\,\mathrm{sech}^2(s)$, where $p_G(s) \sim N(0, 1)$. This particular unimodal density has slightly heavier tails than a gaussian; the associated score function is given simply by

$$\frac{p'(s)}{p(s)} = -s - \tanh(s). \qquad (2.14)$$
Substituting in equation 2.7 yields, for strictly supergaussian sources,

$$\Delta W_n = \eta_n \left( I - \tanh(y_n) y_n^T - y_n y_n^T \right) W_n. \qquad (2.15)$$
Equations 2.13 and 2.15 can be combined and written in a single compact form,

$$\Delta W_n = \eta_n \left( I - K \tanh(y_n) y_n^T - y_n y_n^T \right) W_n, \qquad (2.16)$$
where K is the M × M diagonal matrix containing the switching terms that select between the nonlinearities in equations 2.12 and 2.14. In this case, the switch is the kurtosis sign of the individual output elements, which can be estimated in an online manner using a moving-average estimator (Girolami & Fyfe, 1997b):

$$K = \mathrm{diag}[\mathrm{sgn}(\kappa_4(1)), \mathrm{sgn}(\kappa_4(2)), \ldots, \mathrm{sgn}(\kappa_4(M))]. \qquad (2.17)$$
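For concreteness, a minimal batch sketch of one update of equations 2.16 and 2.17 (our own illustration; sample averages replace the online outer products, and a batch kurtosis estimate stands in for the moving-average estimator):

```python
import numpy as np

def extended_ica_step(W, x, eta=0.01):
    """One natural-gradient update of equation 2.16 on a data batch x
    (shape: N x T); the switching matrix K of equation 2.17 uses the
    sign of a batch kurtosis estimate."""
    y = W @ x
    T = y.shape[1]
    yc = y - y.mean(axis=1, keepdims=True)
    m2 = (yc ** 2).mean(axis=1)
    kurt = (yc ** 4).mean(axis=1) - 3 * m2 ** 2     # kappa_4 per output
    K = np.diag(np.sign(kurt))                      # equation 2.17
    I = np.eye(W.shape[0])
    grad = I - K @ (np.tanh(y) @ y.T) / T - (y @ y.T) / T
    return W + eta * grad @ W                       # equation 2.16
```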
An alternate switching criterion is utilized in Lee et al. (1997), based on the nonlinear moment defined by the stability criteria detailed in Cardoso (1998a, 1998b). The algorithm in equation 2.16 was originally proposed in Girolami (1997) and Girolami and Fyfe (1997a), and its utility was demonstrated on straightforward toy problems.¹ It has also been successfully applied to the challenging real-world problem of removing artifacts from EEG and fMRI data (Jung et al., 1997; McKeown et al., 1997; Lee et al., 1997). It has the advantage of being applicable to the situation where there are more sensors than sources. The algorithm also overcomes the restrictive nature of certain ICA algorithms in that it can "blindly" separate arbitrary mixtures of both sub- and supergaussian sources with no a priori knowledge of the source densities. The computational simplicity of the algorithm is particularly attractive in comparison to the methods that employ full-mixture-density models.

¹ A MATLAB implementation is available at http://www.cnl.salk.edu/CNL/. A Windows95/NT toolbox, which contains an implementation of the proposed extended ICA algorithm, is available at http://www.bip.riken.go.jp/absl/orsier/OOLABSS/.

3 Discussion and Conclusions

This article has proposed an extended ICA algorithm capable of separating arbitrary mixtures of nongaussian sources in the noiseless case. Various authors have demonstrated the practical utility of the algorithm. The Pearson mixture model has been used in identifying a suitable score function for strictly subgaussian sources. Interestingly, for the symmetric case, the score function consists of a linear term and the hyperbolic tangent. This has also
provided an alternative perspective on the form of the ICA algorithm by giving a probabilistic interpretation of the hyperbolic tangent nonlinear term, which appears within the weight update rule. This has further implications for the nonlinear PCA style of ICA algorithm in that a density model may now be associated with the algorithm's nonlinear function (Girolami, 1997). It should be noted that the Pearson model was originally proposed as a parametric model for data clustering, and this suggests that the extended ICA algorithm will also be suited to data clustering and visualization applications. Preliminary data clustering and visualization results have been reported in Girolami (1998) and Girolami et al. (1997).

Appendix

The cumulant generating function for the density model $p(s) = (1-a)f_1(s) + af_2(s)$, where $\mu_1 = -\mu_2 = \mu$ and $\sigma_1^2 = \sigma_2^2 = \sigma^2$, admits $\phi(w) = \log(\varphi(w))$, where $\varphi(w) = \int_{-\infty}^{+\infty} p(s)\exp(iws)\,ds$ and $i = \sqrt{-1}$ (Stuart & Ord, 1987). This then gives a cumulant generating function of $\phi(w) = \log((1-a)\exp(A) + a\exp(B))$, where

$$A = i\mu w - \frac{\sigma^2 w^2}{2} \quad \text{and} \quad B = -i\mu w - \frac{\sigma^2 w^2}{2}.$$

The related cumulants can be computed as

$$\kappa_n = (-i)^n \left.\frac{d^n \phi(w)}{dw^n}\right|_{w=0},$$
and so:

$$\kappa_1 = (-i)\left.\frac{d\phi(w)}{dw}\right|_{w=0} = -i^2\,\mu(1-2a) = \mu(1-2a) \qquad (A.1)$$

$$\kappa_2 = (-i)^2\left.\frac{d^2\phi(w)}{dw^2}\right|_{w=0} = (-i)^2\,(4a\mu^2(a-1) - \sigma^2) = 4a\mu^2(1-a) + \sigma^2 \qquad (A.2)$$

$$\kappa_3 = (-i)^3\left.\frac{d^3\phi(w)}{dw^3}\right|_{w=0} = (-i)^3\, 8ia\mu^3(a-1)(2a-1) = 8a\mu^3(a-1)(2a-1) \qquad (A.3)$$

$$\kappa_4 = (-i)^4\left.\frac{d^4\phi(w)}{dw^4}\right|_{w=0} = (-i)^4\, 16a\mu^4(1-a)(6a^2-6a+1) = 16a\mu^4(1-a)(6a^2-6a+1). \qquad (A.4)$$
For the strictly symmetric case where a = 1/2:

κ₁ = mean value = 0
κ₂ = variance = (µ² + σ²)
κ₃ = skew = 0
κ₄ = fourth-order cumulant = −2µ⁴.

The kurtosis of the symmetric density model is then

$$\mathrm{kurtosis} = \frac{\kappa_4}{(\kappa_2)^2} = \frac{-2\mu^4}{(\mu^2 + \sigma^2)^2}. \qquad (A.5)$$
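Equation A.5 is easily verified by simulation; a short sketch (ours) for a = 1/2 and µ = σ = 1, where the predicted kurtosis is −1/2:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0
n = 1_000_000
signs = rng.choice([-1.0, 1.0], size=n)          # a = 1/2: pick +mu or -mu
s = signs * mu + sigma * rng.standard_normal(n)  # sample the symmetric mixture

m2 = ((s - s.mean()) ** 2).mean()
m4 = ((s - s.mean()) ** 4).mean()
print((m4 - 3 * m2 ** 2) / m2 ** 2)              # approx -2*mu**4/(mu**2+sigma**2)**2 = -0.5
```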
This takes on strictly negative values for all µ > 0.

Acknowledgments

I am grateful for the ongoing support of this research by the Knowledge Laboratory, NCR (Ltd), Dundee, Scotland. Thanks to the anonymous reviewers for their valuable comments, which greatly improved the quality of this article.

References

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Cardoso, J.-F. (1997). Blind source separation—Semi-parametric statistical approach. IEEE Transactions on Signal Processing, 45(11), 2692–2700.
Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. Neural Information Processing, 8, 757–763.
Bell, A., & Sejnowski, T. (1995). An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A., & Sejnowski, T. (1997). The independent components of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J. F. (1998a). Blind signal processing: A review. Unpublished manuscript.
Cardoso, J. F. (1998b). Entropic contrasts for source separation. In S. Haykin (Ed.), Unsupervised adaptive filtering.
Cardoso, J. F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, SP-43, 3017–3029.
Cichocki, A., Kasprzak, W., & Amari, S. (1996). Neural network approach to blind separation and enhancement of images. In Signal Processing VIII: Theories and Applications (Vol. 1, pp. 579–582). EURASIP/LINT.
Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30, 1386–1387.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Douglas, S. S., Cichocki, A., & Amari, S. (1997). Multichannel blind separation and deconvolution of sources with arbitrary distributions. In Proc. IEEE Workshop on Neural Networks for Signal Processing (pp. 436–445).
Everitt, B. S. (1993). Cluster analysis. Portsmouth, NH: Heinemann Educational Books.
Girolami, M., & Fyfe, C. (1997a). Generalised independent component analysis through unsupervised learning with emergent Bussgang properties. In Proc. IEEE International Conference on Neural Networks (pp. 1788–1891).
Girolami, M., & Fyfe, C. (1997b). Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proceedings on Vision, Image and Signal Processing, 14(5), 299–306.
Girolami, M. (1997). Self-organising neural networks for signal separation. Unpublished doctoral dissertation, University of Paisley.
Girolami, M. (1998). The latent variable data model for exploratory data analysis and visualisation: A generalisation of the nonlinear infomax algorithm. Neural Processing Letters.
Girolami, M., Cichocki, A., & Amari, S.-I. (1997). A common neural network model for exploratory data analysis and independent component analysis (Tech. Rep. No. BIP-97-001). Saitama, Japan: RIKEN, Brain Science Institute, Brain Information Processing Group.
Haykin, S. (1994). Blind deconvolution. Englewood Cliffs, NJ: Prentice Hall.
Hyvarinen, A., & Oja, E. (1996). Simple neuron models for independent component analysis. International Journal of Neural Systems, 7, 671–687.
Jung, T. P., Humphries, C., Lee, T. W., Makeig, S., McKeown, M., Iragui, V., & Sejnowski, T. (1997). Extended ICA removes artifacts from electroencephalographic recordings. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press. In press.
Jutten, C., & Herault, J. (1991). Blind separation of sources, part 1: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–10.
Karhunen, J. (1996). Neural approaches to independent component analysis and source separation. In Proc. ESANN'96 (pp. 249–266).
Karhunen, J., & Joutsensalo, J. (1994). Representation and separation of signals using nonlinear PCA type learning. Neural Networks, 7, 113–127.
Karhunen, J., & Pajunen, P. (1997). Blind source separation using least-squares type adaptive algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'97) (pp. 3361–3364).
Kay, S. (1993). Fundamentals of statistical signal processing: Estimation theory. Englewood Cliffs, NJ: Prentice Hall.
Lee, T. W., Girolami, M., Bell, A. J., & Sejnowski, T. (in press). A unifying information theoretic framework for independent component analysis. International Journal on Mathematical and Computer Modelling.
Lee, T. W., Girolami, M., & Sejnowski, T. (1998). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation.
Malouche, Z., & Macchi, O. (1997). A linear adaptive neural network for extraction of independent components. In ESANN'97, Special ICA Session (pp. 261–266).
McKeown, M., Jung, T. P., Makeig, S., Brown, G., Kindermann, S., Lee, T. W., & Sejnowski, T. (1997). Transiently time-locked fMRI activations revealed by independent component analysis. Proceedings of the National Academy of Sciences, 95, 803–810.
Moulines, E., Cardoso, J. F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In Proc. ICASSP'97 (Vol. 5, pp. 3617–3620).
Oja, E. (1982). A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15, 267–273.
Oja, E. (1989). Neural networks, principal components and subspaces. International Journal of Neural Systems, 1, 61–68.
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17, 25–45.
Pearlmutter, B., & Parra, L. (1996). A context sensitive generalisation of ICA. In International Conference on Neural Information Processing (pp. 151–157).
Pearson, K. (1894). Contributions to the mathematical study of evolution. Phil. Trans. Roy. Soc., A 185, 71.
Stuart, A., & Ord, J. K. (1987). Kendall's advanced theory of statistics. Vol. 1: Distribution theory. New York: Wiley.
Xu, L., Cheung, C., Yang, H., & Amari, S.-I. (1997). Independent component analysis by the information theoretic approach with mixture of density. In Proc. IEEE International Conference on Neural Networks (Vol. 3, pp. 1821–1826).

Received November 3, 1997; accepted April 28, 1998.
LETTER
Communicated by Robert Tibshirani
Density Estimation by Mixture Models with Smoothing Priors Akio Utsugi National Institute of Bioscience and Human-Technology, Tsukuba 305–8566, Japan
In the statistical approach for self-organizing maps (SOMs), learning is regarded as an estimation algorithm for a gaussian mixture model with a gaussian smoothing prior on the centroid parameters. The values of the hyperparameters and the topological structure are selected on the basis of a statistical principle. However, since the component selection probabilities are fixed to a common value, the centroids concentrate on areas with high data density. This deforms a coordinate system on an extracted manifold and makes smoothness evaluation for the manifold inaccurate. In this article, we study an extended SOM model whose component selection probabilities are variable. To stabilize the estimation, a smoothing prior on the component selection probabilities is introduced. An estimation algorithm for the parameters and the hyperparameters based on empirical Bayesian inference is obtained. The performance of density estimation by the new model and the SOM model is compared via simulation experiments.

Neural Computation 10, 2115–2135 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

Mixture of probability distributions is an elemental method for constructing a complex stochastic model from simple stochastic components (McLachlan & Basford, 1988). Although the density of a mixture model is only a weighted sum of its component densities, estimation for the model is difficult in comparison with that for the associated single-component model. However, once a standard estimation method by an expectation-maximization (EM) algorithm was established, mixture models came to be applied to many problems. In particular, gaussian mixture models are used for discovering clusters from data. Moreover, their extension has been attempted in various ways. For example, prior probabilities on the parameters have been introduced to stabilize the estimation (Hanson, Stutz, & Cheeseman, 1991; Ormoneit & Tresp, 1996). Furthermore, we can consider a topological structure for the assembly of the components independent of the data space. That is, the components are assumed to lie on a hidden space, which is called the inner space of the model. When the dimension of the inner space is smaller than that of the data space, this model represents stochastic data generation from a manifold embedded in the data space. In addition, when the manifold is smooth, we can
use the smoothness for the regularization of the parameters. In this case, the manifold is modeled as a sample of a smooth stochastic process, which is represented by a smoothing prior probability distribution on the centroid parameters of the components. When this model is used for the extraction of the hidden manifold from data, rather than discovering clusters, many components are required for the detailed representation of curvature structure on the manifold. The regularization of the centroid parameters enables stable estimation on the many components. In particular, the simplest kind of gaussian mixture model with a gaussian smoothing prior on the centroid parameters can be regarded as a stochastic model for self-organizing maps (SOMs). In other words, a learning algorithm for SOMs is derived as an estimation algorithm for this stochastic model (Utsugi, 1996, 1997). Although SOMs are well known as unique neural networks to extract hidden topological structure from data, they had no suitable statistical model for a long time (Kohonen, 1995). The statistical treatment of SOMs has many advantages. For example, the values of the hyperparameters and the topology can be selected on the basis of a statistical criterion. Experiments in earlier studies were confined to the SOM model with one-dimensional inner spaces. In this article, we attempt an experiment for the model with a multidimensional inner space. Then, problems proper to the multidimensional case are observed. For example, the centroids of the components tend to concentrate on areas with high data density, since the component selection probabilities are fixed to a common value. This property causes the deformation of a coordinate system on an extracted manifold. Especially in the multidimensional case, this deformation is large. When the smoothness of the manifold is evaluated for the hyperparameter selection, the deformation of the coordinate system makes this evaluation inaccurate. This difficulty is partially due to the fixation of the component selection probabilities. If we use the component selection probabilities as variable parameters, the components can spread more uniformly on the extracted manifold. This suppresses the deformation of the coordinate system. However, if the component selection probabilities are completely free, the model has too many free parameters to obtain stable estimates. Thus, we need the regularization of the component selection probabilities. In the remainder of this article, we study a mixture model with a smoothing prior on the component selection probabilities in addition to that on the centroids. First, a maximum a posteriori (MAP) estimation algorithm for the parameters is obtained using a generalized EM algorithm. Next, we obtain a hyperparameter estimation algorithm by an empirical Bayesian method. Finally, the performance of density estimation based on the new model and the SOM model is compared using simulation experiments.
2 SOM with Multidimensional Inner Space

2.1 Stochastic Model for SOM. The probability density function of a mixture model for a data point x is defined as

$$f(x \mid w, v, \beta) = \sum_{s=1}^{r} v_s\, g(x \mid w_s, \beta), \qquad (2.1)$$
using the component selection probabilities $v_s$ and the component densities $g(x \mid w_s, \beta)$ with the component parameters $w_s$ and the common parameter β, where $v = (v_1, \ldots, v_r)'$ and $w = (w_1', \ldots, w_r')'$. We focus on a special kind of mixture model for a data point $x = (x_1, \ldots, x_m)' \in R^m$, whose components have spherical gaussian densities with a common variance,

$$g(x \mid w_s, \beta) = \left(\frac{\beta}{2\pi}\right)^{m/2} \exp\left(-\frac{\beta}{2}\|x - w_s\|^2\right), \qquad (2.2)$$

where $w_s = (w_{s1}, \ldots, w_{sm})' \in R^m$ are the centroids of the components. The likelihood function for a data set $X = (x_1, \ldots, x_n)'$ is given as

$$f(X \mid w, v, \beta) = \prod_{i=1}^{n} f(x_i \mid w, v, \beta) = \left(\frac{\beta}{2\pi}\right)^{mn/2} \prod_{i=1}^{n} \sum_{s=1}^{r} v_s \exp\left(-\frac{\beta}{2}\|x_i - w_s\|^2\right). \qquad (2.3)$$
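A direct transcription of equations 2.1 through 2.3 (our own sketch; in practice a log-sum-exp would be used to avoid underflow):

```python
import numpy as np

def log_likelihood(X, w, v, beta):
    """Log of equation 2.3 for data X (n x m), centroids w (r x m),
    selection probabilities v (r,), and inverse variance beta."""
    n, m = X.shape
    d2 = ((X[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)  # n x r squared distances
    comp = v[None, :] * np.exp(-0.5 * beta * d2)             # v_s g(x_i | w_s, beta), up to the constant
    return 0.5 * n * m * np.log(beta / (2 * np.pi)) + np.log(comp.sum(axis=1)).sum()
```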
In addition to this likelihood, a smoothing prior on the centroids along an inner space is introduced. Using a discretized differential operator D on the inner space, which is considered in the next section, we define the smoothness of a centroid configuration along the inner space by $-\sum_{j=1}^{m} \|Dw_{(j)}\|^2$, where $w_{(j)} = (w_{1j}, \ldots, w_{rj})'$. Then we assume a gaussian smoothing prior,

$$f(w \mid \alpha, D) = \prod_{j=1}^{m} \left(\frac{\alpha}{2\pi}\right)^{l/2} ({\det}^{+} M)^{1/2} \exp\left(-\frac{\alpha}{2}\|Dw_{(j)}\|^2\right), \qquad (2.4)$$
where $M = D'D$, $l = \mathrm{rank}\, M$, and ${\det}^{+} M$ denotes the product of positive eigenvalues of M. This prior gives larger probability density to smoother centroid configurations. Since M is generally singular, this prior density function is partially improper. That is, it specifies density only on a subspace of the parameter space. This subspace is spanned by the eigenvectors of M with positive eigenvalues. From the likelihood and the prior, a posterior on the centroids is calculated by Bayes' rule,

$$f(w \mid X, v, \alpha, \beta, D) \propto f(X, w \mid v, \alpha, \beta, D) = f(X \mid w, v, \beta)\, f(w \mid \alpha, D). \qquad (2.5)$$
The MAP estimates of the centroids are given as the maximizer of this posterior. In particular, if $v_s$ are fixed to 1/r for every s, a learning algorithm of SOMs is derived from the posterior as an approximate MAP estimation algorithm (Utsugi, 1997). Next, we consider the estimation of α and β. Since β is the inverse of the variance of the gaussian components, it can be estimated through an EM algorithm together with the centroids. On the other hand, α is a parameter in the prior, that is, a hyperparameter. Although there are several methods for determining the hyperparameter, we adopt the empirical Bayesian method because of its stability and usability (Utsugi, 1996, 1997). In the empirical Bayesian method, β is treated as a hyperparameter, and the estimates of the hyperparameters are given by the maximizer of their evidence:

$$f(X \mid \alpha, \beta, D) = \int f(X, w \mid \alpha, \beta, D)\, dw. \qquad (2.6)$$

A gaussian approximation for the evidence has been presented (Utsugi, 1997). In addition, an efficient search algorithm for the maximizer of the evidence using the derivatives has been presented (Utsugi, 1996).

2.2 Discretized Differential Operator on Multidimensional Inner Space. On a one-dimensional inner space with a simple line segment topology, the discretized differential operator D is given by a discretized Laplacian $D_r^{(2)}$, whose entries are

$$d_{ij} = \begin{cases} -2 & |i - j + 1| = 0 \\ 1 & |i - j + 1| = 1 \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, r-2;\ j = 1, \ldots, r. \qquad (2.7)$$
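The operator of equation 2.7 is a small banded matrix; a sketch of its construction (ours):

```python
import numpy as np

def laplacian_1d(r):
    """(r-2) x r discretized Laplacian D_r^(2) of equation 2.7:
    row i has entries 1, -2, 1 at columns i, i+1, i+2 (0-based)."""
    D = np.zeros((r - 2, r))
    for i in range(r - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    return D
```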
1
Ã
0
∂ 2 wj ∂x2
!2 dx.
(2.8)
We regard w(j) as a discretized vector of wj (x), that is, wsj = wj ((s − 1)δ), s = 1, . . . , r, where δ = 1/(r − 1). The discretized spline regularizer is now given by S(w(j) ) =
¶ r−1 µ X ws−1,j + ws+1,j − 2wsj 2 s=2
δ
=
1 kDw(j) k2 . δ2
We ignore the constant 1/δ 2 , since it can be absorbed into α.
(2.9)
Density Estimation
2119
A natural extension of the spline regularizer for functions on multidimensional spaces is the thin-plate spline regularizer (Wahba, 1990). In the two-dimensional case, this is given as !2 Ã Z Z Ã 2 !2 Ã 2 !2 ∂ wj ∂ 2 wj ∂ wj + +2 dxdy S(wj ) = 2 ∂y2 ∂x∂y Ä ∂x
(2.10)
for a function $w_j(x, y)$ on a region Ω ⊂ R². Here, we consider that $w_{(j)}$ is a discretized vector of the function $w_j(x, y)$ at the vertices of a regular-interval lattice on Ω. When the lattice is rectangular and its vertices are indexed in order by (k, t), k = 1, …, p, t = 1, …, q, the value of the function at a vertex (k, t) gives the sth entry of $w_{(j)}$ at s = q(k − 1) + t. The discretization of the thin-plate spline regularizer is now given as

$$S(w_{(j)}) = \|D_{xx} w_{(j)}\|^2 + \|D_{yy} w_{(j)}\|^2 + 2\|D_{xy} w_{(j)}\|^2, \qquad (2.11)$$

using

$$D_{xx} = D_p^{(2)} \otimes I_q, \qquad (2.12)$$
$$D_{yy} = I_p \otimes D_q^{(2)}, \qquad (2.13)$$
$$D_{xy} = D_p^{(1)} \otimes D_q^{(1)}, \qquad (2.14)$$

where $I_p$ and $I_q$ denote identity matrices with sizes p and q, and ⊗ denotes the Kronecker product. The matrices $D_p^{(2)}$ and $D_q^{(2)}$ are the one-dimensional discretized Laplacians with sizes p and q. The matrices $D_p^{(1)}$ and $D_q^{(1)}$ are the one-dimensional first-order differential operators. For example, $D_p^{(1)}$ is defined as a matrix whose entries are

$$d_{ij} = \begin{cases} 1 & i = j \\ -1 & i + 1 = j \\ 0 & \text{otherwise} \end{cases} \qquad i = 1, \ldots, p-1;\ j = 1, \ldots, p. \qquad (2.15)$$

The discretized thin-plate spline regularizer is also expressed as $S(w_{(j)}) = \|Dw_{(j)}\|^2$ using

$$D = [D_{xx}', D_{yy}', D_{xy}', D_{xy}']'. \qquad (2.16)$$
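The assembly of equations 2.11 through 2.16 is a few Kronecker products; a sketch (ours, reusing laplacian_1d from the previous snippet, with the doubled D_xy block realizing the factor 2 in equation 2.11):

```python
import numpy as np

def first_diff(p):
    """(p-1) x p first-order differential operator D_p^(1) of equation 2.15."""
    D = np.zeros((p - 1, p))
    for i in range(p - 1):
        D[i, i], D[i, i + 1] = 1.0, -1.0
    return D

def thin_plate_operator(p, q):
    """Stacked operator D of equation 2.16 for a p x q rectangular lattice."""
    Dxx = np.kron(laplacian_1d(p), np.eye(q))    # equation 2.12
    Dyy = np.kron(np.eye(p), laplacian_1d(q))    # equation 2.13
    Dxy = np.kron(first_diff(p), first_diff(q))  # equation 2.14
    return np.vstack([Dxx, Dyy, Dxy, Dxy])       # Dxy appears twice: factor 2
```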
The SOM model with a rectangular inner space uses this D in its smoothing prior. We can also create boundary shapes of the inner space other than rectangles by manipulating D. In the next section, we perform a simulation experiment for this model.
2.3 Simulation Experiment. In the simulation experiment, we use artificial data generated from two types of quadric surfaces in R³: an elliptic paraboloid and a hyperbolic paraboloid, which are represented by the equations $x_3 = ax_1^2 + bx_2^2$ and $x_3 = ax_1^2 - bx_2^2$, respectively. Initially, we obtain points on the plane spanned by the x₁ and x₂ axes using a two-dimensional gaussian random generator with zero mean and a covariance matrix diag(c², d²), where diag(c², d²) is a diagonal matrix with the diagonal (c², d²). Then we obtain points on the quadric surfaces using the above equations. Finally, by adding three-dimensional random vectors to the points on the surfaces, we obtain the data points. The random vectors are generated from a gaussian random generator with zero mean and covariance σ²I₃. We use a = 0.25, b = 1, c = 0.7, d = 0.35, σ = 0.1, and n = 300. To the data sets, we apply the SOM model whose inner space is specified by a rectangular lattice with p = 10 and q = 10. The centroids and the hyperparameters are estimated by the fast search algorithm presented earlier (Utsugi, 1996). This algorithm combines the centroid estimation algorithm based on posterior maximization and the hyperparameter search algorithm based on evidence maximization. The initial values of the centroids are drawn from a gaussian random generator with zero mean and covariance I₃. The initial values of the hyperparameters are set to α = 1000 and β = 25. We attempt 50 learning sessions using different initial centroids. Figure 1 shows the configurations of the centroids giving maximum evidence out of the 50 learning sessions. We use the gaussian approximation for the calculation of the evidence (see equation 2.6). The nodes of the meshes in the figure represent the centroids. The meshes are made by linking the nearest components in the inner space. From this figure, we observe that the centroids concentrate on a center area with high data density, and this deforms the coordinate system on the surfaces. In particular, the data from the elliptic paraboloid yield a large deformation. Although the centroid configurations giving maximum evidence have different patterns for other data sets, they also show a large deformation. If a hidden manifold has small curvature and the data density on the manifold is uniform, α becomes large. In this case, the strong regularization produces a regular centroid configuration. In the simulations, α decreases because the data density on the manifold is not uniform. However, if the component selection probabilities can be adapted to the data density on the manifold, α may maintain a large value, and thus a regular centroid configuration may be obtained. In the following section, such a model is constructed.
Density Estimation
2121
hyperbolic paraboloid
1
1 3
2
0
x
x
3
elliptic paraboloid 2
−1
−1 −2
−1
0 x1
1
−2
2
2
1
1 2
2
0
x
x
2
−2
0
−1 −2
−2
−1
0 x1
1
2
−2
−1
0 x1
1
2
0
−1 −2
−1
0 x1
1
2
−2
Figure 1: Centroid configurations of the SOM model giving maximum evidence. The meshes and their nodes represent the extracted surfaces and the centroids, respectively. The dots represent the data points generated from two sorts of quadric surfaces. The graphs viewed from two angles are displayed.
3 Mixture Model with Smoothing Prior on Component Selection Probabilities 3.1 Smoothing Prior on Component Selection Probabilities. In this section, we study a mixture model with variable component selection probabilities. The simplest method for estimating the component selection probabilities is the maximum likelihood (ML) method. In this case, these parameters are regarded as completely free and estimated using an EM algorithm. However, this method produces unstable estimates, because we use many components for the detailed representation of a curvature structure on the hidden manifold, unlike the original mixture model. In particular, multidimensional manifolds require many components for the interpolation of sparse data points. Some components may have too few data points to estimate their parameters properly. To stabilize the estimation, we need a regularization prior on the component selection probabilities v . A candidate for such a prior is a Dirichlet prior, which is conjugate for the multinomial distribution on v (Ormoneit
2122
Akio Utsugi
& Tresp, 1996). However, the Dirichlet prior has a bias toward even component selection, and thus it has difficulty in capturing the large and smooth variation of the component selection probabilities. Sufficient regularization by this prior often produces the same result as the SOM model. For the stable estimation of the largely variable component selection probabilities, we use a smoothing prior on v along the inner space. Since v is a probability vector, that is, its entries are restricted to the range [0, 1] and sum into one, the direct use of a gaussian prior on v is not natural. Instead, we consider a gaussian prior on other variables µ = (µ1 , . . . , µr )0 ∈ Rr that produce v via the soft-max function, exp µs . vs = Pr t=1 exp µt
(3.1)
A similar parameter conversion is used in the mixture-of-experts model (Jordan & Jacobs, 1994; Jordan & Xu, 1995) . We can now make a gaussian smoothing prior on µ as f (µ|γ , D ) =
³ γ ´l/2 ³ γ ´ (det + M )1/2 exp − kDµk2 . 2π 2
(3.2)
However, when the second-order differential operator is used as D , the estimation of µ is very unstable. A reason for this instability is explained as follows. For the second-order differential operator, M = D 0 D has two sorts of eigenvectors with zero eigenvalues, which are a constant vector and linear-variation vectors. This means that the prior does not constrain the bias and the linear trend of µ. For the centroid parameters, their bias and linear trend can be determined properly from data without constraint. On the other hand, the linear trend of µ is difficult to determine, because it interferes with the linear trend of component intervals. We need not determine the bias of µ. Instead of this problematic prior, we use another smoothing prior with constraint on the linear trend. The new prior is based on both the first-order and second-order differential operators. The second-order differential operator D 2 has been already given by equation 2.16. The first-order differential operator D 1 is defined such that kD 1 µk2 is the discretization of the firstorder regularizer for a function µ(x, y), Z Z (µ S(µ) =
Ä
∂µ ∂x
¶2
µ +
∂µ ∂y
¶2 ) .
(3.3)
Such an operator is given by
D 1 = [D 0x , D 0y ]0 ,
(3.4)
Density Estimation
2123
where
D x = Dp(1) ⊗ I q
(3.5)
D y = I p ⊗ D (1) q .
(3.6)
The matrices Dp(1) and D (1) q are the one-dimensional first-order differential operators, defined in section 2.2. The new smoothing prior on µ is now given as f (µ|γ1 , γ2 , D 1 , D 2 ) =
1 (det + H PM )1/2 (2π )l/2 · ¸ 1 × exp − (γ1 kD 1 µk2 + γ2 kD 2 µk2 ) , 2
(3.7)
where
H PM = γ1 M 1 + γ2 M 2 ,
(3.8)
M 1 = D 01 D 1 , M 2 = D 02 D 2 , and l = rank H PM . The first and second terms in the exponential function are the first-order and second-order regularizers, respectively. The new hyperparameters, γ1 and γ2 , mean the strength of the regularization. In addition, we use a similar prior on the centroids f (w|α1 , α2 , D 1 , D 2 ) =
m Y j=1
1 (det + H PW )1/2 (2π )l/2
· ¸ 1 2 2 × exp − (α1 kD 1 w(j) k +α2 kD 2 w(j) k ) , (3.9) 2 where
H PW = α1 M 1 + α2 M 2 .
(3.10)
If α1 = 0, this prior is identical to equation 2.4. Here, we introduce the first-order regularization to observe the effect of it. Here we refer to the parameters of the model as θ = (w0 , µ0 )0 , the hyperparameters as h = (α1 , α2 , β, γ1 , γ2 ), and the topological structure as T = {D 1 , D 2 }, respectively. From the mixture likelihood (see equation 2.3) and the two smoothing priors (see equations 3.7 and 3.9), a posterior on θ is calculated by f (θ |X , h, T ) ∝ f (X , θ |h, T ) = f (X |θ , β) f (w|α1 , α2 , T ) f (µ|γ1 , γ2 , T ).
(3.11)
In the next section, we construct an MAP estimation algorithm for this posterior by a generalized EM algorithm.
2124
Akio Utsugi
3.2 MAP Estimation Algorithm for Parameters. To make the EM algorithm, we need a likelihood function for the complete data set {X , Y }, where Y is missing data consisting of the binary memberships of the data points for the components. This likelihood is given as f (X , Y |θ , β) =
r n Y Y (vs g(xi |ws , β))ysi .
(3.12)
i=1 s=1
Using this, a conditional expectation of the log posterior is given as Q(θ |θ ∗ ) = EY (log f (X , Y |θ , β)|X , θ ∗ ) + log f (w|α1 , α2 , T ) + log f (µ|γ1 , γ2 , T ) n X n X r r X β βX nm log − psi log vs + psi kxi − ws k2 = 2 2π 2 i=1 s=1 i=1 s=1 −
m m ml 1X log 2π + log det + H PW − w0 H PW w(j) 2 2 2 j=1 (j)
1 l 1 − log 2π + log det + H PM − µ0 H PM µ, 2 2 2
(3.13)
where θ ∗ is a temporary estimate and v∗ g(xi |w∗s , β) psi = p(ysi = 1|xi , θ ∗ , β) = Prs ∗ ∗ t vt g(xi |w t , β)
(3.14)
are posterior component selection probabilities. In the EM algorithm, the maximizer of this function is used as the next temporary estimate, and this updating is iterated until convergence. The centroids maximizing Q are easily obtained by solving linear equations, ∂Q ∂ w(j)
=β
n X
xij pi − β N w (j) − H PW w(j) = 0,
(3.15)
i=1
P where pi = (p1i , . . . , pri )0 and N = ni diag pi . Thus, the next temporary estimates of the centroids are given as ˆ (j) = β H −1 w QW
n X
xij pi ,
(3.16)
i=1
where
H QW = −
∂ 2Q = β N + α1 M 1 + α2 M 2 . ∂ w(j) ∂ w0(j)
(3.17)
Density Estimation
2125
This updating rule has the same form as the EM algorithm for the SOM model. On the other hand, to obtain µ maximizing Q, we need to solve a nonlinear equation, n ∂Q X = (pi − v ) − H PM µ = 0, ∂µ i=1
(3.18)
which requires an iterative method, such as a Newton-Raphson algorithm. Instead, we consider using one step of the Newton-Raphson algorithm as an updating rule for µ, which gives a generalized EM algorithm (Jordan & Xu, 1995). However, this fails for our model, since the Hesse matrix of −Q is singular. The Hesse matrix of −Q is given as
H QM = −
∂ 2Q = nV + γ1 M 1 + γ2 M 2 , ∂ µ∂ µ0
(3.19)
where V = diag v − vv 0 . We can show that H QM is nonnegative definite and has 1r , an r-dimensional vector with ones, as an eigenvector with zero eigenvalue. This singularity is due to the nonidentifiability of the model along the direction of 1r . That is, the posterior is invariable by adding any constant vector to µ. To eliminate this nonidentifiability, we fix the projection 1r . When this projection is zero, µ is restricted to a linear manifold of µ onto P given by rs µs = 0. We now maximize a Lagrange function, L(µ, λ) = Q(µ) + λ
r X
µs ,
(3.20)
s=1
rather than Q, where λ is a Lagrange multiplier. An increment vector 1µ by a Newton-Raphson algorithm for this Lagrange function is obtained by solving an equation, ·
H QM −10r
1r 0
¸·
1µ λ
¸
· =−
dQM 0
¸ ,
(3.21)
where dQM = −∂Q/∂ µ is given in equation 3.18. In the result, the updating rule for θ is given by equation 3.16 and ˆ = µ + 1µ. µ
(3.22)
3.3 Hyperparameter Selection. Evidence for h = (α1 , α2 , β, γ1 , γ2 ) is defined as Z f (X , θ |h, T )dθ , (3.23) f (X |h, T ) = 2
2126
Akio Utsugi
P where 2 = {θ : w ∈ Rrm , rs=1 µs = 0, µ ∈ Rr }. Here, we obtain a calculable formula for the evidence using a gaussian approximation (MacKay, 1992). For the gaussian approximation, we need the Hessian of the negative integrand at the maximizer of the integrand. In the present case, this maximizer is given by the MAP estimate θˆ . The Hessian must be evaluated on the restricted parameter space 2 rather than on the unrestricted parameter space Rrm+r . We can obtain the Hessian on 2 from the Hesse matrix on Rrm+r :
Hθθ = −
∂2 log f (X , θˆ |h, T ). ∂ θ∂ θ0
(3.24)
The Hessian on 2 is given by det + H θ θ . This is because H θ θ is nonnegative definite, and its only eigenvector with zero eigenvalue is orthogonal to 2. The local evidence at θˆ is defined as Z f (X , S ˆ |h, T ) =
θ
f (X , θ |h, T )dθ S
ˆ
θ
' (2π)(rm+r−1)/2 f (X , θˆ |h, T )(det + H θ θ )−1/2 ,
(3.25)
where S ˆ ⊂ 2 is a region dominated by θˆ . Although the Hesse matrix is θ obtained exactly in appendix A, it is complicated. Rather, we approximate it as log det + H θ θ ' m log det H QW + log det + H QM .
(3.26)
The log evidence is now given as µ ¶ n r X X β β nm 2 ˆ sk log + log vˆ s exp − kxi − w e(h) = 2 2π 2 i=1 s=1 −
m m 1X ml ˆ 0 H PW w ˆ (j) log 2π + log det + H PW − w 2 2 2 j=1 (j)
1 l 1 0 ˆ ˆ H PM µ − log 2π + log det + H PM − µ 2 2 2 1 rm + r − 1 log 2π. − (m log det H QW + log det + H QM ) + 2 2 (3.27) An approximate Newton-Raphson algorithm for the maximizer of this evidence is proposed in appendix B.
Density Estimation
2127
hyperbolic paraboloid
1
1 3
2
0
x
x
3
elliptic paraboloid 2
−1
−1 −2
−1
0 x1
1
−2
2
2
1
1 2
2
0
x
x
2
−2
0
−1 −2
−2
−1
0 x1
1
2
−2
−1
0 x1
1
2
0
−1 −2
−1
0 x1
1
2
−2
Figure 2: Centroid configurations of the new model giving maximum evidence.
3.4 Simulation Experiment. We apply the new model to the data used in section 2.3. The results of the previous simulations are employed as the initial values of the centroids and the hyperparameters α1 and β in the current simulations. The initial values of the other hyperparameters are set to α2 = 100, γ1 = 100, and γ2 = 1000. The initial values of the component selection probabilities are set to a constant. Figure 2 shows the configurations of the centroids with maximum evidence. By comparing them with the configurations in Figure 1, we find that the deformation of the coordinate system on the surfaces relaxes. Figure 3 shows the distributions of the estimated component selection probabilities. These distributions are similar to a gaussian distribution used in the data generation. To compare the performance of density estimation between the new model and the SOM model, we calculate an expected log likelihood (ELL) for each model. ELL is a measure of compatibility between a data distribution and a model distribution. Here we approximate ELL by log likelihood for 1000 newly generated data points. On both data sets, the new model shows larger ELL. Next, we apply the model to data under more severe conditions. First, the noise level is increased from 0.1 to 0.2. Under this condition, γ1 and γ2 grow infinitely on all learning sessions. In this case, the component selec-
2128
Akio Utsugi
elliptic paraboloid
hyperbolic paraboloid
0.04
0.04
0.02
0.02
0 10 5 0 0
5
10
0 10 5 0 0
5
10
Figure 3: Graphs of estimated component selection probabilities on the inner plane.
tion probabilities approach constant. We set the upper limit of γ1 and γ2 to 105 . The centroid configurations with maximum evidence are shown in Figure 4. In this case, the new model yields the same solution as the SOM model. Second, the data size is reduced from 300 to 100. Under this condition, the component selection probabilities become constant again. Finally, we increase the curvature of the hidden surfaces by doubling the values of a and b. The centroids and the component selection probabilities with maximum evidence are shown in Figures 5 and 6, respectively. For the hyperbolicparaboloid data, the solution has a similar pattern to the solution under the low curvature condition. On the other hand, for the elliptic-paraboloid data, the centroid configuration is largely deformed, and the component selection probabilities become monotonic by infinite growth of γ2 . 4 Discussion 4.1 Condition for Convergence into Correct Structure. In the proposed method, topology extraction is formalized as an inverse problem for a stochastic generative model. Thus, if the generative model is valid, the correct structure is given by the optimal estimate for the model. Since we use the generalized EM algorithm for the parameter estimation, convergence into a local optimal estimate is guaranteed if the hyperparameters are fixed (Dempster, Laird, & Rubin, 1977). On the other hand, the hyperparameter estimation fails if the data size is too small, as shown in the simulations. This is because empirical Bayesian inference has a bias toward simple structure. Without sufficient data for structure determination, it selects the simplest structure (Utsugi, 1997). In the current case, the simplest structure has a regular centroid configuration on a flat plane and constant component selection probabilities. In particular, the component selection probabilities easily become constant or monotonic
Density Estimation
2129
hyperbolic paraboloid
1
1 3
2
0
x
x
3
elliptic paraboloid 2
−1
−1 −2
−1
0 x1
1
−2
2
2
1
1 2
2
0
x
x
2
−2
0
−1 −2
−2
−1
0 x1
1
2
−2
−1
0 x1
1
2
0
−1 −2
−1
0 x1
1
2
−2
Figure 4: Centroid configurations giving maximum evidence in a high-noise condition.
if the flexibility of the manifold is large, because data density on the manifold can be also represented by component density. Furthermore, it is essentially difficult to fit a flat coordinate system to a high-curvature manifold globally. In conclusion, we need many data and a small curvature of the manifold for exact hyperparameter estimation. 4.2 Relation to Other Related Methods. 4.2.1 Principal Curves. Here, we consider the relationship of our model to principal curves and principal manifolds (Hastie & Stuetzle, 1989; Tibshirani, 1992). A principal curve is a generalization of a linear principal component. This is originally defined for a probability distribution. However, several algorithms to estimate it from data were also presented. In particular, Tibshirani’s algorithm is derived from a generative model, which is based on a gaussian mixture model with a constraint on its centroids. Although this generative principal curve model is similar to our model, they are different in the manner of parameterization. In the principal curve model, each data point is assumed to be generated from an individual component on a smooth manifold. That is, the com-
2130
Akio Utsugi
hyperbolic paraboloid 2
2
1
x3
x3
elliptic paraboloid 3
1 0
−1
0
−1 −2
−1
0 x
1
−2
2
−2
−1
2
2
1
1
0
−1 −2
0 x
1
2
0 x
1
2
1
x2
x2
1
0
−1 −2
−1
0 x
1
−2
2
−2
−1
1
1
Figure 5: Centroid configurations giving maximum evidence in a highcurvature condition.
elliptic paraboloid
hyperbolic paraboloid
0.04
0.04
0.02
0.02
0 10 5 0 0
5
10
0 10 5 0 0
5
10
Figure 6: Estimated component selection probabilities in a high-curvature condition.
ponents have a one-to-one correspondence with data points. We need to estimate the positions of the components on the manifold in addition to the estimation of the manifold itself. Since these position parameters are not regularized and their number increases in proportion to the data size,
Density Estimation
2131
the estimates maintain large variance even if the data size increases. This property degrades the performance of the density estimation. On the other hand, our model estimates the distribution of the data sources on the manifold, instead of estimating the positions of the data sources on the manifold. The distributions are represented by the component selection probabilities, whose number is fixed independent of the data size. In addition, these parameters are regularized by the smoothing prior. Thus, the effective number of the parameters is considerably reduced in comparison with the principal curve model. We can say that our model is a regularized version of the principal curve model. Thus, our model is expected to make a better estimation than the principal curve model if the assumption of the regularization is valid. Moreover, our model is more suitable for Bayesian inference, including the determination of the hyperparameters, because it is governed by a few free hyperparameters. 4.2.2 Autoassociative Bottleneck Networks. It is known that three-layered autoassociative neural networks with backpropagation learning are related to principal component analysis (PCA) if their hidden units are less than their input-output units. Furthermore, five-layered autoassociative networks can perform nonlinear PCA if they have more units in the second and fourth layers and fewer units in the third layer than the input-output layers (Diamantaras & Kung, 1996). This nonlinear PCA net produces fully distributed representation in the third layer, while the SOM model produces a localized activity blob in the inner unit layer. The nonlinear PCA net represents smooth mapping between the inner space and the data space by a pair of three-layered neural networks. Although the three-layered neural networks can emulate any smooth function, they require many inner units for the emulation of nonlinear mapping. Thus, their estimation becomes unstable for small data. Although we can introduce the regularization of the parameters, this smoothness representation is not natural. On the other hand, our model represents the smoothness directly using a spline regularizer. Thus, under empirical Bayesian inference, our model is expected to acquire a finer estimate than the nonlinear PCA net. 4.2.3 Generative Topographic Mapping. Bishop, Svensen, and Williams (1998) recently proposed a new method to extract topological structure from data, called generative topographic mapping (GTM). This method is also based on a gaussian mixture model with constrained centroids and an EM algorithm. However, GTM represents a smooth mapping between the inner space and the data space using a generalized linear network. Thus, the performance of this method depends on the choice of a basis function system in the generalized linear network. It is known that the spline regularization method is also expressed by a generalized linear network (Poggio & Girosi, 1990). Thus, GTM and our model may be two different model representations of an identical principle.
2132
Akio Utsugi
However, our model representation is more suitable for empirical Bayesian inference, because the smoothness is expressed directly by a prior probability on the parameters. 5 Conclusion A stochastic model for SOM with a multidimensional inner space was constructed as a mixture model with thin-plate spline regularization on the centroids. Then estimation on the model was attempted using artificial data. In the experiment, we observed that the model has difficulty caused by constant component selection probabilities. To overcome this difficulty, we constructed a mixture model with variable component selection probabilities. For the stabilization of the estimation, a smoothing prior on the component selection probabilities was introduced. Although introducing variable component selection probabilities was effective for reducing the deformation of the coordinate systems on lowcurvature manifolds, the deformation on high-curvature manifolds was not removed. This is a limitation in fitting a manifold with a single coordinate system into data. Instead, we can consider a model with multiple local coordinate systems. However, such a complicated model will require an efficient topology search algorithm. Appendix A We obtain the Hesse matrix of the negative log posterior with respect to the parameters. First, the derivatives of the log posterior with respect to the parameters are obtained. These have the same forms as the derivatives of Q, given by equations 3.15 and 3.18, except that the posterior component selection probabilities psi in the formulas are calculated using θ rather than θ ∗ . Thus, in calculating the Hesse matrix by differentiating the derivatives, we must consider the differentiation of psi . We now obtain the Hesse matrix consisting of the following submatrices: −
n X ∂ 2G 2 = −β (δst psi − psi pti )(xi − ws )(xi − wt )0 ∂ ws ∂ w0t i=1
− −
∂ 2G ∂ µ∂ µ0
+(βns δst + hst )I m n X =− (diag pi − pi p0i − V ) + H PM i=1 n X
∂ 2G = −β ∂µs ∂ w0t
(δst psi − psi pti )(xi − wt ),
(A.1) (A.2) (A.3)
i=1
where G is the log posterior and hst are the entries in H PW . If all psi have
Density Estimation
2133
P binary values and satisfy a probability condition rs=1 psi = 1, the above Hesse matrix becomes identical to the Hesse matrix of −Q, given by equations 3.17 and 3.19. Since psi are the estimates of the binary memberships ysi , they are near binary values if each component has sufficient identity. In this case, we can substitute the Hesse matrix of −Q for that of −G approximately. Appendix B Here, we obtain a hyperparameter search algorithm by an approximate Newton-Raphson algorithm. First, we obtain the gradients of the log evidence with respect to the hyperparameters: n X r nm 1 X m ∂e ˆ s k2 − tr(H −1 = − psi kxi − w QW N ) ∂β 2β 2 i=1 s=1 2
(B.1)
m 1X m m ∂e ˆ (j) k2 − tr(H −1 = tr(H + M ) − kD o w o PW QW M o ) ∂αo 2 2 j=1 2
(B.2)
1 1 1 ∂e ˆ k2 − tr(H + = tr(H + PM M o ) − kD o µ QM M o ) ∂γo 2 2 2
(B.3)
where o = 1, 2 and H + denotes the Moore-Penrose generalized inverse of H . In this calculation, the variation of θˆ with the hyperparameters is neglected. Next, we obtain the Hesse matrix of the log evidence, whose diagonal is given by nm m ∂ 2e 2 = − 2 + tr(H −1 QW N ) 2 ∂β 2β 2
(B.4)
m ∂ 2e −1 2 2 = − tr{(H + PW M o ) − (H QW M o ) } ∂αo2 2
(B.5)
1 ∂ 2e + 2 2 = − tr{(H + PM M o ) − (H QM M o ) }. ∂γo2 2
(B.6)
For simplicity, we employ only the diagonal in the Hesse matrix. Although we can now make a Newton-Raphson algorithm for the hyperparameters, we should use that for the logarithms of the hyperparameters, because the hyperparameters are restricted to positive values. The gradients and the
2134
Akio Utsugi
Hesse matrix with respect to the log hyperparameters can be calculated using formulas such as ∂e ∂e =β ∂ log β ∂β
(B.7)
∂ 2e ∂e ∂ 2e + β2 2 . =β 2 ∂(log β) ∂β ∂β
(B.8)
We now obtain both updating rules for the parameters and the hyperparameters. Strictly speaking, the hyperparameter-updating rule should be used after obtaining the MAP estimate of the parameters under fixed hyperparameters. We actually use these rules alternately, and this accelerates the estimation procedure. References Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc., Ser. B, 39, 1–38. Diamantaras, K. I., & Kung, S. Y. (1996). Principal component neural networks: Theory and applications. New York: Wiley. Hanson, R., Stutz, J., & Cheeseman, P. (1991). Bayesian classification theory (Tech. Rep. No. FIA-90-12-7-01). Moffet Field, CA: NASA Ames Research Center. Hastie, T., & Stuetzle, W. (1989). Principal curves. J. Amer. Statist. Assoc., 84, 502–516. Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214. Jordan, M. I., & Xu, L. (1995). Convergence results for the EM approach to mixtures of expert architectures. Neural Networks, 8, 1409–1431. Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag. MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448–472. McLachlan, G. J., & Basford, K. E. (1988). Mixture models: Inference and applications to clustering. New York: Marcel Dekker. Ormoneit, D., & Tresp, V. (1996). Improved gaussian mixture density estimates using Bayesian penalty terms and networks averaging. In D. S. Touretzky, M. C. Mozer, M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 542–548). Cambridge, MA: MIT Press. Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proc. IEEE, 78, 1481–1497. Tibshirani, R. (1992). Principal curves revisited. Statistics and Computing, 2, 183– 190. Utsugi, A. (1996). Topology selection for self-organizing maps. Network, 7, 727– 740.
Density Estimation
2135
Utsugi, A. (1997). Hyperparameter selection for self-organizing maps. Neural Computation, 9, 623–635. Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM. Received January 31, 1997; accepted January 14, 1998.
LETTER
Communicated by Todd Leen
Complexity Issues in Natural Gradient Descent Method for Training Multilayer Perceptrons Howard Hua Yang Department of Computer Science, Oregon Graduate Institute, Portland, OR 97291, U.S.A.
Shun-ichi Amari Laboratory for Information Synthesis, RIKEN Brain Science Institute, Saitama 351-01, Japan
The natural gradient descent method is applied to train an n-m-1 multilayer perceptron. Based on an efficient scheme to represent the Fisher information matrix for an n-m-1 stochastic multilayer perceptron, a new algorithm is proposed to calculate the natural gradient without inverting the Fisher information matrix explicitly. When the input dimension n is much larger than the number of hidden neurons m, the time complexity of computing the natural gradient is O(n). 1 Introduction Amari (1997, 1998) has shown that the natural gradient descent learning rule is statistically efficient. In principle, this learning rule can be used to train any adaptive system, but the complexity of this rule depends on the architecture of the learning machine. The main difficulty in implementing the rule is designing a fast algorithm to compute the natural gradient. For an n-m-1 multilayer perceptron, we show in this article that the natural gradient can be computed in O(n) flops when m ¿ n. Here, a flop is a floating-point operation: an add or a multiplication. Orr and Leen (1997) used the curvature information (the Hessian matrix) in the nonlinear adaptive momentum scheme to optimize the convergence of the stochastic gradient descent. They have shown that the complexity of their algorithm is O(n). However, their algorithm is different from the natural gradient descent. Instead of using the momentum scheme, we use the inverse of the Fisher information matrix to transform the stochastic gradient to optimize the learning dynamics. This method is based on statistical inference. A review of the learning of artificial neural networks is given by Yang, Murata, and Amari (1998). The rest of this article is organized as follows. The stochastic multilayer perceptron is described in section 2. The natural gradient descent method using the inverse of the Fisher information matrix and its properties are c 1998 Massachusetts Institute of Technology Neural Computation 10, 2137–2157 (1998) °
2138
Howard Hua Yang and Shun-ichi Amari
discussed in section 3. A constructive procedure to compute the inverse of the Fisher information matrix is formulated in section 4. For a singlelayer perceptron, we calculate an explicit expression of the natural gradient in section 4.1. It is obvious from this expression that the time complexity for computing the natural gradient for the single-layer perceptron is O(n). Importantly, this is also true for a multilayer perceptron, but its proof is not straightforward. To prove this result, we analyze the structure of the Fisher information matrix in section 4.2. Based on this analysis, we discuss the time complexities of computing the inverse of the Fisher information matrix and the natural gradient in section 5 for the multilayer perceptron. We first discuss the simplest multilayer perceptron, a committee machine, in section 5.1, and then discuss the multilayer perceptron in section 5.2. The conclusions are summarized in section 6. 2 Stochastic Multilayer Perceptron We consider the following stochastic multilayer perceptron model: z=
m X
ai ϕ(wTi x + bi ) + ξ,
(2.1)
i=1
where (·)T denotes the transpose, ξ ∼ N(0, σ 2 ) is the gaussian additive noise with the variance σ 2 , and ϕ(x) is a differentiable output function of hidden neurons. Assume that the multilayer network has an n-dimensional input, m hidden neurons, a one-dimensional output, and m ≤ n. Denote a = (a1 , · · · , am )T the weight vector of the output neuron, wi = (w1i , · · · , wni )T the weight vector of the ith hidden neuron, and b = (b1 , · · · , bm )T the vector of thresholds for the hidden neurons. Let W = [w1 , · · · , wm ] be a matrix formed by column weight vectors wi ; then equation 2.1 can be rewritten as z = aT ϕ(W T x + b) + ξ. Here, the scalar function ϕ operates on each component of the vector W T x + b. 3 Fisher Information Matrix and the Natural Gradient Descent The stochastic perceptron model enables us to derive learning algorithms for neural networks from some statistical inference methods and evaluate the performance of the algorithms systematically. The joint probability density function (pdf) of the input and the output is p(x, z; W , a, b) = p(z|x; W , a, b)p(x), where p(z|x; W , a, b) is the conditional pdf of the output z given the input
Natural Gradient Descent Method
2139
x and p(x) is the pdf of the input x. The loss function is defined as the negative log-likelihood function L(x, z; θ ) = − log p(x, z; θ ) = l(z|x; θ ) − log p(x), where
θ = (wT1 , · · · , wTm , aT , bT )T , and 1 (z − aT ϕ(W T x + b))2 . l(z|x; θ ) = − log p(z|x; θ ) = 2σ 2 Since p(x) does not depend on θ , minimizing the loss function L(x, z; θ ) is equivalent to minimizing l(z|x; θ ). Given the training set DT = {(xt , zt ), t = 1, · · · , T}, minimizing the loss function L(DT ; θ ) = minimizing the training error
QT
t=1 L(xt , zt ; θ )
is equivalent to
T 1X (zt − aT ϕ(W T xt + b))2 . Etr (θ , DT ) = T t=1
Since
∂L ∂θ
=
∂l , ∂θ
the Fisher information matrix is defined by ·
G = G(θ ) = Eθ
¸ · ¸ ∂L ∂L T ∂l ∂l T ( ) = Eθ ( ) , ∂θ ∂θ ∂θ ∂θ
(3.1)
where Eθ [·] denotes the expectation with respect to p(x, z; θ ). Let 2 be a parameter set and P = {p(x, z; θ ) : θ ∈ 2} be a family of pdfs parameterized by θ ∈ 2. The Kullback-Leibler divergence between two pdfs in P is Z p(x, z; θ ) 0 . D(θ , θ ) = dxdz p(x, z; θ ) log p(x, z; θ 0 ) Let G(θ ) be the Riemannian metric tensor for the Riemannian space 2. The squared length of a small dθ in 2 is T G(θ ) = dθ G(θ )dθ .
kdθ k2
The two spaces P and 2 are naturally related by the following equation: D(θ , θ 0 ) =
1 T dθ G(θ )dθ + O(kdθ k3 ), 2
where θ 0 = θ + dθ and dθ is small.
2140
Howard Hua Yang and Shun-ichi Amari
Amari (1998) showed that in the Riemannian space 2, for any differentiable loss function F(θ ), the steepest descent direction is given by −G−1 (θ )
∂F . ∂θ
Based on this result, Amari (1997) proposed the following natural gradient descent algorithm:
θ t+1 = θ t −
∂F µ −1 G (θ ) , t ∂θ
where µ is a learning rate. In particular, when negative log-likelihood function is chosen as the loss function, the above algorithm is known as the method of scoring in statistics. Amari (1998) showed that this algorithm gives a Fisher efficient online estimator; the asymptotic variance of θ t driven by equation 3.5 satisfies E[(θ t − θ ∗ )(θ t − θ ∗ )T | θ ∗ ] ≈
1 −1 ∗ G (θ ), t
(3.2)
which gives the mean square error E[kθ t − θ ∗ k2 | θ ∗ ] ≈
1 Tr(G−1 (θ ∗ )). t
(3.3)
The above property is verified by the simulation results in Yang and Amari (1998). Yang and Amari (1997a) have shown that G(θ ) = σ12 A(θ ) where A(θ ) does not depend on σ 2 (the variance of the additive noise). Define l1 (z|x; θ ) =
1 (z − aT ϕ(W T x + b))2 2
and e 1 (z|x; θ ) = A−1 (θ ) ∂l1 (z|x; θ ). ∇l ∂θ
(3.4)
To minimize the training error, we propose the following natural gradient descent algorithm:
θ t+1 = θ t −
µe ∇l1 (zt |xt ; θ t ), t
(3.5)
e 1 (zt |xt ; θ t ) does not depend on σ 2 . where the natural gradient ∇l To implement this algorithm, we need to compute the natural gradient e 1 (zt |xt ; θ t ) in each iteration. If we compute the inverse of A(θ t ) first, then ∇l the natural gradient, the time complexity is O(n3 ).
Natural Gradient Descent Method
2141
We proposed a method (Yang & Amari, 1997b) based on the conjugate gradient method to compute the natural gradient without inverting the Fisher information matrix. The idea is to solve the following linear equation, ∂l1 (zt |xt ; θ t ), ∂θ e 1 (zt |xt ; θ t ) without inverting the matrix A(θ t ). Since the matrix for y = ∇l A(θ ) is (n + 2)m × (n + 2)m, it will take at most (n + 2)m steps by the e 1 (zt |xt ; θ t ) and each step needs conjugate gradient method to compute ∇l O(n) flops to compute a matrix vector product. The number of flops needed to compute the natural gradient by the conjugate gradient method is O(n2 ) when m ¿ n and O(n3 ) when m = O(n). Using the conjugate gradient method to compute the natural gradient is also useful for other probability families when the method of scoring is used to find the maximum likelihood estimates (MLEs). However, the conjugate gradient method is still slow in computing the natural gradient for training the stochastic perceptron. We shall explore the structure of the Fisher information matrix of the stochastic perceptron and develop an algorithm that computes the natural gradient in just O(n) flops.
A(θ t )y =
4 Computing the Natural Gradient for a Stochastic Perceptron To find an analytic form of the Fisher information matrix, we assume a white gaussian input x ∼ N(0, I ) with an identity matrix I as the covariance matrix of the input. Yang and Amari (1997a) proposed an efficient scheme to represent the Fisher information matrix for the multilayer perceptron. Due to this scheme, when m ¿ n, the storage space needed for the Fisher information matrix is O(n) units rather than O(n2 ) units, where a unit is the memory space to keep a parameter or a variable. Based on this scheme, we found a constructive procedure to compute the inverse of the Fisher information matrix with O(n2 ) flops. This procedure can be improved to compute the natural gradient with only O(n) flops. 4.1 Single-Layer Perceptron. We shall give some explicit expressions of the Fisher information matrix and the natural gradient for a one-layer stochastic perceptron. Since z = ϕ(wT x + b) + ξ, we have ξ ∂l = − 2 ϕ 0 (wT x + b)x, ∂w σ ξ ∂l = − 2 ϕ 0 (wT x + b). ∂b σ The Fisher information matrix is · 1 1 A11 G(θ ) = 2 A(θ ) = 2 a21 σ σ
a12 a22
¸ ,
2142
Howard Hua Yang and Shun-ichi Amari
where θ = (wT , b)T ,
A11 = E[(ϕ 0 (wT x + b))2 xxT ], a12 = aT21 = E[(ϕ 0 (wT x + b))2 x], a22 = E[(ϕ 0 (wT x + b))2 ]. Let w = kwk and u1 = w/w and extend u1 to an orthogonal P basis {u1 , · · · , un } for
(4.1)
since x ∼ N(0, I ). The outer product xxT can be written as
xxT = x21 u1 uT1 +
n X
n X
xj x1 (uj uT1 + u1 ujT ) +
j=2
xj xk uj uTk .
(4.2)
j,k=2
Noticing wT x = wx1 and applying equations 4.1 and 4.2 we have,
A11 = E[(ϕ 0 (wT x + b))2 xxT ] = E[(ϕ 0 (wx1 + b))2 x21 ]u1 uT1 n X +E[(ϕ 0 (wx1 + b))2 x1 ] E[xj ](uj uT1 + u1 ujT ) j=2
+E[(ϕ 0 (wx1 + b))2 ]
n X
E[xj xk ]uj uTk
j,k=2
= E[(ϕ 0 (wx1 + b))2 x21 ]u1 uT1 + E[(ϕ 0 (wx1 + b))2 ]
n X
uj ujT .
j=2
Let us define three integrals: Z ∞ x2 1 (ϕ 0 (wx + b))2 e− 2 dx > 0, d0 (w, b) = √ 2π −∞ Z ∞ x2 1 (ϕ 0 (wx + b))2 xe− 2 dx, d1 (w, b) = √ 2π −∞ Z ∞ x2 1 (ϕ 0 (wx + b))2 x2 e− 2 dx > 0. d2 (w, b) = √ 2π −∞ Then we have
A11 = d2 (w, b)u1 uT1 + d0 (w, b)
n X
uj ujT ,
j=2
a12 = E[(ϕ 0 (wx1 + b))2 x1 ]u1 = d1 (w, b)u1 , and a22 = d0 (w, b).
(4.3) (4.4) (4.5)
Natural Gradient Descent Method
2143
A11 can be written as A11 = d0 (w, b)I + (d2 (w, b) − d0 (w, b))u1 uT1 , since {ui } are orthogonal and n X
ui uTi = I − u1 uT1 .
k=2
Summarizing previous calculations for the blocks in G(θ ), we have
A11 = d0 (w, b)I + (d2 (w, b) − d0 (w, b))wwT /w2 , a12 = aT21 = d1 (w, b)w/w, a22 = d0 (w, b).
(4.6) (4.7)
Saad and Solla (1995) used the scaled error function Z √x 2 x 2 2 e−t dt ϕ(x) = erf( √ ) = √ π 0 2 as the sigmoid function for the hidden units in order to obtain an analytic expression of the generalization error. If we choose ϕ(x) as the sigmoid function for the hidden neurons, we obtain the following close forms for the integrals 4.3 through 4.5: √ 2 b2 w2 }, exp{−b2 + 2 d0 (w, b) = √ w + 0.5 π w2 + 0.5 1 w2 b2 1 ( + 2 ), d2 (w, b) = d0 (w, b) 2 w + 0.5 2 w + 0.5 √ b2 w2 2wb 2 }. exp{−b + d1 (w, b) = − π(w2 + 0.5)3/2 w2 + 0.5 Even for the single-layer perceptron, the size of the Fisher information matrix is (n + 1) × (n + 1). However, the Fisher information matrix can be generated online with O(n2 ) flops by equations 4.6 and 4.7. If one wants to trace the Fisher information matrix, one need only O(n) units to store w and b in each iteration. To compute the inverse of A(θ ), we need the following well-known inverse formula of a four-block matrix: · Lemma 1.
B 11 B 21
B 12 B 22
¸−1
· =
B 11 B 21
B 12 B 22
¸
provided |B 11 | 6= 0 and |B 22 − B 21 B −1 11 B 12 | 6= 0. Here, −1 −1 −1 B 11 = B −1 + B B B B B 12 22,1 21 11 , 11 11 B 22,1 = B 22 − B 21 B −1 11 B 12 , B 22 = B −1 22,1 , −1 and B 12 = (B 21 )T = −B −1 11 B 12 B 22,1 .
2144
Howard Hua Yang and Shun-ichi Amari
Applying the above inverse formula of a four-block matrix, we obtain · 11 ¸ A a12 −1 A (θ ) = a21 a22 where
a12 a22
µ
¶
d21 wwT wwT + , 2 2 w d2 (d0 d2 − d1 ) w2 d1 d2 = (a21 )T = − w, d2 (d0 d2 − d21 ) d2 = . d0 d2 − d21
A11 =
1 I+ d0
1 1 − d2 d0
(4.8) (4.9) (4.10)
By using equations 4.8 and 4.9, the time complexity for computing A−1 (θ ) is O(n2 ). If we compute the inverse of the Fisher information matrix and the function l1 (z|x; θ ) in two separate steps and then multiply them together, the time complexity for computing the natural gradient is O(n2 ). Instead of computing the natural gradient in this way, we compute it in a single step by applying equations 4.8 and 4.9: e 1 (z|x; θ ) = −h(x, z; w, b) ∇l T d21 x +(1 − 1 + ) ww2x w − d0 d2 d0 d2 (d0 d2 −d21 ) × T x + d2 − dd1(dd2 w d −d2 ) d d −d2 2
0 2
1
0 2
(4.11)
d1 d2 w d2 (d0 d2 −d21 )
,
1
where h(x, z; w, b) = (z − ϕ(wT x + b))ϕ 0 (wT x + b). It needs only O(n) flops e 1 (z|x; θ ) by equation 4.11. to compute the natural gradient ∇l 4.2 Multilayer Perceptron. The main difficulty in implementing the natural gradient descent algorithm (see equation 3.5) is to compute the natural gradient online. The number of parameters in the stochastic perceptron defined by equation 2.1 is m(n + 2), so the size of the Fisher information matrix is m(n + 2) × m(n + 2). In this article, we assume m ¿ n. To compute the inverse of an n × n matrix, it generally needs O(n3 ) flops by commonly used methods such as the lower and upper triangular factorization together with the triangular matrix inverse (Stewart, 1973) . In the previous section, we showed some procedures for computing the Fisher information matrix and the natural gradient for a one-layer perceptron. Following similar procedures, Yang and Amari (1997a) found a scheme to represent the Fisher information matrix for multilayer perceptrons. Based on this scheme, we gave a method that requires O(n2 ) flops to compute the inverse of the Fisher information matrix. In this article, we adapt this method
Natural Gradient Descent Method
2145
e 1 (z|x; θ ). We shall show that the adapted to compute the natural gradient ∇l method requires only O(n) flops to compute the natural gradient. We first briefly describe this scheme to represent the Fisher information matrix and then introduce the method to compute the natural gradient. We need some notations for block matrix. An m × n block matrix X is denoted by X = [X ij ][m×n] or X = [X ij ]i=1,···,m,j=1,···,n . Let
A(θ ) = [Aij ][(m+2)×(m+2)] be a partition of A(θ ) corresponding to the partition of
θ = (wT1 , · · · , wTm , aT , bT )T . Define
ui = wi /kwi k, i = 1, · · · , m, U 1 = [u1 , · · · , um ], and [v 1 , · · · , v m ] = U 1 (U T1 U 1 )−1 . ij 0 T Let rij = uTi uj , R1 = (rij )m×m and R−1 1 = (r )m×m . Define xi = ui x for i = 1, · · · , m; then
(x01 , · · · , x0m ) ∼ N(0, R1 ). The structure of the blocks in A(θ ) is elucidated by the following two lemmas proved in Yang and Amari (1997a), which gives an efficient scheme to represent the Fisher information matrix for the stochastic multilayer perceptron. Lemma 2.
For 1 ≤ i, j ≤ m, and ai 6= 0,
Aij = ai aj (cij Ä0 +
m X
T clk ij ul v k )
(4.12)
l,k=1
where Ä0 =
n X k=m+1 0
uk uTk = I −
cij = E[ϕ (wi x0i + "
m X
uk v Tk , k=1 bi )ϕ 0 (wj xj0 + bj )],
0 0 0 0 clk ij = E ϕ (wi xi + bi )ϕ (wj xj + bj )
(4.13)
à m X s=1
!
#
rls x0s x0k .
(4.14) (4.15)
2146
Howard Hua Yang and Shun-ichi Amari
For 1 ≤ i ≤ m, Ã ! m m X X T k k Ai,m+1 = Am+1,i = ci1 v k , · · · , cim v k ,
Lemma 3.
k=1
(4.16)
k=1
where ckij = E[ϕ 0 (wi x0i + bi )ϕ(wj xj0 + bj )x0k ], 1 ≤ i, j, k ≤ m.
(4.17)
Ai,m+2 has the same structure as Ai,m+1 : Ai,m+2 =
ATm+2,i
=
à m X
e ci1k v k , · · · ,
m X
k=1
! k e vk cim
,
(4.18)
k=1
where e cijk = ai aj E[ϕ 0 (wi x0i + bi )ϕ 0 (wj xj0 + bj )x0k ], 1 ≤ i, j, k ≤ m.
Am+1,m+1 = (bij )m×m
(4.19)
with bij = E[ϕ(wi x0i + bi )ϕ(wj xj0 + bj )] is a function of bi , bj , wi , wj and rij .
Am+1,m+2 = ATm+2,m+1 = (e bij )m×m ,
(4.20)
with e bij = aj E[ϕ(wi x0i + bi )ϕ 0 (wj xj0 + bj )].
Am+2,m+2 = (b0ij )m×m
(4.21)
with b0ij = ai aj E[ϕ 0 (wi x0i + bi )ϕ 0 (wj xj0 + bj )]. √ Assume ϕ(x) = erf(x/ 2) where erf(u) is the error function. It is not difk ficult to obtain analytic expressions for cij , b0ij , clk ij , and c˜ij . When bi = 0, i = 1, · · · , m, applying the multivariate gaussian average method in Saad and Solla (1995), we can obtain analytic expressions for these coefficients: ckij , bij , and b˜ij . But when bi 6= 0, the analytic expressions for these coefficients seemingly do not exist. These coefficients are m-dimensional integrals, which can be approximated offline by some numerical integral methods. Recalling that n is the input dimension and m is the number of hidden neurons, in this article, we assume m ¿ n. Using lemmas 2 and 3, instead of using O(n2 ) units to store the Fisher information matrix, we only need O(n) units to store the vectors uk and v k . O(n2 ) flops are needed to construct the Fisher information matrix from these vectors.
Natural Gradient Descent Method
2147
5 Complexity Issues We shall discuss the time complexity to compute A(θ ), A−1 (θ ) and the natural gradient for the multilayer perceptron. By lemmas 2 and 3, to compute the blocks in A(θ ), we need to compute uk and v k , k = 1, · · · , m. It needs O(n) flops to compute uk and v k by their definitions. The time complexity to compute A(θ ) is O(n2 ) since each outer product uk v Tk needs O(n2 ) flops. It is shown in Yang and Amari (1997a) that about O(n2 ) flops are needed to compute the inverse of the Fisher information matrix. If we compute the Fisher information matrix first and then compute the natural gradient, this will take O(n2 ) flops in each iteration to implement the natural gradient descent. This is in the same order as the conjugate gradient method proposed by Yang and Amari (1997b). In the rest of this article, we show that the time complexity of computing the natural gradient for the multilayer perceptron can be significantly improved by using the representation scheme in lemmas 2 and 3 and by computing the natural gradient without inverting the Fisher information matrix. We shall show that we can compute the natural gradient directly with only O(n) flops. To show this, we reexamine the process of computing the inverse of the Fisher information matrix. We need the following notations:
Gl(m, <) = {A ∈ <m×m : det(A) 6= 0}. m X M = a0 Ä0 + aij ui vjT : a0 6= 0, A = (aij ) ∈ Gl(m, <) . i,j=1
M=
m X
i,j=1
a0 Ä0 +
aij ui vjT
(the closure of M).
f = {[a0 , A] : a0 6= 0, A ∈ Gl(m, <)}. M f ). f = {[a0 , A] : a0 ∈ <, A ∈ <m×m }( the closure of M M
f. An element in M f is denoted by [a0 , A] or [a0 , (aij )]. f is a subset of M M f →
e = a 0 Ä0 + ψ(A)
m X i,j=1
f. e = [a0 , (aij )] ∈ M aij ui vjT ∈ M, for A
2148
Howard Hua Yang and Shun-ichi Amari
f, we define multiplication, summae = [a0 , A] and B e = [b0 , B ] ∈ M For A f by tion, and scaling in M e? B e = [a0 b0 , AB ], A e+ B e = [a0 + b0 , A + B ], A a[a0 , A] = [aa0 , aA]. f, its inverse f is a group. For A e = [a0 , A] ∈ M It is easy to verify that M −1 1 −1 e element is A = [ a0 , A ]. f under the mapping ψ. It is not difficult Note that M is the image set of M f → M is an isomorphism, meaning that ψ is one-toto show that ψ : M f. This implies that M is a e B) e = ψ(A e ? B) e for A e and B e∈ M one and ψ(A)ψ( P T subgroup of Gl(n, <) and for C = a0 Ä0 + m i,j=1 aij ui vj ∈ M,
C −1 =
m X 1 Ä0 + aij ui vjT a0 i,j=1
(5.1)
where the matrix (aij ) is the inverse of the matrix (aij )−1 . The following lemma, proved in Yang and Amari (1997a), gives the relations between M and M. Lemma 4.
M=M
\
Gl(n, <).
(5.2)
We shall apply equation 5.1 and lemma 4 to study the structure of A−1 (θ ). 5.1 Committee Machine. Let us first consider a committee machine that is a multilayer perceptron with ai = 1 and bi = 0 for i = 1, · · · , m. Repeatedly applying the inverse formula of the four-block matrix, formula 5.1, and lemma 4, we know that A−1 (θ ) has the same structure as A(θ ): A−1 (θ ) = [Aij ]i=1,···,m,j=1,···,m , where every Aij ∈ M. In this process, all matrix multiplications, summations, and scalings are carried out in the f. Therefore, f, and all matrix inversions are made in the group M space M f is O(m4 ), which the time cost for computing the representations of Aij in M ij does not depend on n. Only when each block A is realized in the matrix space
Natural Gradient Descent Method
2149
Let
e (∇l1 )w1 .. e1= ∇l . e (∇l1 )wm
e 1 corresponding to the partition of θ = (wT , · · · , wTm )T . be a partition of ∇l 1 e Here each (∇l1 )wi is an n × 1 vector. From the block representation of A−1 (θ ), we have e 1 )wi = (∇l
m X
Aij
j=1
∂l1 . ∂ wj
Noticing Aij ∈ M, we have
Aij = bij Ä0 +
m X
T bkl ij uk v l ,
k,l=1
where the coefficients bij and bkl ij are determined in the process of computing
A−1 (θ ). From the above expression of Aij , we obtain " ¶ µ ¶ # m µ m m X X X ∂l1 T ∂l1 kl T ∂l1 e −bij vk uk + bij v l uk . (5.3) bij (∇l1 )wi = ∂ wj ∂ wj ∂ wj j=1 k=1 k,l=1
It requires only the vector inner product, the vector addition, and no matrix multiplication to compute the above expression, so the time complexity for e 1 is O(n). computing the natural gradient ∇l In summary, for the committee machine, O(n2 ) flops due to the vector outer product and the matrix addition are required to compute the inverse of the Fisher information matrix, and O(n) flops due to vector inner product and vector addition are needed to compute the natural gradient by formula 5.3. In the next section, we show that this conclusion is also true for the multilayer perceptron. 5.2 Multilayer Perceptron. Recalling that
A(θ ) = [Aij ][(m+2)×(m+2)] is a partition of A(θ ) corresponding to the partition of
θ = (wT1 , · · · , wTm , aT , bT )T , we consider a new partition of A(θ ) as: ¸ · B 11 B 12 , A(θ ) = T
B 12
B 22
2150
Howard Hua Yang and Shun-ichi Amari
where
B 11 = [Aij ]i=1,···,m,j=1,···,m , B 12 = [Aij ]i=1,···,m,j=m+1,m+2 , B 22 = [Aij ]i=m+1,m+2,j=m+1,m+2 . Assuming that both B 11 and B 22 − B 21 B −1 11 B 12 are nonsingular and applying the four-block inverse formula, we have −1
·
A (θ ) =
B 11 B 21
B 12 B 22
¸ ,
where B ij are defined in the same way as lemma 1. To characterize the structure of these blocks B ij in A−1 (θ ) we need some notations and lemmas. Define
Mu = {A ∈
m X i=1
ai1 ui , · · · ,
m X
# aim ui
,
i=1
where the coefficients aji are computed in O(n) flops. The proof of lemma 5 is given in appendix A. Lemma 6. Let A ∈ Mu and B ∈ Mv ; then the complexity for computing the m × m matrix B T A is O(n). The proof of lemma 6 is given in appendix B.
Natural Gradient Descent Method
2151
Lemma 7. If A ∈ Mu and B ∈ <m×m , then AB ∈ Mu . AB has the following form, "
m X
bi1 ui , · · · ,
i=1
m X
# bim ui
,
i=1
where the coefficients bji are computed in m3 flops. The proof of lemma 7 is given in appendix C. Let us recall the definition of B 11 : −1 −1 −1 B 11 = B −1 11 + B 11 B 12 B 22,1 B 21 B 11 ,
B 22,1 = B 22 −
(5.4)
B 21 B −1 11 B 12 .
The matrix B 11 has the same structure as the Fisher information matrix for the committee machine, so B −1 11 is an m × m matrix with each of its blocks belonging to M. Therefore, the time complexity for computing B −1 11 is the same as that for the committee machine. However, we shall show below −1 −1 that the blocks in B −1 11 B 12 B 22,1 B 21 B 11 do not belong to M, and the time complexity for computing these blocks needs to be discussed separately. To discuss the time complexity of computing the blocks B ij , we need the following lemma: Lemma 8.
The m × 2 block matrix B −1 11 B 12 has the form
B −1 11 B 12 = [X ij ][m×2] , where
X ij =
" m X k=1
xk1 ij uk , · · · ,
m X
# xkm ij uk
,
(5.5)
k=1
and O(n) flops are needed to compute the coefficients xkl ij . The proof of lemma 8 is given in appendix D. It takes O(n) flops to compute the 2m × 2m matrix B 22,1 . −1 −1 B −1 B B 12 11 22,1 B 21 B 11 = [Z ij ][m×m] is an m × m block matrix and each block Z ij
Lemma 9.
2152
Howard Hua Yang and Shun-ichi Amari
has the following form:
Z ij =
m X
T zkl ij uk ul .
k,l=1
It takes O(n) flops to compute these coefficients zkl ij in the above expression. The proof of lemma 9 is given in appendix E. Theorem 1. The complexity for computing B 22 = B −1 22,1 is O(n). The m × 2 block matrix B 12 has the following form: −1 B 12 = −B −1 11 B 12 B 22,1 = −[Y ij ][m×2] ,
where
Y ij =
" m X
yk1 ij uk , · · · ,
k=1
m X
# ykm ij uk
,
(5.6)
k=1
and O(n) flops are needed to compute the coefficients ykl ij . The m × m block matrix B 11 has the following form: e ij ][m×m] , B 11 = [B where e ij = aij Ä0 + B
m X k,l=1
ij
bkl uk v Tl +
m X
ij
zkl uk uTl ,
(5.7)
k,l=1 ij
ij
and O(n) flops are needed to compute these coefficients aij , bkl and zkl . Proof. By (1) in lemma 9, it takes O(n) flops to compute B 22,1 , which is of 2m × 2m. Therefore, O(n) flops are needed to compute B −1 22,1 . This is shown in the second part of the proof for lemma 9. By equation 5.4 and the second statement in lemma 9, B 11 is an m × m block matrix. The (i, j)th block in B 11 is expressed as equation 5.7, and O(n) flops are needed to compute these ij ij coefficients aij , bkl and zkl .
Natural Gradient Descent Method
2153
The computation of the natural gradient is based on theorem 1. Let e (∇l1 )w1 .. . e1= e ∇l (∇l1 )wm (∇l e 1 )a e 1) (∇l b e 1 . By theorem 1, be the partition of ∇l e 1 )w1 = (∇l
m X
e 1i ∂l1 − Y 11 ∂l1 − Y 12 ∂l1 . B ∂ wi ∂a ∂b i=1
Applying equations 5.6 and 5.7, we have e 1 )w1 = (∇l
¶ m µ X ∂l1 a1k +e a1k uk , ∂ wk k=1
(5.8)
e 1k in equation 5.7 and where a1k are the coefficients of B e a1k = −
m X
a1i v Tk
i=1
−
m µ X l=1
ykl 11
¶ m µ X ∂l1 1i T ∂l1 1i T ∂l1 + + zkl ul b v ∂ wi i,l=1 kl l ∂ wi ∂ wi
∂l1 ∂l1 + ykl 12 ∂ al ∂ bl
¶ .
e 1 )w1 is a linear combination of ∂l1 and uk , and it requires Therefore, (∇l ∂ wi a1k due to the vector inner product and the O(n) flops to compute a1k and e e 1 )wi . vector addition. This is also true for computing other (∇l 22 Let B = [C ij ][2×2] . Again applying equation 5.6, we have e 1 )a = (∇l
m X i=1
where
Y
T i1
Y Ti1
∂l1 ∂l1 ∂l1 + C 11 + C 12 , ∂ wi ∂a ∂b
(5.9)
" #T m m X X ∂l1 k1 T ∂l1 km T ∂l1 = yi1 uk ,···, yi1 uk ∂ wi ∂ wi ∂ wi k=1 k=1
involving the vector inner product uTk ∂∂lw1 i . Note that each C ij is of m × m. e 1 )a . This is also true for (∇l e 1) . So, it only needs O(n) flops to compute (∇l b Corollary 1. The time complexity for inverting the Fisher information matrix is O(n2 ). The time complexity for computing the natural gradient A(θ )−1 ∂l1 is ∂θ O(n).
2154
Howard Hua Yang and Shun-ichi Amari
In summary, it is easy to generalize formulas 5.8 and 5.9 to compute e 1 ) . For the multilayer perceptron, the natural gradient can e 1 )wi and (∇l (∇l b be computed in O(n) flops by these formulas. 6 Conclusion The complexity for computing the Fisher information matrix and the natural gradient is discussed. Based on the efficient scheme to represent the Fisher information matrix, the process of inverting the Fisher information matrix is examined in detail. The structure of the Fisher information matrix and the natural gradient for the multilayer perceptron is elaborated by this analysis. We show that the time complexity for inverting the Fisher information matrix is O(n2 ) and the time complexity for computing the natural gradient is O(n). Appendix A: Proof of Lemma 5 Extend {u1 , · · · , um } to {u1 , · · · , um , um+1 , · · · , un }, a basis in m, uj ⊥ L(u1 , · · · , um )
ujT uk
= δj,k (delta notation),
(A.1) for j, k = m + 1, · · · , n,
(A.2)
where L(u1 , · · · , um ) is the vector space spanned by {u1 , · · · , um }. Let U = [U 1 , um+1 , · · · , un ] and V = [V 1 , um+1 , · · · , un ]. Then noticing that each column of V 1 is in L(u1 , · · · , um ), we have V T U = I . Therefore, V T = U −1 and U V T = I, which implies that m X
ui v i +
i=1
n X
ui uTi = I ,
i=m+1
that is, Ä0 = I −
m X
n X
ui v i =
i=1
ui uTi .
(A.3)
i=m+1
P Let A = a0 Ä0 + ij aij ui vjT and B ∈ Mv . Each column of B is a linear P combination of {v 1 , · · · , v m }. Let b = m i=1 ci v i . Then
Ab = a0
n X i=m+1
ui uTi
+
m X i,j=1
aij ui vjT
m X i=1
ci v i
Natural Gradient Descent Method m X
=
2155
aij ck vjT v k ui
i,j,k=1
=
m X
ai ui ,
i=1
Pm aij ck vjT v k and O(n) flops are needed to compute these where ai = j,k=1 coefficients ai . Expressing AB as the following form, " m X
AB =
ai1 ui , · · · ,
i=1
m X
# aim ui
,
i=1
we need O(n) flops to compute the coefficients aji . Appendix B: Proof of Lemma P 6i Pm i Pm i Pm i Let A = [ m i=1 a1 ui , · · · , i=1 am ui ] and B = [ i=1 b1 v i , · · · , i=1 bm v i ]. P i j T b a v u , which is computed in Then the (k,l)th element of B T A is m j i,j=1 k l i O(n) flops.
Appendix C: Proof of Lemma 7 <m×m , P each column of AB has the form Ax for For A ∈ Mu and P B∈ m i i T x ∈ <m . Let A = [ m c u i=1 1 i , · · · , i=1 cm ui ] and x = [x1 , · · · , xm ] . Then
Ax =
à m m X X i=1
! cik xk
ui =
m X
bi ui ,
i=1
k=1
and m2 flops are needed to compute these coefficients bi . Hence, when AB is expressed as the following form,
AB =
" m X i=1
bi1 ui , · · · ,
m X
# bim ui
,
i=1
m3 flops are needed to compute the coefficients bji .
2156
Howard Hua Yang and Shun-ichi Amari
Appendix D: Proof of Lemma 8 B 11 has the same structure as that of the Fisher information matrix for 2 the committee machine. B −1 11 is an m × m block matrix, and each of its m blocks has the following form, e a0 Ä0 +
m X
T e clk ij ul v k ,
l,k=1
clk and the complexity for computing these coefficients e a0 and e ij does not depend on n. B 12 is an m × 2 block matrix with blocks belonging to Mv . By lemma 5, −1 B 11 B 12 is an m × 2 block matrix, and each of its 2m blocks belongs to Mu with the form of equation 5.5 and the complexity for computing xkl ij is O(n). Appendix E: Proof of Lemma 9 Applying lemmas 6 and 8, it takes O(n) flops to compute the 2m × 2m matrix B T12 (B −1 11 B 12 ), so O(n) flops are needed to compute B 22,1 = B 22 − B T12 (B −1 B ). 12 11 By lemmas 7 and 8,
−1 B −1 11 B 12 B 22,1
Y 12 .. , . Y m2
Y 11 .. = . Y m1
P Pm km k1 where Y ij = [ m k=1 yij uk , · · · , k=1 yij uk ]. Also by lemma 8,
B 21 B −1 11
=
B T12 B −1 11
=
T (B −1 11 B 12 )
· =
X T11 X T12
··· ···
X Tm1 X Tm2
¸ .
Hence, −1 −1 B −1 11 B 12 B 22,1 B 21 B 11 = [Z ij ][m×m] , P T T kl T + Y i2 X j2 = m where Z ij = Y i1 X j1 k,l=1 zij uk ul and O(n) flops are needed kl to compute the coefficients zij .
Natural Gradient Descent Method
2157
References Amari, S. (1997). Neural learning in structured parameter spaces—Natural Riemannian gradient. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 127–133) Cambridge, MA: MIT Press. Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276. Orr, G. B., & Leen, T. K. (1997). Using curvature information for fast stochastic search. In M. C. Mozer, M. I. Jordan & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press. Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243. Stewart, G. W. (1973). Introduction to matrix computations. New York: Academic Press. Yang, H. H., & Amari, S. (1997a). Natural gradient descent for training multilayer perceptrons. Unpublished manuscript. Yang, H. H., & Amari, S. (1997b). Training multi-layer perceptrons by natural gradient descent. In ICONIP’97 Proceedings, New Zealand. Yang, H. H., & Amari, S. (1998). The efficiency and the robustness of natural gradient descent learning rule. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press. Yang, H. H., Murata, N., & Amari, S. (1998). Statistical inference: Learning in artificial neural networks. Trends in Cognitive Sciences, 2(1), 4–10. Received August 11, 1997; accepted April 16, 1998.
LETTER
Communicated by Dana Ron
Almost Linear VC-Dimension Bounds for Piecewise Polynomial Networks Peter L. Bartlett Department of System Engineering, Australian National University, Canberra, ACT 0200, Australia
Vitaly Maiorov Department of Mathematics, Technion, Haifa 32000, Israel
Ron Meir Department of Electrical Engineering, Technion, Haifa 32000, Israel
We compute upper and lower bounds on the VC dimension and pseudo-dimension of feedforward neural networks composed of piecewise polynomial activation functions. We show that if the number of layers is fixed, then the VC dimension and pseudo-dimension grow as W log W, where W is the number of parameters in the network. This result stands in opposition to the case where the number of layers is unbounded, in which case the VC dimension and pseudo-dimension grow as $W^2$. We combine our results with recently established approximation error rates and determine error bounds for the problem of regression estimation by piecewise polynomial networks with unbounded weights.

1 Motivation

It is a well-known result of much recent work that in order to derive useful performance bounds for classification, regression, and time-series prediction, use must be made of both approximation and estimation error bounds. Approximation errors result from using functional classes of limited complexity (such as neural networks with a finite number of computational units), and estimation errors are caused by the finiteness of the sample used for learning. In order to achieve good performance bounds, both the approximation and the estimation errors need to be made small. Unfortunately, the demands required in order to reduce these two terms are usually conflicting. For example, in order to derive good estimation error bounds, one often assumes that the parameters are restricted to a bounded domain (see, for example, Bartlett, 1998), which may be allowed to increase with the sample size, as in the method of sieves (Geman & Hwang, 1982). However, imposing such boundedness constraints leads to great difficulties in establishing approximation error bounds, which in general have not been surmounted

© 1998 Massachusetts Institute of Technology. Neural Computation 10, 2159–2173 (1998)
to date. Recently it has been shown that the VC dimension of many types of neural networks with continuous activation functions is finite even without imposing any conditions on the magnitudes of the parameters (Macintyre & Sontag, 1993; Goldberg & Jerrum, 1995; Karpinski & Macintyre, 1997). Since there is a close connection between the VC dimension and the estimation error (see section 4), this result is significant in the context of learning. Thus, as long as the function itself is bounded, one may proceed to derive good upper bounds for the covering numbers needed in establishing the estimation error bounds. These results can then be combined with available results for the approximation error, the derivation of which is greatly facilitated by avoiding the constraints on the parameters. In this article we consider a specific class of multilayered feedforward neural networks composed of piecewise polynomial activation functions. Since it is known (Leshno, Lin, Pinkus, & Schocken, 1993) that a necessary and sufficient condition for universal approximation of continuous functions over compacta by neural networks is that the transfer functions be nonpolynomial, it is interesting to consider the "minimal" complexity transfer functions that guarantee this universality. It has recently been shown (Maiorov & Meir, 1997) that the rate of approximation of the Sobolev space $W_r$ (with a constraint on r; see section 4) by functions of this type is optimal up to a logarithmic factor, when the parameters of the polynomials are unrestricted in size. In order to complete the derivation of total error bounds, use must be made of results for the covering number of the above class of functions. From the results of Pollard (1984) and Haussler (1992, 1995), these numbers may be upper bounded if upper bounds are available for the pseudo-dimension. We establish upper and lower bounds for both the VC dimension and the pseudo-dimension of such networks. For completeness, we give definitions of these terms in the next section. Let $F$ be the class of binary-valued functions computed by feedforward neural networks with piecewise polynomial activation functions, such that there are k computational (noninput) units and W weights. Goldberg and Jerrum (1995) have shown that $\mathrm{VCdim}(F) \le c_1(W^2 + Wk) = O(W^2)$, where $c_1$ is a constant. Moreover, Koiran and Sontag (1997) have demonstrated that in fact $\mathrm{VCdim}(F) \ge c_2 W^2 = \Omega(W^2)$, which would lead one to conclude that the bounds are in fact tight up to a constant. However, the proof they used to establish the lower bound made use of the fact that the number of layers is unbounded. In practical applications, this number is often a small constant. Thus, the question remains as to whether it is possible to obtain a better bound in the realistic scenario where the number of layers is fixed. The contribution of this work is the proof of upper and lower bounds on the VC dimension and pseudo-dimension of piecewise polynomial nets. As we will see, the upper bound behaves as $O(WL^2 + WL \log WL)$, where L is the number of layers. (We use the "big-oh" notation $f(n, m) = O(g(n, m))$ to imply the existence of a constant c and integers $n_0$ and $m_0$ such that $f(n, m) \le c\,g(n, m)$ for all $n > n_0$ and $m > m_0$.) If L is fixed, this is $O(W \log W)$,
which is superior to the currently available result, which behaves as $O(W^2)$. Moreover, using ideas from Goldberg and Jerrum (1995) and Koiran and Sontag (1997), we are able to derive a lower bound on the VC dimension, which is $\Omega(WL)$ for $L = O(W)$. Maass (1994) shows that three-layer networks with threshold activation functions and binary inputs have VC dimension $\Omega(W \log W)$, and Sakurai (1993) shows that this is also true for two-layer networks with threshold activation functions and real inputs. It is easy to show that these results imply similar lower bounds if the threshold activation function is replaced by any piecewise polynomial activation function f that has bounded and distinct limits $\lim_{x \to -\infty} f(x)$ and $\lim_{x \to \infty} f(x)$. We thus conclude that if the number of layers L is fixed, the VC dimension of piecewise polynomial networks with $L \ge 2$ layers and real inputs, and of piecewise polynomial networks with $L \ge 3$ layers and binary inputs, grows as $W \log W$. Combining this result with the approximation error bounds mentioned above, we are able to derive total error bounds that are much better than currently available results for sigmoidal networks.

2 Upper Bounds

We begin the technical discussion with a precise definition of the VC-dimension and the pseudo-dimension. For more details and examples see Devroye, Györfi, and Lugosi (1996) and Vidyasagar (1996).

Definition 1. Let X be a set and $\mathcal{A}$ a system of subsets of X. A set $S = \{x_1, \ldots, x_n\}$ is shattered by $\mathcal{A}$ if, for every subset $B \subseteq S$, there exists a set $A \in \mathcal{A}$ such that $S \cap A = B$. The VC-dimension of $\mathcal{A}$, denoted by $\mathrm{VCdim}(\mathcal{A})$, is the largest integer n such that there exists a set of cardinality n that is shattered by $\mathcal{A}$.

Intuitively, the VC-dimension measures the size, n, of the largest set of points for which all possible $2^n$ labelings may be achieved by sets $A \in \mathcal{A}$. It is often convenient to talk about the VC-dimension of classes of indicator functions $F$. In this case we simply identify the sets of points $x \in X$ for which $f(x) = 1$ with the subsets of $\mathcal{A}$, and use the notation $\mathrm{VCdim}(F)$. For classes of real-valued functions, one defines the pseudo-dimension as follows:

Definition 2. Let X be a set and $F \subseteq [0, B]^X$. A set $S = \{x_1, \ldots, x_n\}$ is P-shattered by $F$ if there exists a real vector $c \in [0, B]^n$ such that, for every binary vector $e \in \{0, 1\}^n$, there exists a corresponding function $f_e \in F$ such that
$$f_e(x_i) \ge c_i \text{ if } e_i = 1, \quad\text{and}\quad f_e(x_i) < c_i \text{ if } e_i = 0.$$
The pseudo-dimension of $F$, denoted by $\mathrm{Pdim}(F)$, is the largest integer n such that there exists a set of cardinality n that is P-shattered by $F$.
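As a toy illustration of definition 1 (our own example, not from the article), the following Python sketch enumerates the dichotomies that the one-parameter class $\{x \mapsto 1[x \ge t]\}$ realizes on a candidate set, confirming that its VC-dimension is 1.

```python
import numpy as np

def achieved_labelings(points, thresholds):
    """All dichotomies that x -> 1[x >= t] realizes on `points` as t varies."""
    return {tuple(int(x >= t) for x in points) for t in thresholds}

def is_shattered(points, thresholds):
    return len(achieved_labelings(points, thresholds)) == 2 ** len(points)

ts = np.linspace(-3.0, 3.0, 601)        # a fine grid stands in for all t
print(is_shattered([0.5], ts))          # True: any single point is shattered
print(is_shattered([-1.0, 1.0], ts))    # False: labeling (1, 0) is unattainable
```

Since every singleton is shattered but no pair is, the VC-dimension of this class is exactly 1.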
For any class $F$ of real-valued functions, let $\mathrm{sgn}(F)$ be the class of functions obtained by taking the sign of functions in $F$, where $\mathrm{sgn}(u) = 1$ if $u \ge 0$ and zero otherwise. It is then clear from the definition that $\mathrm{VCdim}(\mathrm{sgn}(F)) \le \mathrm{Pdim}(F)$, implying that a lower bound on the VC-dimension of $\mathrm{sgn}(F)$ immediately yields a lower bound on the pseudo-dimension of $F$.

We characterize the class of networks considered in this work. A feedforward multilayer network is a directed acyclic graph that represents a parameterized real-valued function of d real inputs. Each node is called an input unit or a computation unit. The computation units are arranged in L layers. Edges are allowed from input units to computation units. There can also be an edge from a computation unit to another computation unit, but only if the first unit is in a lower layer than the second. There is a single unit in the final layer, called the output unit. Each input unit has an associated real value, which is one of the components of the input vector $x \in \mathbb{R}^d$. Each computation unit has an associated real value, called the unit's output value. Each edge has an associated real parameter, as does each computation unit. The output of a computation unit is given by $\sigma(\sum_e w_e z_e + w_0)$, where the sum ranges over the set of edges leading to the unit, $w_e$ is the parameter (weight) associated with edge e, $z_e$ is the output value of the unit from which edge e emerges, $w_0$ is the parameter (bias) associated with the unit, and $\sigma : \mathbb{R} \to \mathbb{R}$ is called the activation function of the unit. The argument of $\sigma$ is called the net input of the unit. We suppose that in each unit except the output unit, the activation function is a fixed piecewise polynomial function of the form
$$\sigma(u) = \phi_i(u) \quad \text{for } u \in [t_{i-1}, t_i),$$
for $i = 1, \ldots, p + 1$ (and set $t_0 = -\infty$ and $t_{p+1} = \infty$), where each $\phi_i$ is a polynomial of degree no more than l. We say that $\sigma$ has p break points and degree l. The activation function in the output unit is the identity function. Let $k_i$ denote the number of computational units in layer i, and suppose there is a total of W parameters (weights and biases) and k computational units ($k = k_1 + k_2 + \cdots + k_{L-1} + 1$). For input x and parameter vector $a \in A = \mathbb{R}^W$, let $f(x, a)$ denote the output of this network, and let $F = \{x \mapsto f(x, a) : a \in \mathbb{R}^W\}$ denote the class of functions computed by such an architecture, as we vary the W parameters. We first discuss the computation of the VC-dimension, and thus consider the class of functions $\mathrm{sgn}(F) = \{x \mapsto \mathrm{sgn}(f(x, a)) : a \in \mathbb{R}^W\}$. Before giving the main theorem of this section, we present the following result, which is a slight improvement (see Anthony & Bartlett, 1998, chap. 8) of a result due to Warren (1968).
Lemma 1. Suppose $f_1(\cdot), f_2(\cdot), \ldots, f_m(\cdot)$ are fixed polynomials of degree at most l in $n \le m$ variables. Then the number of distinct sign vectors $\{\mathrm{sgn}(f_1(a)), \ldots, \mathrm{sgn}(f_m(a))\}$ that can be generated by varying $a \in \mathbb{R}^n$ is at most $2(2eml/n)^n$.

We then have our main result:

Theorem 1. For any positive integers W, $k \le W$, $L \le W$, l, and p, consider a network with real inputs, up to W parameters, up to k computational units arranged in L layers, a single output unit with the identity activation function, and all other computation units with piecewise polynomial activation functions of degree l and with p break points. Let $F$ be the class of real-valued functions computed by this network. Then
$$\mathrm{VCdim}(\mathrm{sgn}(F)) \le 2WL \log(2eWLpk) + 2WL^2 \log(l+1) + 2L.$$

Since L and k are $O(W)$, for fixed l and p this implies that $\mathrm{VCdim}(\mathrm{sgn}(F)) = O(WL \log W + WL^2)$.

Before presenting the proof, we outline the main idea in the construction. For any fixed input x, the output of the network $f(x, a)$ corresponds to a piecewise polynomial function in the parameters a, of degree no larger than $(l+1)^{L-1}$ (recall that the last layer is linear). Thus, the parameter domain $A = \mathbb{R}^W$ can be split into regions, in each of which the function $f(x, \cdot)$ is polynomial. From lemma 1, it is possible to obtain an upper bound on the number of sign assignments that can be attained by varying the parameters of a set of polynomials. The theorem will be established by combining this bound with a bound on the number of regions.

Proof of Theorem 1. For an arbitrary choice of m points $x_1, x_2, \ldots, x_m$, we wish to bound
$$K = \left|\left\{ (\mathrm{sgn}(f(x_1, a)), \ldots, \mathrm{sgn}(f(x_m, a))) : a \in A \right\}\right|.$$
Fix these m points, and consider a partition $\{S_1, S_2, \ldots, S_N\}$ of the parameter domain A. Clearly
$$K \le \sum_{i=1}^{N} \left|\left\{ (\mathrm{sgn}(f(x_1, a)), \ldots, \mathrm{sgn}(f(x_m, a))) : a \in S_i \right\}\right|.$$
We choose the partition so that within each region $S_i$, $f(x_1, \cdot), \ldots, f(x_m, \cdot)$ are all fixed polynomials of degree no more than $(l+1)^{L-1}$. Then, by lemma 1, each term in the sum above is no more than
$$2\left(\frac{2em(l+1)^{L-1}}{W}\right)^W. \tag{2.1}$$
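For a feel of the magnitudes involved, the following sketch (our own numeric illustration; the parameter settings are arbitrary) evaluates the theorem 1 bound next to a generic $W^2 + Wk$ bound for a fixed number of layers.

```python
import math

def vcdim_upper_bound(W, L, k, p, l):
    """Theorem 1: 2WL*log2(2e*W*L*p*k) + 2W*L^2*log2(l+1) + 2L."""
    return (2 * W * L * math.log2(2 * math.e * W * L * p * k)
            + 2 * W * L ** 2 * math.log2(l + 1) + 2 * L)

for W in (10 ** 3, 10 ** 4, 10 ** 5):
    L, k, p, l = 3, W // 10, 1, 1        # fixed depth, piecewise-linear units
    print(f"W={W:>6}:  W log W-type bound {vcdim_upper_bound(W, L, k, p, l):.3g}"
          f"  vs  W^2-type bound {W ** 2 + W * k:.3g}")
```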
The only remaining point is to construct the partition and determine an upper bound on its size. The partition is constructed recursively, using the following procedure. Let $\mathcal{S}_1$ be a partition of A such that, for all $S \in \mathcal{S}_1$, there are constants $b_{h,i,j} \in \{0, 1\}$ for which
$$\mathrm{sgn}(p_{h,x_j}(a) - t_i) = b_{h,i,j} \quad \text{for all } a \in S,$$
where $j \in \{1, \ldots, m\}$, $h \in \{1, \ldots, k_1\}$, and $i \in \{1, \ldots, p\}$. Here $t_i$ are the break points of the piecewise polynomial activation functions, and $p_{h,x_j}$ is the affine function describing the net input to the hth unit in the first layer, in response to $x_j$. That is, $p_{h,x_j} = a_h \cdot x_j + a_{h,0}$, where $a_h \in \mathbb{R}^d$, $a_{h,0} \in \mathbb{R}$ are the weights of the hth unit in the first layer. Note that the partition $\mathcal{S}_1$ is determined solely by the parameters corresponding to the first hidden layer, as the input to this layer is unaffected by the other parameters. Clearly, for $a \in S$, the output of any first layer unit in response to an $x_j$ is a fixed polynomial in a. Now, let $W_1, \ldots, W_L$ be the number of variables used in computing the unit outputs up to layer $1, \ldots, L$, respectively (so $W_L = W$), and let $k_1, \ldots, k_L$ be the number of computation units in layer $1, \ldots, L$, respectively (recall that $k_L = 1$). Then we can choose $\mathcal{S}_1$ so that $|\mathcal{S}_1|$ is no more than the number of sign assignments possible with $mk_1p$ affine functions in $W_1$ variables. Lemma 1 shows that
$$|\mathcal{S}_1| \le 2\left(\frac{2emk_1p}{W_1}\right)^{W_1}.$$
Now, we define $\mathcal{S}_n$ (for $n > 1$) as follows. Assume that for all S in $\mathcal{S}_{n-1}$ and all $x_j$, the net input of every unit in layer n in response to $x_j$ is a fixed polynomial function of $a \in S$, of degree no more than $(l+1)^{n-1}$. Let $\mathcal{S}_n$ be a partition of A that is a refinement of $\mathcal{S}_{n-1}$ (that is, for all $S \in \mathcal{S}_n$, there is an $S' \in \mathcal{S}_{n-1}$ with $S \subseteq S'$), such that for all $S \in \mathcal{S}_n$ there are constants $b_{h,i,j} \in \{0, 1\}$ such that
$$\mathrm{sgn}(p_{h,x_j}(a) - t_i) = b_{h,i,j} \quad \text{for all } a \in S, \tag{2.2}$$
where $p_{h,x_j}$ is the polynomial function describing the net input of the hth unit in the nth layer, in response to $x_j$, when $a \in S$. Since $S \subseteq S'$ for some $S' \in \mathcal{S}_{n-1}$, equation 2.2 implies that the output of each nth layer unit in response to an $x_j$ is a fixed polynomial in a of degree no more than $l(l+1)^{n-1}$, for all $a \in S$. Finally, we can choose $\mathcal{S}_n$ such that for all $S' \in \mathcal{S}_{n-1}$ we have $|\{S \in \mathcal{S}_n : S \subseteq S'\}|$ no more than the number of sign assignments of $mk_np$ polynomials in $W_n$ variables of degree no more than $(l+1)^{n-1}$, and by lemma 1 this is no more than
$$2\left(\frac{2emk_np(l+1)^{n-1}}{W_n}\right)^{W_n}.$$
Notice also that the net input of every unit in layer $n+1$ in response to $x_j$ is a fixed polynomial function of $a \in S \in \mathcal{S}_n$ of degree no more than $(l+1)^n$. Proceeding in this way, we get a partition $\mathcal{S}_{L-1}$ of A such that for $S \in \mathcal{S}_{L-1}$ the network output in response to any $x_j$ is a fixed polynomial of $a \in S$ of degree no more than $l(l+1)^{L-2}$. Furthermore,
$$|\mathcal{S}_{L-1}| \le 2\left(\frac{2emk_1p}{W_1}\right)^{W_1} \prod_{i=2}^{L-1} 2\left(\frac{2emk_ip(l+1)^{i-1}}{W_i}\right)^{W_i} \le \prod_{i=1}^{L-1} 2\left(\frac{2emk_ip(l+1)^{i-1}}{W_i}\right)^{W_i}.$$
Multiplying by the bound (see equation 2.1) gives the result
$$K \le \prod_{i=1}^{L} 2\left(\frac{2emk_ip(l+1)^{i-1}}{W_i}\right)^{W_i}.$$
Since the points $x_1, \ldots, x_m$ were chosen arbitrarily, this gives a bound on the maximal number of dichotomies induced by $a \in A$ on m points. An upper bound on the VC-dimension is then obtained by computing the largest value of m for which this number is at least $2^m$, yielding
$$m < L + \sum_{i=1}^{L} W_i \log\left(\frac{2empk_i(l+1)^{i-1}}{W_i}\right) < L\left[1 + (L-1)W\log(l+1) + W\log(2empk)\right],$$
where all logarithms are to the base 2. We conclude (see, for example, Vidyasagar, 1996, lemma 4.4) that
$$\mathrm{VCdim}(\mathrm{sgn}(F)) \le 2L\left[(L-1)W\log(l+1) + W\log(2eWLpk) + 1\right].$$
Theorem 1 can be immediately extended to obtain results for the pseudo-dimension of the class of piecewise polynomial networks. As explained, for example, in Vidyasagar (1996), the pseudo-dimension of a neural network $F$ is identical to the VC-dimension of an equivalent network with an additional scalar parameter input to the output unit. Since each one of the piecewise polynomial activation functions in the network considered already includes a free scalar input, which contributes to the output unit, it is clear that no additional freedom is obtained by adding a further scalar parameter. We thus conclude that the bound of theorem 1 is also an upper bound for the pseudo-dimension of $F$.

3 Lower Bound

We now compute a lower bound on the VC-dimension (and thus also on the pseudo-dimension) of neural networks with continuous activation functions.
This result generalizes the lower bound in Koiran and Sontag (1997) since it holds for any number of layers.

Theorem 2. Suppose $f : \mathbb{R} \to \mathbb{R}$ has the following properties:

1. $\lim_{\alpha \to \infty} f(\alpha) = 1$ and $\lim_{\alpha \to -\infty} f(\alpha) = 0$, and
2. f is differentiable at some point $x_0$ with derivative $f'(x_0) \ne 0$.

Then for any $L \ge 1$ and $W \ge 10L - 14$, there is a feedforward network with the following properties: The network has L layers and W parameters, the output unit is a linear unit, all other computation units have activation function f, and the set $\mathrm{sgn}(F)$ of functions computed by the network has
$$\mathrm{VCdim}(\mathrm{sgn}(F)) \ge \left\lfloor \frac{W}{2} \right\rfloor \left\lfloor \frac{L}{2} \right\rfloor,$$
where $\lfloor u \rfloor$ is the largest integer less than or equal to u.

Proof. As in Koiran and Sontag (1997), the proof follows that of theorem 2.5 in Goldberg and Jerrum (1995), but we show how the functions they described can be computed by a network, and keep track of the number of parameters and layers required. We first prove the lower bound for a network containing linear threshold units and linear units (with the identity activation function), and then show that all except the output unit can be replaced by units with activation function f, and the resulting network still shatters the same set. Fix positive integers $M, N \in \mathbb{N}$. We now construct a set of MN points, which may be shattered by a network with $O(N)$ weights and $O(M)$ layers. Let $\{a_i\}$, $i = 1, 2, \ldots, N$ denote a set of N parameters, where each $a_i \in [0, 1)$ has an M-bit binary representation,
$$a_i = \sum_{j=1}^{M} 2^{-j} a_{i,j}, \qquad a_{i,j} \in \{0, 1\},$$
that is, the M-bit base two representation of $a_i$ is $a_i = 0.a_{i,1}a_{i,2}\cdots a_{i,M}$. We will consider inputs in $B_N \times B_M$, where $B_N = \{e_i : 1 \le i \le N\}$, $e_i \in \{0, 1\}^N$ has ith bit 1 and all other bits 0, and $B_M$ is defined similarly. We show how to extract the bits of the $a_i$, so that for input $x = (e_l, e_m)$, the network outputs $a_{l,m}$. Since there are NM inputs of the form $(e_l, e_m)$, and $a_{l,m}$ can take on all possible $2^{MN}$ values, the result will follow. There are three stages to the computation of $a_{l,m}$: (1) computing $a_l$, (2) extracting $a_{l,k}$ from $a_l$, for every k, and (3) selecting $a_{l,m}$ among the $a_{l,k}$s.
Suppose the network input is $x = ((u_1, \ldots, u_N), (v_1, \ldots, v_M)) = (e_l, e_m)$. Using one linear unit, we can compute $\sum_{i=1}^{N} u_i a_i = a_l$. This involves $N + 1$ parameters and one computation unit in one layer. In fact, we need only N parameters, but we need the extra parameter when we show that this linear unit can be replaced by a unit with activation function f. Consider the parameter $c_k = 0.a_{l,k}\cdots a_{l,M}$, that is,
$$c_k = \sum_{j=k}^{M} 2^{k-1-j} a_{l,j},$$
for $k = 1, \ldots, M$. Since $c_k \ge 1/2$ iff $a_{l,k} = 1$, clearly $\mathrm{sgn}(c_k - 1/2) = a_{l,k}$ for all k. Also, $c_1 = a_l$ and $c_k = 2c_{k-1} - a_{l,k-1}$. Thus, consider the recursion
$$c_k = 2c_{k-1} - a_{l,k-1}, \qquad a_{l,k} = \mathrm{sgn}(c_k - 1/2),$$
with initial conditions $c_1 = a_l$ and $a_{l,1} = \mathrm{sgn}(a_l - 1/2)$. Clearly, we can compute $a_{l,1}, \ldots, a_{l,M-1}$ and $c_2, \ldots, c_{M-1}$ in another $2(M-2) + 1$ layers, using $5(M-2) + 2$ parameters in $2(M-2) + 1$ computational units (cf. Figure 1). We could compute $a_{l,M}$ in the same way, but the following approach gives fewer layers. Set
$$b = \mathrm{sgn}\left(2c_{M-1} - a_{l,M-1} - \sum_{i=1}^{M-1} v_i\right).$$
If $m \ne M$, then $b = 0$. If $m = M$, then the input vector $(v_1, \ldots, v_M) = e_M$, and thus $\sum_{i=1}^{M-1} v_i = 0$, implying that $b = \mathrm{sgn}(c_M) = \mathrm{sgn}(0.a_{l,M}) = a_{l,M}$. In order to conclude the proof, we need to show how the variables $a_{l,m}$ may be recovered, depending on the inputs $(v_1, v_2, \ldots, v_M)$. Let $\vee$ and $\wedge$ denote the boolean operations of disjunction and conjunction, respectively. We then have
$$a_{l,m} = b \vee \bigvee_{i=1}^{M-1} (a_{l,i} \wedge v_i).$$
Since for boolean x and y, $x \wedge y = \mathrm{sgn}(x + y - 3/2)$, and $\bigvee_{i=1}^{M} x_i = \mathrm{sgn}(\sum_{i=1}^{M} x_i - 1/2)$, we see that the computation of $a_{l,m}$ involves an additional 5M parameters in $M + 1$ computational units and adds another two layers. In total, there are 2M layers and $10M + N - 7$ parameters, and the network shatters a set of size NM. Clearly we can add parameters and layers without affecting the function of the network. So for any $L, W \in \mathbb{N}$, we can set $M = \lfloor L/2 \rfloor$ and $N = W + 7 - 10M$, which is at least $\lfloor W/2 \rfloor$ provided $W \ge 10L - 14$. In that case, the VC-dimension is at least
$$\left\lfloor \frac{W}{2} \right\rfloor \left\lfloor \frac{L}{2} \right\rfloor.$$
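The bit-extraction circuit in this proof is easy to simulate directly. The sketch below (ours, not from the paper) verifies that the construction recovers every bit $a_{l,m}$; the only liberty taken is an extra bias of $-1/4$ in the unit computing b, since with the convention $\mathrm{sgn}(0) = 1$ used here the raw comparison $\mathrm{sgn}(c_M)$ would misfire when $a_{l,M} = 0$ (any constant in $(0, 1/2)$ works).

```python
import numpy as np

def sgn(u):                                       # sgn(u) = 1 if u >= 0, else 0
    return 1 if u >= 0 else 0

def network(u, v, a_bits):
    """u = e_l in B_N, v = e_m in B_M; returns a_{l,m} using sums and sgn only."""
    N, M = len(u), len(v)
    a_l = sum(u[i] * sum(a_bits[i][j] * 2.0 ** -(j + 1) for j in range(M))
              for i in range(N))                  # one linear unit computes a_l
    c, bits = a_l, [sgn(a_l - 0.5)]               # c_1 = a_l and a_{l,1}
    for _ in range(M - 2):                        # c_k = 2c_{k-1} - a_{l,k-1}
        c = 2 * c - bits[-1]
        bits.append(sgn(c - 0.5))                 # a_{l,k} = sgn(c_k - 1/2)
    b = sgn(2 * c - bits[-1] - sum(v[:M - 1]) - 0.25)         # last-bit shortcut
    ands = [sgn(bits[i] + v[i] - 1.5) for i in range(M - 1)]  # a_{l,i} AND v_i
    return sgn(b + sum(ands) - 0.5)               # b OR (OR_i a_{l,i} AND v_i)

N, M = 4, 5
a_bits = np.random.default_rng(1).integers(0, 2, size=(N, M))
for l in range(N):
    for m in range(M):
        u = [int(i == l) for i in range(N)]
        v = [int(j == m) for j in range(M)]
        assert network(u, v, a_bits) == a_bits[l][m]
print("all", N * M, "bits recovered")
```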
Figure 1: Network of linear threshold units and linear units for the lower bound. (The diagram shows the inputs $e_l$ and $e_m$ feeding units that compute $a_l, a_{l,1}, c_2, \ldots, c_{M-1}, a_{l,M-1}$, then $a_{l,M}$, and finally $a_{l,m}$.)
The network just constructed uses linear threshold units and linear units. However, it is easy to show (see Koiran & Sontag, 1997, theorem 5) that each unit except the output unit can be replaced by a unit with activation function f so that the network still shatters the set of size MN. For linear units, the input and output weights are scaled so that the linear function can be approximated to sufficient accuracy by f in the neighborhood of the point $x_0$. For linear threshold units, the input weights are scaled so that the behavior of f at infinity accurately approximates a linear threshold function.
Remark. Note that the input to the network constructed in the proof of theorem 2 was a vector of dimension $N + M$; its size scales linearly with the number of weights and layers. In fact, it is possible to obtain the same general result as above using scalar inputs. Similarly to Koiran and Sontag (1997, proposition 1), one first constructs a set of MN points in $\mathbb{R}^2$ that can be shattered and then proceeds to show that this result can be applied also for inputs in $\mathbb{R}$ (Koiran & Sontag, 1997, theorem 1). The only modification to the proof presented above is that in the first layer, we will need $3(N-1) + 1$ parameters, $N - 1$ threshold gates, and a single linear gate instead of the $N + 1$ parameters and single linear gate used in our proof. The remainder of the proof is unchanged. The only difference in the final bound is a multiplicative factor.

4 Application to Learning

We briefly describe the application of our results to the problem of learning, focusing on the case of regression. The case of classification can be treated similarly. Consider then the problem of estimating a regression function $E(Y|X = x)$ from a set of n independently and identically distributed samples $\{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, each drawn from a probability distribution $P(X, Y)$, $|Y| \le M < \infty$. In particular, we focus for simplicity on a learning algorithm that constructs a function $\hat{f}_n \in F$ by minimizing the empirical error,
$$\hat{L}_n(f) = \sum_{i=1}^{n} \left|Y_i - f(X_i)\right|^2,$$
over $f \in F$, where $|f| \le M$. We consider the class of feedforward neural networks with a single hidden layer and an output nonlinearity restricting the range of the function to $[-M, M]$. Specifically, let
$$f(x) = \pi_M\left(\sum_{i=1}^{k} c_i \sigma(a_i^T x + b_i) + c_0\right),$$
where the activation function $\sigma(\cdot)$ is assumed to be piecewise polynomial, and the output unit activation function $\pi_M(\cdot)$ is given by
$$\pi_M(\alpha) = \begin{cases} -M & \text{if } \alpha < -M, \\ M & \text{if } \alpha > M, \\ \alpha & \text{otherwise}. \end{cases}$$
The need to restrict the range of the functions f arises here for technical reasons having to do with the derivation of the estimation error (see, e.g., Haussler, 1992). Previous work along these lines has relied on bounded parameter values. In particular, Barron (1994) bounds the parameters c, a, and b and imposes a Lipschitz constraint on the activation function $\sigma$, while
Haussler (1992) constrains the parameters c so that $\sum_i |c_i| \le \beta < \infty$, and assumes that the activation function $\sigma$ is nondecreasing and bounded (see also Lee et al., 1996; Lugosi & Zeger, 1995; Bartlett, 1998). We do not assume any constraints on the parameter values and do not require boundedness of the activation function $\sigma$. The expected loss of the resulting estimate $\hat{f}_n$ is given by $L(\hat{f}_n) = E|Y - \hat{f}_n(X)|^2$, where the expectation is taken with respect to X, Y. Standard results from the theory of uniform convergence of empirical measures (Pollard, 1984), together with bounds on certain covering numbers in terms of the pseudo-dimension (Haussler, 1995, corollary 3), show that
$$E L(\hat{f}_n) \le s^2 + c_1 \inf_{f \in F} \tilde{L}(f) + c_2 \frac{M\,\mathrm{Pdim}(F) \log n}{n}, \tag{4.1}$$
where $c_1$ and $c_2$ are absolute constants, $s^2 = E\{(Y - E(Y|X))^2\}$ is the noise variance, and $\tilde{L}(f) = E\{|E(Y|X) - f(X)|^2\}$. The quantity $\inf_{f \in F} \tilde{L}(f)$ in equation 4.1 is referred to as the approximation error and depends on the actual regression function $E(Y|X = x)$. Note that in equation 4.1 we have used a result from Haussler (1992, sect. 2.4), where the bound relies on a choice of $c_1 > 1$. As discussed in Haussler (1992), it is possible to set $c_1 = 1$, which shows that the expected loss converges to the approximation error plus noise variance, but this is at the cost of increasing the third term to $c_2' M\left(\mathrm{Pdim}(F) \log n / n\right)^{1/2}$. In cases where the approximation error $\inf_{f \in F} \tilde{L}(f)$ can be made arbitrarily small, it is clear that the former bound, namely equation 4.1, is superior. In order to derive an upper bound on the expected loss, use must be made of upper bounds on the approximation error as well as upper bounds on the pseudo-dimension $\mathrm{Pdim}(F)$, which we computed in section 2. The approximation error term can be handled using techniques studied in Maiorov and Meir (1997), where the assumption is made that the regression function belongs to the Sobolev space $W_r$, containing all functions $f : \mathbb{R}^d \to \mathbb{R}$ with square integrable derivatives up to order $r > d/2$. In particular, we consider the subset $W_r^K$ of functions $f \in W_r$ satisfying $\|f^{(r)}\|_{L_2} \le K$. Since $r > d/2$, well-known embedding theorems from the theory of Sobolev spaces (see corollary 5.16 in Adams, 1975) imply that such a function f is in fact uniformly bounded. Standard techniques show that there is a bound $M(K)$ so that, for all $f \in W_r^K$ and all $x \in \mathbb{R}^d$, $|f(x)| \le M(K)$ (see, for example, section 11 in Kantorovich & Akilov, 1964). For a given K, if the regression function is in $W_r^K$ and the network class $F$ is as described above, with k hidden units and an output bound satisfying $M \ge M(K)$, it has been shown in Maiorov and Meir (1997) that
$$\inf_{f \in F} \tilde{L}(f) \le c \frac{\log k}{k^{r/d}},$$
where d is the Euclidean dimension and c is a constant that depends only on K, r, and d. We note that the result in Maiorov and Meir (1997) was proved for unbounded functions of the form $\sum_i c_i \sigma(a_i^T x + b_i)$. However, it is easy to show that if the regression function is uniformly bounded by $M(K)$, then the same approximation rate holds for the class of bounded functions $\pi_M(\sum_i c_i \sigma(a_i^T x + b_i))$, $M \ge M(K)$, which is considered here. This result is only slightly worse than the best result available (Mhaskar, 1996) for the standard sigmoidal activation function $\sigma(u) = 1/(1 + e^{-u})$, where the approximation error is upper bounded by $c/k^{r/d}$. Although the constraint $r > d/2$ is rather restrictive, it is required in order to yield bounded functions. In future work we hope to remove this restriction, but since this leads to unbounded functions, new techniques might be necessary for the estimation error bounds. A final comment is in order concerning the use of the pseudo-dimension in equation 4.1. It can be shown that for the hard-limiting activation function $\pi_M$, the pseudo-dimension yields nearly optimal bounds for the estimation error, since in that case the pseudo-dimension is essentially equivalent to another combinatorial dimension, called the fat-shattering dimension, which gives nearly matching lower bounds on the estimation error (see Bartlett, Long, & Williamson, 1996). Combining the two results, we obtain the total error bound,
$$E L(\hat{f}_n) \le s^2 + c \frac{\log k}{k^{r/d}} + c' \frac{\mathrm{Pdim}(F) \log n}{n}, \tag{4.2}$$
where c and c' are constants that depend only on K. By allowing the output bound M to increase suitably slowly as n increases, we see that for any $f \in W_r$, the total error decreases as indicated by equation 4.2, but with constants that depend on f. In the case of a single hidden-layer network with k hidden units, the number of parameters W is given by $W = k(d+2) + 1 = O(k)$. Thus, we observe that the final term in equation 4.2 is of order $O(k \log k \log n / n)$. For the standard sigmoid, we have from Karpinski and Macintyre (1997) that $\mathrm{Pdim} = O(W^2 k^2) = O(k^4)$, and thus this term is $O(k^4 \log n / n)$. Keeping in mind that the approximation error terms are identical up to a logarithmic factor $\log k$, we can see that the guaranteed upper bound on the rates of convergence of the loss in the case of piecewise polynomial activation functions is much faster. It is an interesting open problem to determine if piecewise polynomial networks have smaller error in this sense.
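For concreteness, here is a minimal rendering (ours, not code from the paper; the ReLU-style activation, dimensions, and random parameters are arbitrary choices satisfying the stated assumptions) of the network class used in this section, together with the $O(k)$ parameter count quoted above.

```python
import numpy as np

def sigma(u):                        # piecewise polynomial: degree l = 1, one break point
    return np.maximum(u, 0.0)

def pi_M(alpha, M):                  # output clipping to [-M, M]
    return np.clip(alpha, -M, M)

def f(x, a, b, c, c0, M):
    """pi_M( sum_i c_i sigma(a_i^T x + b_i) + c_0 ) with a of shape (k, d)."""
    return pi_M(c @ sigma(a @ x + b) + c0, M)

rng = np.random.default_rng(0)
d, k, M = 4, 8, 2.0
x = rng.standard_normal(d)
a = rng.standard_normal((k, d))
b, c, c0 = rng.standard_normal(k), rng.standard_normal(k), 0.1
print(f(x, a, b, c, c0, M))          # a bounded prediction in [-M, M]
print("W =", k * (d + 2) + 1)        # parameter count, O(k) as noted above
```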
Acknowledgments

V. M. was partially supported by the Center for Absorption in Science, Ministry of Immigrant Absorption, State of Israel. The work of R. M. was supported in part by a grant from the Israel Science Foundation. Part of this work was done while R. M. was visiting the Isaac Newton Institute, Cambridge, England. Support from the Ollendorff Center of the Department of Electrical Engineering at the Technion is also acknowledged. The work of P. B. was supported in part by a grant from the Australian Research Council. Thanks to two anonymous referees for helpful comments.

References

Adams, R. A. (1975). Sobolev spaces. New York: Academic Press.
Anthony, M., & Bartlett, P. L. (1998). A theory of learning in artificial neural networks. Unpublished manuscript.
Barron, A. (1994). Approximation and estimation bounds for artificial neural networks. Machine Learning, 4, 115–133.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44, 525–536.
Bartlett, P. L., Long, P. M., & Williamson, R. C. (1996). Fat-shattering and the learnability of real-valued functions. Journal of Computer and System Sciences, 52(3), 434–452.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Geman, S., & Hwang, C. R. (1982). Nonparametric maximum likelihood estimation by the method of sieves. Annals of Statistics, 10(2), 401–414.
Goldberg, P. W., & Jerrum, M. R. (1995). Bounding the VC dimension of concept classes parameterized by real numbers. Machine Learning, 18, 131–148.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.
Haussler, D. (1995). Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, A 69, 217–232.
Kantorovich, L. V., & Akilov, G. P. (1964). Functional analysis in normed spaces. Oxford: Pergamon Press.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Science, 54, 169–176.
Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Science, 54, 190–198.
Lee, W.-S., Bartlett, P. L., & Williamson, R. C. (1996). Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 42(6), 2118–2132.
Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Lugosi, G., & Zeger, K. (1995). Nonparametric estimation via empirical risk minimization. IEEE Transactions on Information Theory, 41(3), 677–687.
Maass, W. (1994). Neural nets with superlinear VC-dimension. Neural Computation, 6(5), 877–884.
Macintyre, A. J., & Sontag, E. D. (1993). Finiteness results for sigmoidal neural networks. In Proc. 25th ACM STOC.
Maiorov, V., & Meir, R. (1997). On the near optimality of the stochastic approximation of smooth functions by neural networks (Tech. Rep. No. CC-223). Haifa: Department of Electrical Engineering, Technion.
Mhaskar, H. (1996). Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1), 164–177.
Pollard, D. (1984). Convergence of empirical processes. New York: Springer-Verlag.
Sakurai, A. (1993). Tighter bounds on the VC-dimension of three-layer networks. In World Congress on Neural Networks. Hillsdale, NJ: Erlbaum.
Vidyasagar, M. (1996). A theory of learning and generalization. New York: Springer-Verlag.
Warren, H. E. (1968). Lower bounds for approximation by nonlinear manifolds. Trans. AMS, 133, 167–178.

Received December 1, 1997; accepted April 22, 1998.
LETTER
Communicated by Patrice Simard
The Diabolo Classifier Holger Schwenk LIMSI-CNRS, BP 133, 91403, Orsay Cedex, France
We present a new classification architecture based on autoassociative neural networks that are used to learn discriminant models of each class. The proposed architecture has several interesting properties with respect to other model-based classifiers like nearest-neighbors or radial basis functions: it has a low computational complexity and uses a compact distributed representation of the models. The classifier is also well suited for the incorporation of a priori knowledge by means of a problemspecific distance measure. In particular, we will show that tangent distance (Simard, Le Cun, & Denker, 1993) can be used to achieve transformation invariance during learning and recognition. We demonstrate the application of this classifier to optical character recognition, where it has achieved state-of-the-art results on several reference databases. Relations to other models, in particular those based on principal component analysis, are also discussed. 1 Introduction Most classification algorithms can be separated into two categories that we denote by classification by separation and classification by modeling. To formalize these ideas, let us call the d-dimensional vector x a pattern and X the space of all observable patterns. In the case of real-valued features, we would have X = Rd . Classification by separation tries to subdivide the space X by decision surfaces into several regions, each one corresponding to a class. These regions may extend to infinity in some directions, in particular when d is large. This complicates “distance rejection”—the rejection of patterns that do not correspond to any learning data. This does not necessarily involve the calculation of a distance, but only expresses the fact that an observed pattern is far from any known one. Classification by separation allows the rejection of ambiguous patterns, that is, those close to a decision boundary. It also has the advantage that the recognition complexity is generally independent of the number of training patterns.1 Multilayer neural networks whose outputs are interpreted as class labels are well-known examples of
1. More learning data, however, may require more complex architectures.
© 1998 Massachusetts Institute of Technology. Neural Computation 10, 2175–2200 (1998)
classification by separation. Note, however, that standard mean square error does not generally find the optimal decision surfaces. Classification by modeling, on the other hand, is based on the development of a model of each class and a measure of similarity between each of these models and the unknown pattern. During recognition, the best-fitting model is selected, or the pattern is rejected due to ambiguity (several models fit quite well). If all models fit badly, we have distance rejection. The simplest case, employing Euclidean distance and all learning data as reference vectors, gives us the well-known k-nearest-neighbors classifier. Several techniques have been proposed to learn a smaller set of references without loss of performance or even with improvements. This learning can treat the classes independently (e.g., k-means clustering) or can be discriminant, minimizing confusions between the classes (e.g., learning vector quantization, or LVQ). The common disadvantage of all these algorithms is the inefficient representation of the class models by enumeration of a rather large number of reference patterns. This leads to high memory requirements and slow recognition since many distances have to be calculated. Several authors have proposed speed-up techniques by precalculating relative properties of the references, but this increases memory use. A very interesting advantage of model-based classifiers, on the other hand, is the possibility of using a problem-specific similarity measure that incorporates a priori information about the nature of the data. It is well known that simple Euclidean distance is often inappropriate for comparing patterns since it increases rapidly for some perceptually small changes of the pattern. In optical character recognition, for instance, we know that small rotations should not change the class of a pattern. This invariance can be achieved by deformable templates (Widrow, 1973; Burr, 1981, 1983). Unfortunately, they are also computationally very expensive and discrimination may be difficult to realize. Revow, Williams, and Hinton (1996), for instance, describe generative models that are built from deformable B-splines with gaussian ink generators. The system achieves state-of-the-art performance on the CEDAR database of handwritten isolated digits, but the recognition speed is less than 0.1 character per second on an R4000-based workstation. Although the processing speed can be doubled by using a neural network to provide a better starting point (Williams, Revow, & Hinton, 1995, 1997), this approach probably cannot be used in practical applications. Instead of using a similarity measure, we can also directly estimate the class conditional probabilities. Classification is done using Bayes’ rule, which simplifies to choosing the most likely class if the class prior probabilities are identical. A popular nonparametric technique is kernel density estimation. As pointed out by Hinton, Dayan, and Revow (1997), there is a relationship between relative density estimation and classification by modeling using a similarity measure: if the kernel functions are radially symmetric, monotonically decreasing, and have unbounded extent (e.g., a gaussian), the relative density estimation becomes identical to nearest-neighbor
classification as the width parameter of the kernel goes to zero. Relative density estimation can be quite fast, in particular if we use a parametric or semiparametric model, such as a mixture of gaussians. One important drawback, however, is the lack of efficient techniques for incorporating a priori knowledge into the probability models.

2 Architecture of the Classifier

In this section we present a new classification architecture that uses a distributed representation of the class models and allows the efficient use of a problem-specific distance measure. The basic idea is to use autoassociative neural networks to learn a model of each class. These networks, also called autoencoders or Diabolo networks,2 are characterized by a small-dimensional hidden layer and an output layer of the same size as the input. We will call the weights from the input to the hidden layer "encoding weights" and those from the hidden layer to the output layer "generative weights". Training is normally done by error backpropagation using a target output that is identical with the network input. Therefore, this is an unsupervised learning algorithm. Existing applications include data compression (e.g., DeMers & Cottrell, 1993; Kambhatla & Leen, 1994) and coding (e.g., Fogelman-Soulie, Lamy, & Viennet, 1993). In both cases the hidden activations are used as a compact but approximate representation of the input vector. Baldi and Hornik (1989) have proved that an autoencoder with one hidden layer and linear activation functions performs principal component analysis (PCA). Our use of autoencoders is quite different from these applications. In the nondiscriminant version of our model, each autoencoder is trained with examples of only one class. There are no network outputs corresponding directly to class labels, but we have to decide which autoencoder to train with a particular example. Like k-means, this is globally supervised but locally unsupervised learning. In the recognition phase, the input vector is propagated through all autoencoders, and the corresponding reconstruction errors are calculated. The network with minimal error determines the class. This classifier does not use the hidden activations. It should also be better suited for high-dimensional classification problems than neural networks predicting class a posteriori probabilities directly, since each example induces many constraints on the parameters due to the autoassociative architecture. Figure 1 summarizes the architecture of the Diabolo classifier. The figure also illustrates typical classification behavior, here when presenting the handwritten character D. One can clearly see that the reconstruction distance for the D network is much lower than for the three others. The character is therefore correctly classified. It is also interesting to analyze the outputs of the two networks with the next nearest distances: the O network

2. The name is inspired by the game with a similar "conical" form.
[Figure 1 shows the character to classify fed as an input vector to four Diabolo networks (net B, net D, net K, net O); each network's output vector is compared with the input by a distance measure (scores 5.75, 1.48, 33.08, and 4.36, respectively), and an argmin decision module selects class D.]
Figure 1: Basic architecture of the Diabolo classifier.
“tries” to output a rounder character, and the B network “wants” to add a horizontal bar in the middle. The K network achieves the worst reconstruction error since the two characters have almost nothing in common. This simple classifier already achieves interesting results (see section 3), but it has two major drawbacks. First, all examples of one class are represented by one linear model, which is surely insufficient for many applications such as character recognition; second, there is no discrimination between the classes.
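A minimal sketch of the recognition phase of Figure 1 follows (ours, not the author's code; the linear autoencoders, random weights, and Euclidean score are simplifying assumptions). Each class is represented by one Diabolo network, and the decision module takes the argmin of the reconstruction distances.

```python
import numpy as np

class DiaboloClassifier:
    def __init__(self, autoencoders):
        self.autoencoders = autoencoders           # {label: (W_enc, W_gen)}

    def scores(self, x):
        out = {}
        for label, (W_enc, W_gen) in self.autoencoders.items():
            reconstruction = W_gen @ (W_enc @ x)   # propagate through one net
            out[label] = 0.5 * np.sum((x - reconstruction) ** 2)
        return out

    def classify(self, x):
        s = self.scores(x)
        return min(s, key=s.get)                   # argmin over distances

rng = np.random.default_rng(2)
n, h = 64, 8                                        # input size, bottleneck size
nets = {c: (rng.standard_normal((h, n)) / n, rng.standard_normal((n, h)) / h)
        for c in "BDKO"}
print(DiaboloClassifier(nets).classify(rng.standard_normal(n)))
```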
Nonlinearity can be achieved in three complementary manners:

1. Additional hidden layers with sigmoidal activation functions between the small bottleneck layer and the input and output layers may be used to perform nonlinear dimension reduction (DeMers & Cottrell, 1993; Fogelman-Soulie et al., 1993; Kambhatla & Leen, 1994). Several authors, however, have found that these networks have multiple local minima3 and that they are very difficult to train (DeMers & Cottrell, 1993; Kambhatla & Leen, 1994). We were not able to achieve big improvements with a Diabolo classifier using nonlinear autoencoders (Schwenk & Milgram, 1994).

2. Several autoencoders may be used for each class, giving a piecewise-linear model. This idea was used for lip reading (Bregler & Omohundro, 1994), data compression (Kambhatla & Leen, 1994), and character recognition (Hinton, Revow, & Dayan, 1995; Hinton et al., 1997). In all these applications, PCA was used to find the linear models instead of training a linear autoencoder with backpropagation. The distribution of the learning data to the different submodels can be done by vector quantization or by expectation-maximization (EM). As pointed out by Tipping and Bishop (1997), however, these mixture models are to some extent ad hoc since they do not rely on a well-defined probability density.

3. A nonlinear distance measure may be used as an objective function instead of the Euclidean distance. Any similarity measure can be used as long as its derivatives with respect to the network outputs can be calculated analytically.

This article will show that the third case gives a very powerful and computationally efficient way to incorporate domain-specific knowledge.4 A discriminant learning algorithm will also be introduced in section 2.1. But first we point out some differences between an approach using PCA to find the models and training autoencoders by error backpropagation. From a theoretical point of view, it is known that autoencoders with one hidden layer and linear activation functions perform PCA (Baldi & Hornik, 1989). It is also clear that PCA is computationally more efficient since it involves only finding the first h eigenvectors of the covariance matrix of the learning data. Fast numerical algorithms for this task are well known. Although its convergence is guaranteed since the error surface has no local minima, backpropagation will take more time for learning. In practice, we are mainly interested in fast recognition time, which is the same for both approaches, but fast training may allow us to find a better classifier. However,

3. A linear autoencoder with one hidden layer and Euclidean error has no local minima (Baldi & Hornik, 1989).
4. The basic idea was first presented in Schwenk and Milgram (1995b).
we will argue here that linear autoencoders, or equivalently PCA, may not be well suited for modeling in many applications. They are not able to cluster closed regions in input space since variations in the direction of the principal components are free and not limited. This may prevent rejection of unknown data. Autoencoders with a sigmoidal nonlinearity in the hidden layer do not have this drawback. This has been proved theoretically by Bianchini, Frasconi, and Gori (1995) and is also intuitively plausible. The set of all possible outputs of a linear autoencoder is given by the linear combinations of the generative weights plus the thresholds of the output layer. If the activations of the hidden units are bounded, for instance by a sigmoidal activation function, the possible outputs form a closed region. One could argue that very small encoding weights and large generative weights may allow coverage of arbitrarily large regions in input space, but we have never observed this situation in our experiments. In addition, the extent of the classes can be explicitly controlled by the discriminant learning algorithm. PCA also has a higher risk of overfitting the data since the models are developed independently, class by class, and it is difficult to control capacity. Furthermore, PCA minimizes reconstruction errors that are not directly related to classification error. For instance, when the number of retained principal components is increased, or equivalently the dimension of the hidden layer of an autoencoder, the reconstruction error on the training examples of the correct class will decrease, but not necessarily the classification error, since the reconstructions may also be very good for examples of other classes. Therefore, the capacity and training of each autoencoder have to be controlled so that it achieves a low reconstruction error only for examples of the correct class. This is not easy when using PCA. Backpropagation learning of autoencoders, on the other hand, can be stopped when the classification error on validation data stops decreasing. In fact, it has turned out in our experiments that a Diabolo classifier using one linear autoencoder for each class performed better when backpropagation with early stopping was used instead of numerical PCA (4.7% error instead of 5.3% for the handwritten digit recognition task described later in this article). Further improvements can be obtained by the discriminant learning algorithm presented in the following section. Finally, PCA is based on the Euclidean distance, which is not appropriate for many pattern recognition applications. Hinton et al. (1997) suggested incorporating knowledge about desired invariances by manipulating the covariance matrix in a way that simulates augmenting the learning data by artificially transforming them, but it seems that significant improvements could not be achieved by this method with respect to normal PCA. In the rest of this article, we show that a Diabolo classifier can benefit considerably from such knowledge by using a transformation-invariant distance measure as the objective function of the autoencoders.
2.1 Discriminant Learning. In the simple Diabolo classifier, each network was trained separately on examples of the corresponding class. In general, however, this does not prevent a network of a wrong class from achieving a low reconstruction error. This overfitting can be controlled to some extent by stopping learning when the classification error stops decreasing, but it can also be controlled explicitly. For this reason we have changed the learning algorithm so that a good reconstruction by a network of a wrong class is punished (Schwenk & Milgram, 1995a). To do this one can normalize the distance measures $D_i$ by the softmax function:
$$E_i = \frac{e^{-D_i}}{\sum_k e^{-D_k}}. \tag{2.1}$$
Learning is now done by gradient ascent in the direction $\frac{\partial E_i}{\partial o_i} = E_i(E_i - 1)\frac{\partial D_i}{\partial o_i}$, where i is the index of the correct class. Additionally we perform gradient descent in the direction $-\frac{\partial E_j}{\partial o_j}$ for a network of another class $j \ne i$, corresponding to an increase of the distance $D_j$ ("anti-update"). We could do this systematically for all networks, but this would result in an unnecessarily long learning time. In fact, after some training epochs, the models are quite good for most of the examples, so that a lot of anti-updates would be performed with a very low weighting. For this reason we increase the distance of at most one network. Let us call $D^{(1)}$ the smallest distance and $D^{(2)}$ the second smallest one. $C_{D^{(i)}}$ denotes the class of the net that achieved the i-smallest distance measure, and $C_{des}$ is the desired class of the current example. Using these notations, we have defined two conditions in which to punish a Diabolo network:

Condition for Algorithm 1
• The example is wrongly classified ($C_{D^{(1)}} \ne C_{des}$): Do an anti-update for the net with the smallest distance.

Conditions for Algorithm 2
• The example is wrongly classified (like condition 1): Do an anti-update for the net with the smallest distance.
• The example is correctly classified ($C_{D^{(1)}} = C_{des}$) but the relative difference to the second-best network, $(D^{(2)} - D^{(1)})/D^{(1)}$, is very small: Do an anti-update for the net with the second smallest distance.

In the first case each misclassification leads to an increase of the distance of the responsible autoencoder. The second version of discrimination also performs an anti-update when the pattern is just correctly classified. In some sense, we force a clearer decision, some kind of a margin, between the class models.
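The two algorithms can be sketched as follows (our simplification, not the author's code: one linear autoencoder per class, plain gradient steps, the softmax weighting $E_i(E_i - 1)$ of the step size omitted, and the margin threshold 0.1 an arbitrary illustrative value).

```python
import numpy as np

class LinearDiabolo:
    def __init__(self, n, h, rng):
        self.We = rng.standard_normal((h, n)) / n   # encoding weights
        self.Wg = rng.standard_normal((n, h)) / h   # generative weights

    def distance(self, x):
        return 0.5 * np.sum((x - self.Wg @ (self.We @ x)) ** 2)

    def update(self, x, sign, lr=0.01):
        h = self.We @ x
        r = x - self.Wg @ h                         # reconstruction residual
        gWg = np.outer(r, h)                        # -dD/dWg
        gWe = np.outer(self.Wg.T @ r, x)            # -dD/dWe
        self.Wg += sign * lr * gWg                  # sign=+1 decreases D,
        self.We += sign * lr * gWe                  # sign=-1 is an anti-update

def discriminant_step(nets, x, desired, margin=0.1):
    D = {c: net.distance(x) for c, net in nets.items()}
    first, second = sorted(D, key=D.get)[:2]
    nets[desired].update(x, sign=+1)                # correct net always updated
    if first != desired:                            # both algorithms: punish winner
        nets[first].update(x, sign=-1)
    elif (D[second] - D[first]) / D[first] < margin:
        nets[second].update(x, sign=-1)             # algorithm 2: enforce a margin

rng = np.random.default_rng(3)
nets = {c: LinearDiabolo(16, 4, rng) for c in "BDKO"}
discriminant_step(nets, rng.standard_normal(16), desired="D")
```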
Note also that the network of the correct class is always updated. Recently this intuitive algorithm has gained a theoretical justification: Schapire, Freund, Bartlett, and Lee (1997) have proved in a boosting context that generalization behavior improves when the number of examples with low margin, that is, the difference between the score of the correct class and the best score of a wrong class, is decreased. Although these algorithms seem to be very similar to the well-known LVQ, there are some important differences. First, the Diabolo classifier uses only one "reference vector" per class, the current output of the autoencoder, while there are normally several for LVQ. This implies that the two closest distances always belong to different classes. Note, however, that not all information is compiled down to only one reference for each class, since the outputs of the autoencoders are different for each pattern. LVQ2 performs changes only when there are misclassifications, that is, on references close to a decision surface. This is not possible with the Diabolo classifier since an autoencoder uses a distributed representation of the "references," which prevents the adaptation of only one reference. It is also for this reason that discrimination is performed from the beginning of learning. Several experiments have indeed shown that discriminant training of a Diabolo classifier whose autoencoders had been previously learned separately for each class did not lead to improvements. The discriminant learning algorithms also make it less likely that the autoencoders associate arbitrarily large regions in the feature space.

2.2 Knowledge Incorporation by Tangent Distances. In many real-world applications, we know that the classification should be at least locally invariant under a set of problem-specific transformations. In character recognition, for instance, small translations, rotations, or scale changes do not change the class of a pattern. Simard, Le Cun, and Denker (1993) have proposed a solution to this problem by using a new distance measure, the so-called tangent distance. Let us represent a transformation t as a parameterized function in the space of all patterns X:
$$t(x, \alpha) : X \times \mathbb{R} \to X \tag{2.2}$$
$$t(x, 0) = x \quad \text{for all } x. \tag{2.3}$$
Note that all patterns $M_x = \{y \mid y = t(x, \alpha)\}$ that can be generated by the transformation form a one-dimensional curve in the pattern space X. It may be necessary to use smoothing techniques to make it continuous and to guarantee the existence of the derivatives with respect to $\alpha$. $M_x$ is in general highly nonlinear and mathematically intractable. Therefore, a first-order approximation is used:
$$t(x, \alpha) \approx x + \alpha t_x, \quad \text{with } t_x = \left.\frac{\partial t(x, \alpha)}{\partial \alpha}\right|_{\alpha=0}, \tag{2.4, 2.5}$$
where $t_x$ is called the tangent vector at the point x. The tangent vector for a specific transformation can be calculated in several ways that are detailed in appendix A. The definition (see equation 2.4) can be generalized to c transformations:
$$t(x, \alpha) \approx x + \alpha_1 t_x^1 + \cdots + \alpha_c t_x^c = x + T_x \alpha, \tag{2.6}$$
where $\alpha = (\alpha_1, \cdots, \alpha_c)^T$ is a c-dimensional parameter vector and $T_x = [t_x^1, \cdots, t_x^c]$ is a matrix of dimension $n \times c$ regrouping the tangent vectors. Using these definitions, tangent distance is defined as the minimal distance between the hyperplane containing point x and the hyperplane containing point y (Simard et al., 1993):
$$D_{xy}(x, y) = \min_{\alpha, \beta} \frac{1}{2}\left\|x + T_x\alpha - y - T_y\beta\right\|^2 = \frac{1}{2}\left\|x + T_x\alpha^* - y - T_y\beta^*\right\|^2. \tag{2.7, 2.8}$$
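Tangent vectors themselves are easy to obtain numerically. The sketch below (ours; appendix A of the article derives them analytically, and SciPy's rotate is merely a convenient stand-in for the transformation t) approximates the rotation tangent vector of equation 2.5 by a central finite difference on a small image.

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_tangent(image, eps_deg=1.0):
    """Central difference approximation of dt(x, alpha)/dalpha at alpha = 0."""
    plus = rotate(image, +eps_deg, reshape=False, order=3)
    minus = rotate(image, -eps_deg, reshape=False, order=3)
    return (plus - minus) / (2.0 * np.deg2rad(eps_deg))

img = np.zeros((16, 16))
img[4:12, 7:9] = 1.0                      # a crude vertical stroke
t_x = rotation_tangent(img).ravel()       # one column of the matrix T_x
```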
This distance5 is locally invariant under the set of specified transformations of the vectors x and y. Note that in many real-world applications, a truly invariant distance measure is not desired since it would match a "6" with a "9" after a rotation by $\pi$. As we will show later, it is possible to calculate the derivatives of $D_{xy}$ with respect to the patterns x or y, respectively. It can therefore be used as an objective function for an autoencoder. During learning, we minimize the tangent distance $D_{do}$ between the desired output d, which is, of course, identical with the net input, and the net output o. The networks do not need to learn each example separately anymore, possibly in different positions, orientations, and with varying thickness; instead they can use the set of specified transformations in order to find a common nonlinear model of each class. This should lead to a lower reconstruction error for examples of the correct class. We could also achieve an arbitrarily low reconstruction error on the training data by increasing the number of hidden units, but the models would probably be less class specific and they would achieve a good reconstruction for examples of other classes as well.

5. Tangent distance is not a real distance measure since it does not satisfy the triangle inequality.
This may worsen the classification behavior even though the reconstruction errors for each class are lower. Also, in some applications, reconstruction should not be perfect if the training examples are very noisy. For this reason, in most PCA-based classification systems, the number of retained principal components is chosen so as to capture about 95% of the variance. Tangent distance, however, is unlikely to improve the reconstruction of noise since, in general, a noisy pattern cannot be generated from a clean one by applying an affine or a thickness transformation (see appendix A for a description of the transformations used). For the sake of brevity, we use the following notation for the partial derivatives with respect to several variables regrouped in a vector, and for the derivatives of a vector with respect to another vector:

$$\frac{\partial s}{\partial x} = \left( \frac{\partial s}{\partial x_1} \;\cdots\; \frac{\partial s}{\partial x_n} \right), \qquad \frac{\partial v}{\partial x} = \begin{pmatrix} \dfrac{\partial v_1}{\partial x_1} & \cdots & \dfrac{\partial v_1}{\partial x_n} \\ \vdots & & \vdots \\ \dfrac{\partial v_c}{\partial x_1} & \cdots & \dfrac{\partial v_c}{\partial x_n} \end{pmatrix}. \qquad (2.9)$$
To minimize equation 2.7, we set

$$\frac{\partial D_{do}(d, o)}{\partial \alpha^*} = \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T T_d \overset{!}{=} 0^T, \qquad (2.10)$$

$$\frac{\partial D_{do}(d, o)}{\partial \beta^*} = -\left( d + T_d\alpha^* - o - T_o\beta^* \right)^T T_o \overset{!}{=} 0^T, \qquad (2.11)$$
where 0^T is a row vector of appropriate size. After some simplification, we can summarize equations 2.10 and 2.11 into one system of linear equations with 2c unknown variables:

$$\underbrace{\begin{pmatrix} T_{dd} & -T_{do} \\ T_{od} & -T_{oo} \end{pmatrix}}_{=:\,T} \underbrace{\begin{pmatrix} \alpha^* \\ \beta^* \end{pmatrix}}_{=:\,x} \overset{!}{=} \underbrace{\begin{pmatrix} T_d^T \\ T_o^T \end{pmatrix}}_{=:\,R} (o - d), \qquad (2.12)$$

where T_dd = T_d^T T_d, T_do = T_d^T T_o, T_od = T_do^T, and T_oo = T_o^T T_o. The numerical solution of Tx = R finally gives us the optimal multipliers α* and β*. Alternatively, we can simplify and rearrange equations 2.10 and 2.11 to obtain two linear systems of c unknown variables each (Simard et al., 1993). This, however, involves more matrix multiplications, which are computationally much more demanding than the numerical solution of the bigger linear system. Nevertheless, it can be advantageous if LU decompositions of the matrices can be precalculated, for example, when using a nearest-neighbor classifier.
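As a direct transcription of equation 2.12 (a minimal sketch in NumPy, not the original implementation; it assumes the system matrix T is nonsingular):

```python
import numpy as np

def tangent_distance_double(d, o, Td, To):
    """Double-sided tangent distance (equations 2.7-2.12).
    d, o: pattern vectors of dimension n; Td, To: n x c tangent matrices."""
    c = Td.shape[1]
    Tdd, Tdo, Too = Td.T @ Td, Td.T @ To, To.T @ To
    # Assemble the left-hand side T and right-hand side R of equation 2.12.
    T = np.block([[Tdd, -Tdo], [Tdo.T, -Too]])
    R = np.vstack([Td.T, To.T]) @ (o - d)
    ab = np.linalg.solve(T, R)          # optimal multipliers (alpha*, beta*)
    alpha, beta = ab[:c], ab[c:]
    r = d + Td @ alpha - o - To @ beta  # residual between the two hyperplanes
    return 0.5 * r @ r                  # equation 2.8
```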
We also use a single-sided version of the tangent distance, in which only one vector can be transformed (the notation D_d versus D_do indicates which vectors can be transformed):

$$D_d(d, o) = \min_{\alpha} \frac{1}{2} \left\| d + T_d\alpha - o \right\|^2 \qquad (2.13)$$

$$= \frac{1}{2} \left\| d + T_d\alpha^* - o \right\|^2. \qquad (2.14)$$
By setting

$$\frac{\partial D_d(d, o)}{\partial \alpha^*} = \left( d + T_d\alpha^* - o \right)^T T_d \overset{!}{=} 0^T, \qquad (2.15)$$
we easily get

$$\alpha^* = T_{dd}^{-1} T_d^T (o - d). \qquad (2.16)$$
Single-sided tangent distance is, of course, less powerful, but it has computational advantages. First, we need to perform only one matrix multiplication and inversion instead of setting up and solving the linear system (see equation 2.12). Furthermore, these operations need to be done only once, since the term T_dd^{-1} T_d^T in equation 2.16 is the same for all the autoencoders of the different classes. Using a Diabolo classifier with the single-sided tangent distance is therefore only slightly more expensive than using the standard Euclidean error. Figure 2 shows a graphical comparison of the Euclidean distance with the two tangent distances.

2.3 Learning Algorithms. When training an autoencoder by backpropagation, we need to calculate the derivatives of the objective function with respect to the network weights or, after application of the chain rule, with respect to the network outputs. For the single-sided tangent distance, we get:

$$-\frac{\partial D_d(d, o)}{\partial o} = -\left( d + T_d\alpha^* - o \right)^T \left( \frac{\partial T_d\alpha^*}{\partial o} - I \right)$$

$$= \left( d + T_d\alpha^* - o \right)^T - \underbrace{\left( d + T_d\alpha^* - o \right)^T T_d}_{=\,0^T \text{ by equation 2.15}} \frac{\partial \alpha^*}{\partial o}$$

$$= \left( d + T_d\alpha^* - o \right)^T. \qquad (2.17)$$
This is a very intuitive result: the net output does not approach the desired output directly; instead, it takes the shortest path toward the tangent hyperplane (see also Figure 2).
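The single-sided case is compact enough for a short sketch (again an illustration rather than the original code), combining equations 2.16, 2.14, and 2.17:

```python
import numpy as np

def tangent_distance_single(d, o, Td, P=None):
    """Single-sided tangent distance (equations 2.13-2.16) and its
    gradient with respect to the net output o (equation 2.17).
    P = Tdd^{-1} Td^T depends only on d, so it can be precomputed once
    and shared across the autoencoders of all classes."""
    if P is None:
        P = np.linalg.solve(Td.T @ Td, Td.T)   # Tdd^{-1} Td^T
    alpha = P @ (o - d)                        # equation 2.16
    r = d + Td @ alpha - o                     # residual to tangent plane
    dist = 0.5 * r @ r                         # equation 2.14
    grad_o = -r                                # dD/do = -r^T by eq. 2.17
    return dist, grad_o
```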
[Figure 2: Comparison of the Euclidean distance with the different tangent distances. Notation: d, desired output; o, net output; t_d, t_o, tangent vectors at d and o; M_d, M_o, transformation manifolds through d and o; D, Euclidean distance; D_d, single-sided tangent distance (only d is transformed); D_do, double-sided tangent distance (both points are transformed); ∇D_d, gradient of D_d.]
The situation is more complicated when using the double-sided tangent distance as the objective function:

$$-\frac{\partial D_{do}(d, o)}{\partial o} = -\left( d + T_d\alpha^* - o - T_o\beta^* \right)^T \left( \frac{\partial T_d\alpha^*}{\partial o} - I - \frac{\partial T_o\beta^*}{\partial o} \right) \qquad (2.18)$$

$$= \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T - \underbrace{\left( d + T_d\alpha^* - o - T_o\beta^* \right)^T T_d}_{=\,0^T \text{ by equation 2.10}} \frac{\partial \alpha^*}{\partial o} + \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T \left( \frac{\partial T_o}{\partial o}\beta^* + T_o \frac{\partial \beta^*}{\partial o} \right)$$

$$= \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T + \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T \sum_{l=1}^{c} \beta_l^* \frac{\partial t_o^l}{\partial o} + \underbrace{\left( d + T_d\alpha^* - o - T_o\beta^* \right)^T T_o}_{=\,0^T \text{ by equation 2.11}} \frac{\partial \beta^*}{\partial o} \qquad (2.19)$$

$$= \left( d + T_d\alpha^* - o - T_o\beta^* \right)^T \left( I + \sum_{l=1}^{c} \beta_l^* \frac{\partial t_o^l}{\partial o} \right). \qquad (2.20)$$
This means that we have to calculate the derivatives of the tangent vectors of the network output (see appendix B). Despite this complicated error function, standard stochastic backpropagation was used. In practice, a small "trick" proved to be useful: for the calculation of the derivatives, we replaced the definition (see equation 2.8) by

$$D_{do}(d, o) = \frac{1}{2} \left\| d - o + \lambda \left( T_d\alpha^* - T_o\beta^* \right) \right\|^2, \qquad (2.21)$$
where λ is a parameter that is increased from 0 to 1. For λ = 0 we have the Euclidean distance, and for λ = 1 the double-sided tangent distance. The rationale is that at the beginning of learning the output vectors do not always resemble characters, so applying the transformations to them may be meaningless. In general, however, the choice of the learning parameters is not very critical.
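A minimal sketch of this blending, equation 2.21, together with a hypothetical linear ramp for λ (the text does not specify the exact schedule):

```python
def blended_residual(d, o, Td, To, alpha, beta, lam):
    """Residual of the blended objective, equation 2.21.
    lam = 0 gives the plain Euclidean distance, lam = 1 the full
    double-sided tangent distance."""
    return d - o + lam * (Td @ alpha - To @ beta)

def lam_schedule(epoch, n_anneal=10):
    """Hypothetical linear ramp of lam over the first n_anneal epochs."""
    return min(1.0, epoch / n_anneal)
```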
2.4 Comparison with Similar Architectures. Several authors have developed architectures that share some ideas with the Diabolo classifier. Pomerleau (1993), for instance, has suggested adding outputs trained to reproduce the inputs to a standard discriminant neural network. After learning, the reconstruction error is used to decide whether the network is being applied to a known situation, in this case, whether an autonomous vehicle is driving on a known road situation. This technique, called input reliability estimation, was also used to select between several networks trained for different road situations. This is similar to a Diabolo classifier using Euclidean distance and nondiscriminant training, but the main goal is to learn to drive the vehicle. It is also an open question whether the combination of autoassociative learning with discriminant outputs negatively influences global performance.

Hinton et al. (1995) proposed classifying handwritten digits by learning piecewise linear models of each class. Classification is done by choosing the most likely submodel. Each class is represented as a mixture of 10 linear models. Learning is done separately for each class, using an EM algorithm to assign the responsibilities to the different linear models within one class, while numerical PCA (Hinton et al., 1995) or factor analysis (Hinton et al., 1997) is used to learn the linear models themselves. The resulting classifier is quite fast, but it is unknown whether it can achieve state-of-the-art results in comparison to a neural network using shared weights or a nearest-neighbor classifier using tangent distance. The authors also suggested incorporating knowledge about desired invariances by manipulating the covariance matrix in a way that simulates adding artificially transformed examples, but significant improvements were not achieved by this method with respect to normal PCA or factor analysis (FA). This may be explained by the fact that this technique is in some sense equivalent to the less powerful single-sided tangent distance, since transformations are not applied to the test vector. It may also be that the artificially transformed data would necessitate the use of more free parameters. Recently this architecture was refined by Tipping and Bishop (1997), who used a well-defined mixture model of probabilistic principal component analyzers. They report an error rate of 4.64% on the "bs" test set of the CEDAR handwritten digit database, which is a slight improvement with respect to the mixture of PCAs (4.91%) but equivalent to the mixture of FAs (4.68%).

Other applications include bank note identification (Frosini, Gori, & Priami, 1996), wire-frame classification (Suzuki & Ando, 1995), and fault detection (Japkowicz, Myers, & Gluck, 1995). All of these systems rely on autoencoders using simple Euclidean distance, and they use a nondiscriminant learning algorithm. We will show in the following section that, in particular, using tangent distance as an objective function and a better learning algorithm dramatically improves the performance of an autoencoder-based classification architecture.

To the best of our knowledge, the only learning algorithms for tangent-distance-based classifiers that have been presented until now are due to Hastie and Simard (1995). The approach seeks to find, separately for each class, a centroid m minimizing the average tangent distance with respect to the learning data x_i:
$$C(m) = \sum_i \min_{\alpha_i, \beta_i} \left\| m + T_m\alpha_i - x_i - T_{x_i}\beta_i \right\|^2. \qquad (2.22)$$
This is a difficult nonlinear optimization problem, but an iterative procedure is available. Instead of specifying the desired invariances implicitly via the rules for calculating the tangent vectors, the authors also suggest learning the invariances from the data itself by minimizing the following criterion:

$$C(m, \Gamma) = \sum_i \min_{\alpha_i, \beta_i} \left\| m + \Gamma\alpha_i - x_i - T_{x_i}\beta_i \right\|^2, \qquad (2.23)$$
where Γ is an n × c matrix and c denotes the number of invariances we want to learn. Again, there is an iterative procedure, based on singular value decomposition. Both algorithms have been used as the building block in k-means clustering in order to learn several centroids for each class. Error rates of 3.4% on the digits database of the U.S. Postal Service using only 23 prototypes per class have been reported, while using all 700 references per class gives an error of 2.6% (both results used the double-sided tangent distance).

3 Experimental Results

In 1992 the National Institute of Standards and Technology (NIST) provided a database of handwritten digits and letters, known as the NIST Special Database 3. This database contains about 45,000 uppercase segmented characters, which we have divided into a learning and validation set (60%) and a test set (40%). We also report results on the digits of this database. We applied only a very simple preprocessing: the binary characters were centered and size normalized, the aspect ratio was kept, and the characters were smoothed using a gaussian kernel. We neither deslanted the characters nor tried to clean background noise that could hurt the normalization process. The net input is 16 × 16 pixels with real values in the range [−1, 1].

All the following results were obtained using one fully connected autoencoder for each class. The dimension of the hidden layer was 40 when Euclidean distance was used and 30 with the single- or double-sided tangent distance. These values were chosen after some initial experiments, but this parameter turned out not to be very critical. Note that all classes use the same architecture for the autoencoders, although it may be reasonable to lower the dimension for classes with little variability. A set of seven transformations (x- and y-translation, rotation, scaling, axial deformation, diagonal deformation, and thickness) was used (see appendix A for the calculation of the tangent vectors).

Figure 3 illustrates how the networks use the transformations. Although the L in the first example has an unusually short horizontal line, the network reconstructs a normally sized character. It is clearly visible how the input transformation lengthens and the output transformation shortens this line in order to obtain a small tangent distance. The right side of the first line shows a very difficult classification problem: a heavily deformed T. Nevertheless, we get a small tangent distance, so the character is correctly classified.
The second line shows an example of a wrong classification: the T is recognized as an F. The output of the T network can be transformed so that the shear is correct, but since the transformations are applied to the whole character, they cannot shift the vertical bar toward the middle of the top bar of the T (right side of the second line). This leads to a lower reconstruction error of the F network (left side of the second line). We were not able to find an example for which a wrong application of the transformations led to a misclassification. In summary, we note a big difference between the Euclidean and the tangent distances, a good indicator that the autoencoders really use the transformations.

3.1 NIST Uppercase Letters. Figure 4 summarizes the results on the whole test set of 17,982 handwritten uppercase letters. Two criteria were used: the raw error rate and the 1% rejection rate (the number of examples that must be rejected in order to get only 1% error on the accepted ones). Characters were rejected when the relative difference between the two closest distances fell below a threshold; this criterion rejects ambiguous patterns (a minimal formulation of this rule is sketched after this section). Distance rejection could be based on the absolute values of the reconstruction errors, but it was not tried since we were missing appropriate counterexamples.

The incorporation of knowledge by using tangent distance as an objective function dramatically improved the performance. The Diabolo classifier achieves a raw error rate of 4.7% with simple Euclidean distance, which goes down to 3.7% with the single-sided tangent distance and to only 2.6% with the double-sided tangent distance. The 1% rejection rates are 11.7%, 8.0%, and 4.2%, respectively. The discriminant learning algorithms improved these results further. This can be clearly seen for the simplest Diabolo classifier using Euclidean distance: the error rate could be reduced by 0.3% and the 1% rejection rate by as much as 2.6%. Since the accuracy of a Diabolo classifier is quickly improved by the incorporation of knowledge through transformation invariance, it becomes more and more difficult to achieve significant gains. Thus, it was not possible to reduce the error rate of the Diabolo classifier using double-sided tangent distance (nor was this really expected); most of the remaining misclassifications are obviously inevitable ambiguous cases in this database. The rejection rate, on the other hand, could be significantly reduced, by 0.6%; in many applications, this is an even more important quality criterion than the raw error rate. The networks trained by the second, more discriminant version showed a slightly better generalization behavior.

In order to get the same low raw error rate with a 1-nearest-neighbor classifier using tangent distance, the entire set of 26,969 reference vectors had to be used (see Table 1). It is worth noting the results with fewer references: when using only 17,982 references, the error rate increased to 3.7% for the single-sided version and 2.8% for the double-sided version. The Diabolo classifier also has both a significantly better rejection behavior and lower computational complexity.
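One plausible formulation of the rejection criterion described above (the exact definition of "relative difference" and the threshold value are assumptions of this sketch):

```python
def accept(distances, threshold):
    """Ambiguity rejection: keep a decision only if the two smallest
    class distances differ enough in relative terms; threshold is a
    hypothetical tuning parameter."""
    d1, d2 = sorted(distances)[:2]    # two closest distances
    return (d2 - d1) / d2 > threshold
```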
[Figure 3: Reconstruction examples (test set). The left side of each screen dump depicts the input character and the right side the one reconstructed by the network of the correct class. In the middle are the optimally transformed patterns as calculated when evaluating the double-sided tangent distance, transformed by α* and β*, respectively. The second line shows a wrong classification: the reconstruction error of the F network (left side) is smaller than that of the correct T network (right side).]
[Figure 4: Error (left) and rejection rates (right) of a Diabolo classifier on the test set of NIST uppercase letters using the different discrimination methods. Bar values (without discrimination / discriminant version 1 / discriminant version 2): Euclidean distance, 4.7/4.4/4.4% error and 11.7/10.0/9.1% rejection; single-sided tangent distance, 3.7/3.6/3.5% error and 8.0/6.8/6.7% rejection; double-sided tangent distance, 2.6/2.6/2.6% error and 4.2/3.7/3.6% rejection.]

Table 1: Comparison of the Diabolo Classifier with Other Well-Known Systems, NIST Uppercase Letters.
Test set                  Discriminant Diabolo   1-Nearest-Neighbor of P. Simard   LeNet 1       LeNet 1 + Boosting
Raw error                 2.55%                  2.58%                             4.0% (a)      2.4% (a)
1% rejection              3.63%                  4.40%                             9.2% (a)      3.1% (a)
Complexity: Memory        400 KB                 27 MB                             100 KB        300 KB
Complexity: Speed (b)     18 ch/s                2 ch/s                            ≈ 50 ch/s     ≈ 16-28 ch/s

(a) Test database of 3,000 examples.
(b) Characters per second (ch/s) on a Pentium Pro 200 MHz.
Characters were stored as floating-point numbers, and all reference vectors were loaded into memory. No other properties were precalculated. Double-sided tangent distance was calculated on only the 256 reference vectors closest in Euclidean distance. Using this implementation, the 1-nearest-neighbor approach recognizes about 2 characters per second on a PC with a Pentium Pro 200 MHz CPU, while the Diabolo classifier achieves a recognition speed of more than 18 characters per second. Simard (1994) and Sperduti and Stork (1995) have proposed more elaborate speedup techniques for a k-nearest-neighbor classifier using tangent distance, but these necessarily increase the memory requirements further.
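A minimal sketch of this two-stage filtering (tangent_distance_double is taken from the earlier sketch; tangent_matrix is a hypothetical helper that returns the n × c tangent matrix of a pattern):

```python
import numpy as np

def nn_tangent_filtered(x, refs, labels, ref_tangents, k=256):
    """1-nearest-neighbor with Euclidean prefiltering: the expensive
    double-sided tangent distance is computed only on the k references
    closest in Euclidean distance (k = 256 in the text).
    refs: (M, n) reference vectors; ref_tangents: their tangent matrices."""
    d2 = np.sum((refs - x) ** 2, axis=1)      # squared Euclidean distances
    candidates = np.argsort(d2)[:k]           # keep the k closest references
    Tx = tangent_matrix(x)                    # hypothetical helper
    best = min(candidates, key=lambda i: tangent_distance_double(
        x, refs[i], Tx, ref_tangents[i]))
    return labels[best]
```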
Table 2: Comparison of the Diabolo Classifier with Other Well-Known Systems: NIST Digits, 10 Runs Using Different Partitions.

                                      Raw Error Rate                  1% Rejection Rate
Classifier                            Best     Average   Worst        Best     Average   Worst
1-nearest-neighbor with
  tangent distance                    0.68%    0.98%     1.32%        (a)      0.11%     0.60%
Diabolo                               0.94%    1.19%     1.49%        (a)      0.44%     1.15%
LeNet 1                               (b)      1.4%      (b)          (b)      0.7%      (b)
Autoencoder + RBF                     (b)      1.9%      (b)          (b)      1.9%      (b)
MLP                                   (b)      1.9%      (b)          (b)      2.1%      (b)
Combination of the 3 above            (b)      1.1%      (b)          (b)      0.5%      (b)

(a) The rejection rate is not evaluated if the raw error is less than 1%.
(b) Only one run, using 15,000 digits for training and 4,000 for testing.
Table 1 also contains the results obtained with a large multilayer perceptron with extensive use of shared weights, known as LeNet 1 (Le Cun et al., 1989). It achieves an error rate of 4.0% when used alone and 2.4% for an ensemble of three such networks trained by boosting (Drucker, Schapire, & Simard, 1993). The networks were trained on a basic set of 10,000 examples, and the validation and test sets consisted of 2,000 and 3,000 examples, respectively (Drucker, personal communication). Due to the different number of examples, the results should be compared with care, but we can probably deduce that the state of the art on this database is around 2.5%. Meanwhile, there are considerably improved versions of LeNet 1 (Le Cun et al., 1995), but to the best of our knowledge they have not been tested on the uppercase letters of the NIST database.

3.2 NIST Digits. We have also evaluated the Diabolo classifier, with exactly the same parameters, on the digits of the NIST database. This database has been used by many researchers, but unfortunately there is no well-defined training and test set, so results from different authors should be compared with care. In our experiments, we used only the first 25,000 of the more than 200,000 available patterns. These digits were divided as follows: 15,000 examples for learning and validation and 10,000 examples for the independent test set. The preprocessing is identical to the one used for the letters. We found that these data contain parts of varying difficulty; therefore, Table 2 gives the best, worst, and average results for 10 different partitions of the whole data.

A similar setup was used by Viennet (1993) and Lamy (1995). The authors used one fixed partition of 15,000 digits for learning and only 4,000
for testing. It is not clear whether an independent validation set was used for tuning parameters. Using these data, the authors evaluated the well-known LeNet 1, a modular system using one autoencoder for feature extraction followed by a radial basis function (RBF) network for classification (Fogelman-Soulie et al., 1993; Viennet, 1993), and a multilayer perceptron (MLP) using hand-optimized features (a combination of oriented profiles in three directions, point occurrences, morphological characteristics, and so forth). The last system is in fact an improvement of the one first proposed by Burel, Pottier, and Catros (1992). The comparison with our results is not easy, but the Diabolo classifier seems to outperform all the other single classifiers (the 1-nearest-neighbor is slightly better, but it has a very high computational complexity). Also, the Diabolo classifier achieves results at least as good as the additive combination of the three aforementioned classifiers (Lamy, 1995), a system that takes advantage of three different coding schemes. This suggests that a priori knowledge should be incorporated into the learning algorithm rather than into the feature extraction.

Another very complete comparison of several classification architectures was done by Le Cun et al. (1995). However, 60,000 digits for training and 10,000 for testing were used, coming half and half from the NIST databases SD3 and TD1. Therefore, we do not think that their results can be directly compared with ours, but the approaches tested, in particular LeNet 5, boosting, and support vector machines, may also achieve very low error rates on the part of the NIST SD3 digits we used for our study.

3.3 Digit Database of the French Postal Service. Finally, we have tested the Diabolo classifier on a database of segmented handwritten digits of the French Postal Service (Schwenk & Milgram, 1996b). It contains well-defined subsets: 9,466 examples for learning and 2,371 digits for validation. The French Postal Service has another internal set of 7,397 examples coming from the same envelopes. The evaluation of the performance on these data is done by the French Postal Service in order to guarantee a correct comparison.

In all three data sets, there is great variety in the number of examples of each class. There are, for instance, almost 2,400 zeros but only 400 eights in the learning database. If we trained a Diabolo classifier with these original data, the models for the zeros would converge much faster than the other ones. Therefore, we discarded some of the examples or duplicated them in order to get approximately 1,000 training examples for each class. This is a brute-force method with several disadvantages. More elaborate techniques are possible, for instance, applying small random transformations instead of duplicating examples exactly, but they were not tried. Note also that we did not use the class prior probabilities during classification.

The preprocessing and the parameters of the Diabolo classifier are the same as for the experiments with the NIST database. However, we found that it was not favorable to use the thickness transformation for either the 1-nearest-neighbor or the Diabolo classifier. This can be explained
Table 3: Comparison of Several Classifiers on the French Postal Service Database.

                            1-nn       Diabolo    LeNet 1        RBF               MLP
                            (pixmap    (pixmap    (pixmap        (morphological    (hybrid
                            features)  features)  features)      features)         features)
Validation set: raw error   1.31%      1.60%      2.4%           2.96%             Not reported
Test set: raw error         2.20%      2.66%      Not reported   3.71%             4.9%
by the small size of the digits: 20 × 30 pixels, in comparison to 60 × 80 for the NIST database. The normalized 16 × 16 images are therefore almost binary, and they do not contain enough gray values to apply a thickness transformation correctly, even with smoothing.

The results of a 1-nearest-neighbor and a Diabolo classifier, both using double-sided tangent distance, as well as of other systems proposed in the literature, are summarized in Table 3. The slightly modified LeNet 1 was evaluated on this database by Bouatour and Julien (1993). The RBF uses morphological features (Lemarié, 1993), and the MLP using handcrafted features was described above (Pottier & Burel, 1993; Burel et al., 1992). Again, the Diabolo classifier outperforms all other neural network classifiers in this comparison. Its results are only slightly worse than those of the 1-nearest-neighbor classifier, while using only a fraction of the memory and time.

4 Conclusion

We have presented a new classification architecture that has several interesting properties for pattern recognition applications. It combines the advantages of a model-based classifier, in particular modularity and the better possibility of rejecting patterns that are far from any known class, with a compact distributed representation of the models and the need to evaluate only a few distance metrics. On the other hand, one may argue that the computational complexity of the Diabolo classifier increases linearly with the number of classes. A possible solution could be to train a simple classifier to predict a set of n classes, with n smaller than the number of actual classes, that contains the correct one with high accuracy. During recognition, one would evaluate the models of the Diabolo classifier only for these classes.

During the past few years, the tangent distance of P. Simard has evolved into a popular idea and one of the most successful approaches for achieving arbitrary transformation invariance in optical character recognition. To the
best of our knowledge, the proposed Diabolo classifier is the only approach that learns discriminant tangent models. This also distinguishes our approach from those taken by Tipping and Bishop (1997), Hinton et al. (1997), and others that aim to circumvent the limitations of single PCA models by mixture models. Those models are based on statistical properties of the learning data, and there is no incorporation of a priori knowledge about particular properties of the learning problem. Also, the models are not discriminant, since they are developed independently for each class. Our experimental results in handwritten character recognition suggest that discrimination and efficient knowledge incorporation can lead to very good generalization behavior.

There are, of course, applications where a priori knowledge is not available and standard Euclidean distance has to be used, but discriminant training is still possible. In these cases, a mixture model (also nondiscriminant) may yield better results than the proposed Diabolo classifier. However, the EM approaches used to train PCA mixture models could in principle also be applied to autoencoders using tangent distance in the objective function. This may improve the results in some cases and is an interesting topic for future research. We are also looking into applications of the Diabolo classifier to other domains, in particular online character recognition (Schwenk & Milgram, 1996a) and speech processing.

Appendix A: Calculation of the Tangent Vectors

The following derivation follows the work of Simard, Le Cun, Denker, and Victorri (1992); see also Hastie and Simard (1995) for an alternative formulation. The basic idea is to perform a transformation of a two-dimensional gray-scale image p(x, y) by altering the coordinates x and y at which p is evaluated:

$$t(p, \alpha) := p(x + \alpha D_x,\; y + \alpha D_y). \qquad (A.1)$$
Using a first-order Taylor series expansion, we can write

$$t(p, \alpha) \approx p(x, y) + \alpha \left( D_x \underbrace{\frac{\partial p(x, y)}{\partial x}}_{=:\,S_x} + D_y \underbrace{\frac{\partial p(x, y)}{\partial y}}_{=:\,S_y} \right) \qquad (A.2)$$

$$= p(x, y) + \alpha \left( D_x S_x + D_y S_y \right), \qquad (A.3)$$
and we finally get

$$t(p, \alpha) = p + \alpha\, t_p \qquad (A.4)$$

$$\text{with } t_p = D_x S_x + D_y S_y. \qquad (A.5)$$
Table 4: Coordinate Displacements for the Seven Transformations Used.

Transformation            D_x              D_y
Horizontal translation    1                0
Vertical translation      0                1
Scale                     x                y
Axial deformation         −x               y
Rotation                  −y               x
Diagonal deformation      y                x
Thickness                 ∂p(x, y)/∂x      ∂p(x, y)/∂y
The coordinate displacements D_x and D_y for the seven transformations used are given in Table 4. Several approaches can be used to approximate the derivatives S_x and S_y (Hastie & Simard, 1995):

1. Use first differences in each direction.

2. Convolve the image with a smooth bivariate kernel and then differentiate. In practice, this amounts to differentiating the kernel first (separately in x and y) and then convolving:

$$t(p, \alpha) \approx p * g + \alpha\, t_p \qquad (A.6,\ A.7)$$
$$\text{with } t_p = D_x \underbrace{\left( p * \frac{\partial g}{\partial x} \right)}_{=:\,S_x} + D_y \underbrace{\left( p * \frac{\partial g}{\partial y} \right)}_{=:\,S_y}, \qquad (A.8)$$

where g is a bivariate gaussian. Note that the masks for the partial derivatives can be calculated exactly, since g is analytically known.

3. Smooth the image first, but then use first differences as in the first approach.

The results reported in this article were obtained using the second method (a small sketch of it follows).
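The following sketch computes the seven tangent vectors of Table 4 with gaussian derivative filters (SciPy's gaussian_filter with a per-axis derivative order). The p[y, x] indexing and the centering of the coordinate origin are assumptions of this sketch, not details stated in the text:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def tangent_vectors(p, sigma=1.0):
    """Tangent vectors for the seven transformations of Table 4, using
    the second method of appendix A: convolve the image with the x- and
    y-derivatives of a gaussian kernel (equation A.8).
    p: 2-D gray-scale image indexed as p[y, x]. Returns an n x 7 matrix."""
    Sx = gaussian_filter(p, sigma, order=(0, 1))   # dp/dx (axis 1 = x)
    Sy = gaussian_filter(p, sigma, order=(1, 0))   # dp/dy (axis 0 = y)
    ys, xs = np.indices(p.shape, dtype=float)
    x = xs - p.shape[1] / 2.0                      # coordinates relative
    y = ys - p.shape[0] / 2.0                      # to the image center
    # Columns of Table 4: h-trans, v-trans, scale, axial, rotation,
    # diagonal, thickness.
    Dx = [np.ones_like(p), np.zeros_like(p), x, -x, -y, y, Sx]
    Dy = [np.zeros_like(p), np.ones_like(p), y, y, x, x, Sy]
    # t_p = Dx * Sx + Dy * Sy (equation A.5), one tangent vector per column.
    return np.stack([(dx * Sx + dy * Sy).ravel()
                     for dx, dy in zip(Dx, Dy)], axis=1)
```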
Appendix B: Calculation of the Derivatives of the Tangent Vectors

The derivation of the tangent vectors t_o of the network output o can be done with the help of equation A.5. First, we need the derivatives of S_x and S_y. These depend, of course, on which approach was used to calculate them. In the case of first differences in each direction, we get −1 and +1 for the respective neighboring points and zero for the others. The second method reduces to the calculation of the derivatives of a convolution; this is, in fact, the mask used. For the last method, both of the above techniques have to be combined. Note that the displacements D_x and D_y are independent of the output vector for affine transformations (the first six in Table 4). For the thickness transformation we have

$$\frac{\partial t_o}{\partial o} = \frac{\partial}{\partial o} \left( S_x^2 + S_y^2 \right) = 2 \left( S_x \frac{\partial S_x}{\partial o} + S_y \frac{\partial S_y}{\partial o} \right). \qquad (B.9)$$
Acknowledgments

Most of the work was done while the author was with the PARC laboratory at the University of Paris VI, France. The author would like to thank Y. Bengio, P. Haffner, M. Milgram, and P. Simard for helpful comments. This work was partly funded under grant HSP II D/92/01878 of the German Academic Exchange Service.

References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58.

Bianchini, M., Frasconi, P., & Gori, M. (1995). Learning in multilayered networks used as autoassociators. IEEE Transactions on Neural Networks, 6(2), 512–515.

Bouatour, H., & Julien, E. (1993). Reconnaissance des chiffres manuscrits. In JETPOSTE'93 (pp. 937–944). Nantes: Service de Recherche Technique de la Poste.

Bregler, C., & Omohundro, S. M. (1994). Surface learning with applications to lipreading. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 43–50). San Mateo, CA: Morgan Kaufmann.

Burel, G., Pottier, I., & Catros, J. (1992). Recognition of handwritten digits by image processing and neural networks. In International Joint Conference on Neural Networks (pp. 666–671).

Burr, D. J. (1981). A dynamical model for image registration. Computer Graphics and Image Processing, 15, 102–112.

Burr, D. J. (1983). Matching elastic templates. In O. Braddick & A. Sleigh (Eds.), Proceedings of Physical and Biological Processing of Images. Berlin: Springer-Verlag.

DeMers, D., & Cottrell, G. (1993). Non-linear dimensionality reduction. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 580–587). San Mateo, CA: Morgan Kaufmann.

Drucker, H., Schapire, R., & Simard, P. (1993). Boosting performance in neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 705–719.

Fogelman-Soulie, F., Lamy, B., & Viennet, E. (1993). Multi-modular neural network architectures for pattern recognition: Applications in optical character recognition and human face recognition. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 721–755.

Frosini, A., Gori, M., & Priami, P. (1996). A neural network–based model for paper currency recognition and verification. IEEE Transactions on Neural Networks, 7(6), 1482–1490.
Hastie, T., & Simard, P. (1995). Learning prototype models for tangent distance. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 999–1006). Cambridge, MA: MIT Press.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8(1), 65–74.

Hinton, G. E., Revow, M., & Dayan, P. (1995). Recognizing handwritten digits using mixtures of linear models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 1015–1022). Cambridge, MA: MIT Press.

Japkowicz, N., Myers, C., & Gluck, M. (1995). A novelty detection approach to classification. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 518–523).

Kambhatla, N., & Leen, T. K. (1994). Fast non-linear dimension reduction. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 152–159). San Mateo, CA: Morgan Kaufmann.

Lamy, B. (1995). Reconnaissance de caractères manuscrits par combinaison de modèles connexionnistes. Unpublished doctoral dissertation, Université Pierre et Marie Curie.

Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

Le Cun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., Drucker, H., Guyon, I., Müller, U., Säckinger, E., Simard, P., & Vapnik, V. (1995). Comparison of learning algorithms for handwritten digit recognition. In International Conference on Artificial Neural Networks (pp. 53–60). Berlin: Springer-Verlag.

Lemarié, B. (1993). Réseaux de régularisation pour la reconnaissance des chiffres manuscrits. In JETPOSTE'93 (pp. 541–549). Nantes: Service de Recherche Technique de la Poste.

Pomerleau, D. A. (1993). Input reconstruction reliability estimation. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 279–286). San Mateo, CA: Morgan Kaufmann.

Pottier, I., & Burel, G. (1993). Evaluation of a neural system for handwritten digits recognition. In JETPOSTE'93 (pp. 550–557). Nantes: Service de Recherche Technique de la Poste.

Revow, M., Williams, C. K. I., & Hinton, G. E. (1996). Using generative models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6), 592–606.

Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997). Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of the Fourteenth International Conference (pp. 322–330). San Mateo, CA: Morgan Kaufmann.

Schwenk, H., & Milgram, M. (1994). Structured diabolo-networks for handwritten character recognition. In M. Marinaro & P. Morasso (Eds.), International Conference on Artificial Neural Networks (pp. 985–988). Berlin: Springer-Verlag.
Schwenk, H., & Milgram, M. (1995a). Learning discriminant tangent models for handwritten character recognition. In International Conference on Artificial Neural Networks (pp. 585–590). Berlin: Springer-Verlag.

Schwenk, H., & Milgram, M. (1995b). Transformation invariant autoassociation with application to handwritten character recognition. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 991–998). Cambridge, MA: MIT Press.

Schwenk, H., & Milgram, M. (1996a). Constraint tangent distance for online character recognition. In International Conference on Pattern Recognition (pp. D 520–524). New York: IEEE Computer Society Press.

Schwenk, H., & Milgram, M. (1996b). Reconnaissance de codes postaux par réseaux diabolo. In Colloque National sur l'Écrit et le Document. Nantes: Service de Recherche Technique de la Poste.

Simard, P. Y. (1994). Efficient computation of complex distance metrics using hierarchical filtering. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 168–175). San Mateo, CA: Morgan Kaufmann.

Simard, P., Le Cun, Y., & Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 50–58). San Mateo, CA: Morgan Kaufmann.

Simard, P., Le Cun, Y., Denker, J., & Victorri, B. (1992). An efficient algorithm for learning invariances in adaptive classifiers. In International Conference on Pattern Recognition (pp. 651–655).

Sperduti, A., & Stork, D. G. (1995). A rapid graph-based method for arbitrary transformation-invariant pattern classification. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 665–672). Cambridge, MA: MIT Press.

Suzuki, S., & Ando, H. (1995). Unsupervised classification of 3D objects from 2D views. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 949–956). Cambridge, MA: MIT Press.

Tipping, M. E., & Bishop, C. M. (1997). Mixtures of probabilistic principal component analysers (Tech. Rep. No. NCRG/97/003). Neural Computing Research Group, Aston University.

Viennet, E. (1993). Architectures connexionnistes multi-modulaires, application à l'analyse de scène. Unpublished doctoral dissertation, Université de Paris-Sud, Centre d'Orsay.

Widrow, B. (1973). The "rubber-mask" technique-I. Pattern measurement and analysis. Pattern Recognition, 5, 175–197.

Williams, C. K. I., Revow, M. D., & Hinton, G. E. (1995). Using a neural net to instantiate a deformable model. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 965–972). Cambridge, MA: MIT Press.

Williams, C. K. I., Revow, M., & Hinton, G. E. (1997). Instantiating deformable models with a neural net. Computer Vision and Image Understanding, 68(1), 120–126.
Received October 31, 1997; accepted April 20, 1998.
LETTER
Communicated by John Platt
Online Learning from Finite Training Sets and Robustness to Input Bias Peter Sollich Department of Physics, University of Edinburgh, Edinburgh EH9 3JZ, U.K.
David Barber Real World Computing Partnership Theoretical Foundation, SNN, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
We analyze online gradient descent learning from finite training sets at noninfinitesimal learning rates η. Exact results are obtained for the time-dependent generalization error of a simple model system: a linear network with a large number of weights N, trained on p = αN examples. This allows us to study in detail the effects of finite training set size α on, for example, the optimal choice of learning rate η. We also compare online and offline learning, for respective optimal settings of η at given final learning time. Online learning turns out to be much more robust to input bias and actually outperforms offline learning when such bias is present; for unbiased inputs, online and offline learning perform almost equally well.

1 Introduction

The analysis of online (gradient descent) learning, one of the most common approaches to supervised learning found in the neural networks community, has recently been the focus of much attention. The characteristic feature of online learning is that the weights of a network ("student") are updated each time a new training example is presented, such that the error on this example is reduced. In offline learning, on the other hand, the total error on all examples in the training set is accumulated before a gradient-descent weight update is made. For a given training set and starting weights, offline learning is entirely deterministic. Online learning, on the other hand, is a stochastic process due to the random choice of training example (from the given training set) for each update; in fact, it can essentially be viewed as a noisy version of offline learning. The two are equivalent only in the limit where the learning rate η → 0 (see, e.g., Heskes & Kappen, 1991). For both online and offline learning, the main quantity of interest is normally the evolution of the generalization error: after a given number of weight updates, how well does the student approximate the input-output mapping ("teacher") underlying the training examples?

Neural Computation 10, 2201–2217 (1998) © 1998 Massachusetts Institute of Technology
Most analytical treatments of online learning assume either that the size of the training set is infinite or that the learning rate η is vanishingly small. Both restrictions are undesirable. In practice, most training sets are finite (online learning can also be used to learn teacher rules that vary in time; the assumption of an infinite set, or "stream," of training examples is then much more plausible, and in fact necessary for continued adaptation of the student, but we do not consider this case in the following), and noninfinitesimal values of η are needed to ensure that the learning process converges after a reasonable number of updates. General results have been derived for the difference between online and offline learning to first order in η, which apply to training sets of any size (see, e.g., Heskes & Kappen, 1991). However, these do not directly address the question of generalization performance. The most explicit analysis of the time evolution of the generalization error for finite training sets was provided by Krogh and Hertz (1992) for a scenario very similar to the one we consider below. Their η → 0 offline calculation will serve as a baseline for our work. For finite η, progress has been made in particular for so-called soft committee machine network architectures (see, e.g., Saad & Solla, 1995; Biehl & Schwarze, 1995), but only for the case of infinite training sets.

In this article, we give an exact analysis of online learning in a simple model system. Our aim is twofold: (1) to assess how the combination of noninfinitesimal learning rates η and finite training sets (containing α examples per weight) affects online learning, and (2) to compare the generalization performance of online and offline learning. A priori, one may expect online learning to perform worse due to its inherent randomness. We show that this disadvantage is actually negligible when online and offline learning are compared on an equal footing, that is, for their respective optimal learning rates. More important, we will see that online learning is much more robust to input bias than offline learning and actually performs better than the offline version in the case of biased inputs.

2 Model Definition

We consider training of a linear student network with input-output relation

$$y = \frac{1}{\sqrt{N}}\, w^T x.$$

Here x is an N-dimensional vector of real-valued inputs, y the single real output, and w the weight vector of the network; T denotes the transpose of a vector, and the factor 1/√N is introduced for convenience. In online learning, whenever a training example (x, y) is presented to the network, its weight vector is updated along the gradient of the squared error on this example (we consider only squared error here, which is probably the most commonly used error measure, and we restrict our analysis to "vanilla" gradient descent learning, excluding more sophisticated learning algorithms):
$$\Delta w = -\eta\, \nabla_w \frac{1}{2}\left( y - \frac{1}{\sqrt{N}} w^T x \right)^2 = \eta \left( \frac{1}{\sqrt{N}}\, y\, x - \frac{1}{N}\, x x^T w \right),$$

where η is the learning rate. We are primarily interested in the case of online learning from finite training sets, where for each update an example is randomly chosen from a given set {(x^μ, y^μ), μ = 1, ..., p} of p training examples. If example μ is chosen for update n, the weight vector is changed to

$$w_{n+1} = \left\{ 1 - \frac{\eta}{N} \left[ x^\mu (x^\mu)^T + \gamma \right] \right\} w_n + \eta\, \frac{1}{\sqrt{N}}\, y^\mu x^\mu \quad \text{(online)}. \qquad (2.1)$$

Here we have also included a weight decay γ. The update rule for offline learning is similar, but here the gradients for all p training examples are accumulated before a weight update is made:

$$w_{(r+1)p} = \left[ 1 - \eta(\lambda + A) \right] w_{rp} + \frac{\eta}{\sqrt{N}} \sum_\mu y^\mu x^\mu \quad \text{(offline)}. \qquad (2.2)$$

Here r is the number of offline weight updates. In order to compare online and offline learning at equal computational cost, we index the weight vectors for both cases by the number of gradient calculations, which is n = rp in the offline case. The matrix

$$A = \frac{1}{N} \sum_\mu x^\mu (x^\mu)^T$$

is the correlation matrix of the training inputs, and λ = γα is the weight decay rescaled by the number of examples per weight, α = p/N. We will generally use λ (rather than γ) to characterize the strength of the weight decay, for both online and offline learning. For simplicity, all student weights are assumed to be initially zero, that is, w_{n=0} = 0.

The main quantity of interest to us is the generalization error of the student and its evolution during learning. We assume that the training examples are generated by a linear "teacher," that is, y^μ = w_*^T x^μ / √N + ξ^μ, where ξ^μ is zero-mean additive noise of variance σ². The teacher weight vector is taken to be normalized to w_*² = N for simplicity. We first investigate the case of unbiased inputs (⟨x⟩ = 0), assuming that input vectors are sampled randomly from an isotropic distribution over the hypersphere x² = N (biased inputs will be considered in section 4). The generalization error, defined as the average of the squared error between student and teacher outputs for random inputs, is then

$$\epsilon_g = \frac{1}{2N} (w_n - w_*)^2 = \frac{1}{2N}\, v_n^2, \quad \text{where } v_n = w_n - w_*.$$
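For concreteness, here is a minimal Monte Carlo sketch of the online update rule, equation 2.1, in this student-teacher setup (all parameter values are illustrative, and the finite-N estimate is only an approximation to the N → ∞ theory developed below):

```python
import numpy as np

def simulate_online(N=50, alpha=2.0, eta=0.5, lam=1e-4, sigma2=0.1,
                    t_max=20.0, seed=0):
    """Simulate online learning (equation 2.1) and return the
    generalization error after t_max updates per weight."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    gamma = lam / alpha                                  # lambda = gamma * alpha
    w_star = rng.standard_normal(N)
    w_star *= np.sqrt(N) / np.linalg.norm(w_star)        # w_*^2 = N
    X = rng.standard_normal((p, N))
    X *= np.sqrt(N) / np.linalg.norm(X, axis=1, keepdims=True)  # x^2 = N
    y = X @ w_star / np.sqrt(N) + np.sqrt(sigma2) * rng.standard_normal(p)
    w = np.zeros(N)                                      # w_0 = 0
    for _ in range(int(t_max * N)):
        mu = rng.integers(p)                             # random example
        err = y[mu] - X[mu] @ w / np.sqrt(N)
        w = (1 - eta * gamma / N) * w + eta * err * X[mu] / np.sqrt(N)
    return np.sum((w - w_star) ** 2) / (2 * N)           # generalization error
```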
In order to make the scenario analytically tractable, we focus on the limit N → ∞ of a large number of input components and weights, taken at a constant number of examples per weight, α = p/N, and updates per weight ("learning time") t = n/N. In this limit, the generalization error ε_g(t) becomes self-averaging (see, however, section 4) and can be calculated by averaging both over the random selection of examples from a given training set and over all training sets. Our results can be straightforwardly extended to the case of perceptron teachers with a nonlinear transfer function, as in Sollich (1995).

3 Unbiased Inputs

3.1 Outline of Calculation. We begin by deriving from the online learning weight update (see equation 2.1) an update equation for the selection average of the generalization error (i.e., its average with respect to the random choice of training examples for each update, denoted generically by ⟨...⟩). In fact, it will turn out to be useful to consider a slightly generalized version of the generalization error, ε_n = (1/2N) v_n^T M v_n, with M an arbitrary N × N matrix. To get the update equation for ⟨ε_n⟩, we first rewrite equation 2.1 in terms of v_n, the difference between student and teacher weight vectors:

$$v_{n+1} = \left\{ 1 - \eta \left[ \frac{1}{N}\, x^\mu (x^\mu)^T + \frac{\lambda}{p} \right] \right\} v_n + \eta\, \frac{1}{\sqrt{N}}\, \xi^\mu x^\mu - \frac{\eta\lambda}{p}\, w_*. \qquad (3.1)$$
This can now be multiplied by its transpose, with the matrix M inserted, and the selection average for update n performed. Discarding terms that become negligible in the large-N limit, one finds after a little algebra

$$N \left( \langle \epsilon_{n+1} \rangle - \langle \epsilon_n \rangle \right) = \frac{\tilde\eta}{N} (b - \lambda w_*)^T M \langle v_n \rangle - \tilde\eta \left\langle \frac{1}{N}\, v_n^T \left[ \lambda M + \frac{1}{2}(AM + MA) \right] v_n \right\rangle + \frac{\tilde\eta^2 \alpha}{N} \sum_\mu \frac{1}{N} (x^\mu)^T M x^\mu \left\{ \frac{1}{2} (\xi^\mu)^2 - \xi^\mu \frac{1}{\sqrt{N}} (x^\mu)^T \langle v_n \rangle + \frac{1}{2N} \left\langle v_n^T x^\mu (x^\mu)^T v_n \right\rangle \right\}, \qquad (3.2)$$
where η̃ = η/α is a rescaled learning rate and b = (1/√N) Σ_μ ξ^μ x^μ. We now want to transform equation 3.2 into a closed dynamical equation for ⟨ε_n⟩. This means that all selection averages need to be either eliminated or reduced to averages of the same form as ⟨ε_n⟩.
For the two terms linear in ⟨v_n⟩, this is straightforward. The selection average of equation 2.1 yields directly

$$N \left( \langle v_{n+1} \rangle - \langle v_n \rangle \right) = \tilde\eta \left[ -(\lambda + A) \langle v_n \rangle + b - \lambda w_* \right].$$

Starting from v_0 = −w_*, this can easily be solved, with the result (for N → ∞)

$$\langle v_n \rangle = (\lambda + A)^{-1} \left\{ b - \lambda w_* - \exp\left[ -\tilde\eta\, t\, (\lambda + A) \right] (b + A w_*) \right\}, \qquad (3.3)$$
from which the selection average has now disappeared. Learning rate and learning time enter only through the combination τ = η̃t; this rescaled time will be useful later.

In equation 3.2, the remaining terms quadratic in v_n now present the main problem. The second term on the right-hand side shows that the evolution of ε_g = ε_n(M = 1) depends on ε_n(M = A), which in turn depends on ε_n(M = A²), and so on, yielding an infinite hierarchy of order parameters. This problem was solved in Sollich and Barber (1997a) by introducing an auxiliary parameter h through M = exp(hA); all order parameters ε_n(M = A^m), m = 1, 2, ..., can then be obtained by differentiating

$$\epsilon_n(h) = \frac{1}{2N}\, v_n^T \exp(hA)\, v_n.$$

Here we choose a different route, which is somewhat more transparent and also more easily adapted to the case of biased inputs, to be considered later. The main idea is to decompose the evolution of v_n into components defined by eigenvectors of the input correlation matrix A. (This is equivalent to changing to a coordinate system in which A is diagonal and then considering the components of v_n separately.) More precisely, let us order the N eigenvalues of A in ascending order and split them into K equal blocks, labeled by κ = 1, ..., K, each containing N/K eigenvalues. Let P^κ be the projector matrices onto the spaces spanned by the eigenvectors of each block. Then v_n = Σ_κ P^κ v_n; likewise, the generalization error is decomposed as

$$\epsilon_g = \frac{1}{K} \sum_\kappa \epsilon_n^\kappa, \qquad \epsilon_n^\kappa = \frac{K}{2N}\, v_n^T P^\kappa v_n.$$

Each of the generalization error components ε_n^κ obeys the update equation 3.2 with M = K P^κ. But these equations now become closed, because A P^κ = P^κ A ≈ a^κ P^κ, where a^κ is an eigenvalue from the κth block (formally, this approximation becomes exact in the limit K → ∞, where the spread of eigenvalues within each block tends to zero). This immediately reduces the second term on the right-hand side of equation 3.2 to −2η̃(λ + a^κ)⟨ε_n^κ⟩. Only the very last term of equation 3.2 now remains to be brought into a similar form. This is achieved by noting that the factors c_κ^μ = (K/N)(x^μ)^T P^κ x^μ are "within-sample self-averaging" (Sollich & Barber, 1997a).
Up to fluctuations that vanish as O(N^{−1/2}) for large N, all c_κ^μ are equal to each other and hence to the training set ("sample") average

$$c_\kappa = \frac{1}{p} \sum_\mu c_\kappa^\mu = \frac{K}{\alpha N} \operatorname{tr} A P^\kappa \approx \frac{a^\kappa}{\alpha}.$$
The last approximation again becomes exact for K → ∞ (the large-K limit needs to be taken after the limit N → ∞ for "within-sample self-averaging" to hold; this is why one cannot take K = N from the outset). The factors c_κ = a^κ/α can therefore be taken out of the sum over μ in equation 3.2, leaving the selection average

$$\sum_\mu \frac{1}{2N} \left\langle \frac{1}{N}\, v_n^T x^\mu (x^\mu)^T v_n \right\rangle = \frac{1}{2N} \left\langle v_n^T A\, v_n \right\rangle \approx \frac{1}{K} \sum_\kappa a^\kappa \left\langle \epsilon_n^\kappa \right\rangle.$$

We now have all the ingredients to write equation 3.2 as a closed system of evolution equations for the ε_n^κ. In the large-N limit, the change N(⟨ε_{n+1}^κ⟩ − ⟨ε_n^κ⟩) due to an update becomes the time derivative ∂_t ε^κ, and ⟨ε_n^κ⟩ → ε^κ(t). Using the rescaled time τ = η̃t introduced above, one then has

$$\left[ \partial_\tau + 2(\lambda + a^\kappa) \right] \epsilon^\kappa(\tau) = V^\kappa(\tau) + \tilde\eta\, W^\kappa(\tau) + \tilde\eta\, a^\kappa \frac{1}{K} \sum_{\kappa'} a^{\kappa'} \epsilon^{\kappa'}(\tau). \qquad (3.4)$$
Here the functions V^κ(τ) and W^κ(τ) are

$$V^\kappa = \frac{K}{N} (b - \lambda w_*)^T P^\kappa \langle v_n \rangle, \qquad W^\kappa = a^\kappa \left[ \frac{1}{2N} \sum_\mu (\xi^\mu)^2 - \frac{1}{N}\, b^T \langle v_n \rangle \right],$$
with ⟨v_n⟩ given by equation 3.3. Having derived equation 3.4, the rest of the calculation is fairly straightforward. Equation 3.4 is formally solved using Laplace transforms with respect to τ, for example, ε̂^κ(z) = ∫_0^∞ dτ exp(−zτ) ε^κ(τ):

$$\hat\epsilon^\kappa(z) = \frac{1}{z + 2(\lambda + a^\kappa)} \left[ \epsilon^\kappa(0) + \hat V^\kappa(z) + \tilde\eta\, \hat W^\kappa(z) + \tilde\eta\, a^\kappa \frac{1}{K} \sum_{\kappa'} a^{\kappa'} \hat\epsilon^{\kappa'}(z) \right], \qquad (3.5)$$
with the initial condition ε^κ(0) = (K/2N) w_*^T P^κ w_*. Multiplying by a^κ and summing over κ then gives a self-consistency equation for K^{−1} Σ_κ a^κ ε̂^κ(z), which is easily solved.
Inserting the solution into equation 3.5 then gives an explicit expression for ε̂^κ(z) and hence for the Laplace transform of the generalization error, ε̂_g(z) = K^{−1} Σ_κ ε̂^κ(z). As a final step, the average over all training sets (i.e., training inputs x^μ and output noises ξ^μ) is then carried out. In the end, everything can be written in terms of averages over the known eigenvalue spectrum (Hertz, Krogh, & Thorbergsson, 1989; Sollich, 1994) of the input correlation matrix A. The explicit form of the final result (Sollich & Barber, 1997a) is rather cumbersome; we omit it here and note only the relatively simple dependence on η:

$$\hat\epsilon_g(z) = \hat\epsilon_0(z) + \frac{\eta\, \hat\epsilon_1(z)}{1 - \eta\, \hat\epsilon_2(z)}. \qquad (3.6)$$
The functions ε̂_i(z) (i = 0, ..., 2) depend on α, σ², and λ (and, of course, z) but are independent of η. The teacher weights do not appear explicitly: because of the isotropy of the input distribution, only the length of the teacher weight vector matters once an average over training sets has been taken, and this has already been fixed to w_*² = N.

The calculation of the generalization error for offline learning is much simpler than that for the online case, due to the absence of the selection average. In fact, the offline weight update (see equation 2.2) can be iterated directly to yield

$$v_{rp} = (\lambda + A)^{-1} \left\{ b - \lambda w_* - \left[ 1 - \eta(\lambda + A) \right]^r (b + A w_*) \right\}. \qquad (3.7)$$
Multiplying this by its transpose gives the generalization error directly, and the average over training sets can then be carried out in the usual fashion (see, e.g., Hertz et al., 1989). As expected on general grounds, for η → 0 (and only then) one obtains the same result as for online learning, corresponding to the term ε̂_0(z) in equation 3.6.

3.2 Discussion. We now briefly highlight some features of our exact result (see equation 3.6) for the generalization error achieved by online learning; a somewhat more detailed exposition can be found in Sollich and Barber (1997b). We discuss the asymptotic generalization error ε_∞, the convergence speed for large learning times, and the behavior at small t; finally, we compare online and offline learning. For numerical evaluations, we generally take σ² = 0.1, corresponding to a sizable noise-to-signal ratio of √0.1 ≈ 0.32.

The asymptotic generalization error is found directly from equation 3.6 using ε_∞ = ε_g(t → ∞) = lim_{z→0} z ε̂_g(z). As expected, it coincides with the offline result (which is independent of η) only for η = 0; as η increases from zero, it increases monotonically. Reassuringly, our calculation reproduces existing O(η) results for this increase (Heskes & Kappen, 1991). In Figure 1 we plot ε_∞ as a function of η and λ for α = 0.5, 1, 2.
[Figure 1: Asymptotic generalization error ε_∞ versus η and λ, for α = 0.5, 1, and 2 (one panel each); σ² = 0.1.]
We observe that it is minimal for λ = σ² and η = 0, as expected from corresponding results for offline learning (Krogh & Hertz, 1992). (The optimal value of the unscaled weight decay decreases with α as γ = σ²/α, because for large training sets there is less need to counteract noise in the training data by using a large weight decay.) We also read off that for fixed λ, ε_∞ is an increasing function of η: the larger η, the more the weight updates tend to overshoot the minimum of the (total, i.e., offline) training error. This causes a diffusive motion of the weights around their average asymptotic values (Heskes & Kappen, 1991), which increases ε_∞. In the absence of weight decay (λ = 0) and for α < 1, however, ε_∞ is independent of η. In this case, the training data can be fitted perfectly; every term in the total sum-of-squares training error is then zero, and online learning does not lead to weight diffusion because all individual updates vanish.

In general, the relative increase ε_∞(η)/ε_∞(η = 0) − 1 due to nonzero η depends significantly on α. For η = 1 and α = 0.5, for example, this increase is smaller than 6% for all λ (at σ² = 0.1), and for α = 1 it is at most 13%. This means that in cases where training data are limited (p ≈ N), η can be chosen fairly large in order to optimize learning speed without seriously affecting the asymptotic generalization error. In the large-α limit, on the other hand, one finds ε_∞ = (σ²/2)[1/α + η/(2 − η)]. The relative increase over the value at η = 0 therefore grows linearly with α; already for α = 2, increases of around 50% can occur for η = 1.

Figure 1 also shows that ε_∞ diverges as η approaches a critical learning rate η_c. As η → η_c, the "overshoot" of the weight update steps becomes so large that the weights eventually diverge. From the Laplace transform (see equation 3.6), one finds that η_c is determined by η_c ε̂_2(z = 0) = 1; it is a function of α and λ only. As shown in Figure 2, η_c increases with λ. This is reasonable, as the weight decay reduces the length of the weight vector at each update, counteracting potential weight divergences.
Figure 2: (a) Sketch of the definitions of ηmin (minimal learning rate for the slow mode), ηx (crossover to slow-mode-dominated convergence), and ηc (maximal, "critical" learning rate at which convergence still occurs). (b–d) Dependence of ηmin, ηx, and ηc on α, for λ = 0, 0.1, and 1.
In the small and large α limits, one has ηc = 2(1 + λ) and ηc = 2(1 + λ/α), respectively. For constant λ, ηc therefore decreases⁵ with α (see Figure 2).

We now turn to the large t behavior of the generalization error εg(t). For small η, the most slowly decaying contribution to εg(t)—the slowest "mode"—varies as exp(−ct), its decay constant c = η[λ + (√α − 1)²]/α scaling linearly with η, the size of the weight updates, as expected (see Figure 2a). For larger η, the picture changes due to a new slow mode arising from the denominator of equation 3.6. Interestingly, this mode exists only for η above a finite threshold ηmin = 2/(α^{1/2} + α^{−1/2} − 1). For finite α, it could therefore not have been predicted from a small η expansion of εg(t). Its decay constant c_slow decreases to zero as η → ηc, and crosses that of the normal mode at ηx(α, λ) (see Figure 2a). For η > ηx, the slow mode therefore determines the convergence speed for large t, and fastest convergence is obtained for η = ηx.

⁵ Conversely, for constant γ, ηc increases with α from 2(1 + γα) to 2(1 + γ). For large α, the weight decay is applied more often between repeat presentations of a training example that would otherwise cause the weights to diverge.
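These learning-rate scales are easy to evaluate; the short script below (our illustration, not the authors' code; the formulas are taken verbatim from the text) prints ηmin and the two limiting expressions for ηc:

```python
# Evaluate the learning-rate scales quoted above: eta_min for the slow mode
# and the small-/large-alpha limits of eta_c. Formulas are from the text;
# the script itself is only an illustration.
import numpy as np

def eta_min(alpha):
    # eta_min = 2 / (alpha^{1/2} + alpha^{-1/2} - 1)
    return 2.0 / (np.sqrt(alpha) + 1.0 / np.sqrt(alpha) - 1.0)

lam = 0.1
for alpha in (0.5, 1.0, 2.0, 10.0):
    print(f"alpha = {alpha:5.1f}: eta_min = {eta_min(alpha):.3f}, "
          f"eta_c -> {2 * (1 + lam):.3f} (small alpha), "
          f"{2 * (1 + lam / alpha):.3f} (large alpha)")
```

At α = 1 this gives ηmin = 2, consistent with the observation that ηx ≈ ηc near α = 1.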
Figure 3: εg versus t for different η (legend values: (a) η = 1.95, 1.90, 1.80, 1.00, 0.50; (b) η = 1.75, 1.50, 0.75, 0.50, 0.25). Simulations for N = 50 are shown by symbols (standard errors less than symbol sizes). λ = 10⁻⁴, σ² = 0.1. (a) α = 0.7. (b) α = 5.
However, it may still be advantageous to use lower values of η in order to lower the asymptotic generalization error (see below); values of η > ηx would deteriorate both convergence speed and asymptotic performance. Figures 2b–d show the dependence of ηmin, ηx, and ηc on α and λ. For λ not too large, ηx has a maximum at α ≈ 1 (where ηx ≈ ηc), while decaying to ηx ≈ ½ηc for larger α. This can be explained in terms of the anisotropy of the total training error surface (Sollich & Barber, 1997a), which is strongest for α = 1 and λ → 0.

Consider now the small t behavior of εg(t). Figure 3 illustrates the dependence of εg(t) on η; comparison with simulation results for N = 50 clearly confirms our calculations and demonstrates that finite N effects are not significant even for such fairly small N. For α = 0.7 (see Figure 3a), we see that nonzero η acts as effective update noise, eliminating the minimum in εg(t), which corresponds to overtraining (Krogh & Hertz, 1992). ε∞ is also seen to be essentially independent of η, as predicted for the small value of λ = 10⁻⁴ chosen. For α = 5, Figure 3b clearly shows the increase of ε∞ with η. It also illustrates how convergence first speeds up as η is increased from zero and then slows down again as ηc ≈ 2 is approached.

Above, we saw that the asymptotic generalization error ε∞ is minimal for η = 0. Figure 4 shows what happens if we instead minimize εg(t) for a given final learning time t, corresponding to a fixed amount of computational effort for training the network. As t increases, the optimal η decreases toward zero, as required by the trade-off between asymptotic performance and convergence speed. For large t, the functional form of this decay is η_opt = (a + b ln t)/t with t-independent coefficients a and b (Sollich & Barber, 1997a).

We now compare the performance of online learning to that of offline learning as calculated from equation 3.7.
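Simulations of the kind shown in Figure 3 require only a few lines of code. Below is a rough sketch (our code; the 1/√N output scaling and the way the weight decay enters the update are our assumptions about the paper's conventions, so the numbers need not match the figure exactly):

```python
# Rough sketch of online learning in the linear student-teacher model:
# stochastic gradient descent on noisy teacher outputs, with weight decay.
import numpy as np

rng = np.random.default_rng(0)
N, alpha, sigma2, lam, eta = 50, 5.0, 0.1, 1e-4, 1.0
p = int(alpha * N)

w_star = rng.standard_normal(N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)        # teacher length |w*|^2 = N
X = rng.standard_normal((p, N))                      # training inputs, |x|^2 ~ N
y = X @ w_star / np.sqrt(N) + np.sqrt(sigma2) * rng.standard_normal(p)

w = np.zeros(N)                                      # student starts at zero
for n in range(20 * N):                              # learning time t = n / N
    mu = rng.integers(p)                             # pick a random example
    err = y[mu] - X[mu] @ w / np.sqrt(N)
    w += eta * (err * X[mu] / np.sqrt(N) - lam * w / N)
    if n % (5 * N) == 0:
        eps_g = np.sum((w - w_star) ** 2) / (2 * N)  # eps_g = v^2 / 2N
        print(f"t = {n / N:5.1f}   eps_g = {eps_g:.4f}")
```

With w = 0 initially, εg(0) = |w*|²/2N = 0.5, as in Figure 3; the curve then decays toward a noise-determined asymptote.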
Figure 4: (a) Optimal learning rate η versus final learning time t for online (bold) and offline learning (thin lines), and (b) resulting generalization error εg. α = 1, σ² = 0.1, λ = 0.001, 0.1, 0.2. Although we plot offline results as continuous lines to avoid visual clutter, they are actually defined only at discrete values of the learning time, t = rα, with r the number of offline weight updates.
(The number of gradient calculations required for r offline weight updates is n = rp, corresponding to a learning time t = n/N = rα; the generalization error εg(t) is therefore defined only for learning times t that are integer multiples of α.) To compare online and offline learning on an equal footing, we again consider optimized values of η for given final learning time t. Figure 4b shows that the performance loss from using online instead of offline learning is actually negligible. This may seem surprising given the stochasticity of weight updates in online learning, in particular for small t. However, Figure 4a shows that online learning can make up for this by allowing larger values of η to be used.

4 Biased Inputs

4.1 Modifications to Calculation. We now investigate how online and offline learning are affected by input bias ⟨x⟩ = x̄ ≠ 0. As a simple scenario of this kind, consider the case where the deviations Δx = x − x̄ of the inputs from their average are still distributed isotropically over a hypersphere. We choose the radius R of this hypersphere such that the average value of x² is the same (N) as for the unbiased case, that is, R² = N(1 − m²), where m² = x̄²/N measures the size of the bias. The generalization error (the squared deviation between student and teacher outputs, averaged over all inputs) now has two components,
$$\epsilon_g = \frac{1}{2N}\left[(\bar{x}^T v_n)^2 + (1 - m^2)\, v_n^2\right]. \tag{4.1}$$
As before, we consider a teacher with weight vector of squared length w∗² = N. In the presence of input bias, however, we also need to specify the average teacher output ȳ = x̄ᵀw∗/√N. This parameter is not constrained by our other assumptions; however, to limit the number of free parameters in the model, we choose it to have its typical root-mean-squared value when the directions of w∗ and x̄ are uncorrelated: ȳ² = m². As for the case of unbiased inputs, the evolution of the generalization error is largely determined by the eigenvalue spectrum of the input correlation matrix A. This has been determined by a number of authors (LeCun, Kanter, & Solla, 1991; Wendemuth, Opper, & Kinzel, 1993; Halkjær & Winther, 1997) and shows the following features: There is a "normal" part of the spectrum, with eigenvalues that tend to finite values as N → ∞; the eigenvalues in this part of the spectrum are identical to those for the unbiased input case, except for a rescaling by the factor (1 − m²). Additionally, however, there is one isolated eigenvalue aN = Nαm², which is proportional to N and exists only in the presence of input bias. Intuitively, this corresponds to the fact that the component of the student weights along the direction of x̄ is much more strongly determined by the training data, because all input vectors have a component along x̄. Not surprisingly, the eigenvector corresponding to aN is along the direction of x̄.⁶

We can see immediately that input bias has a drastic effect on offline learning by considering equation 3.7. For the offline learning process to converge, the product of η and the largest eigenvalue of λ + A must be less than two. In the presence of input bias, this gives the condition η < 2/(Nαm²) (neglecting λ, which gives a negligible correction for N → ∞). The maximal learning rate is therefore drastically reduced from order unity to O(N⁻¹). A little reflection shows that only the first contribution to the generalization error (see equation 4.1) decays at finite learning times. Carrying out the average over training sets, one finds
$$\epsilon_g(t = r\alpha) = \frac{1}{2}\, m^2 (1 - N\eta\alpha m^2)^{2r} + \frac{1}{2}(1 - m^2). \tag{4.2}$$
The second contribution would decay only for learning times of O(N), which are inaccessibly long in the limit N → ∞ that we consider. Online learning, on the other hand, is not plagued by the same problem, as we now show. Consider the first contribution to the generalization error,
⁶ In fact, there is a small angle between this eigenvector and x̄, which, however, decreases as O((αN)^{−1/2}) as N grows large. LeCun et al. (1991) claimed that this angle is exactly zero; however, their argument cannot be quite correct, because it would also entail that A has only two different eigenvalues (whereas in reality it has a continuous spread of eigenvalues for any finite α).
which we write as εg,1 = ½δn², with
$$\delta_n = \frac{1}{\sqrt{N}}\,\bar{x}^T v_n.$$
From the update equation (2.1) one derives that
$$\delta_{n+1} = (1 - \eta m^2)\,\delta_n + \eta\,\xi^\mu m^2, \tag{4.3}$$
up to correction terms that vanish for N → ∞. Starting from the initial value δ0 = −ȳ, this can easily be iterated and the selection average carried out to give
$$\left\langle \delta_n^2 \right\rangle = \bar{y}^2 (1 - \eta m^2)^{2n} + \eta^2 m^4\, \frac{1 - (1 - \eta m^2)^{2n}}{1 - (1 - \eta m^2)^2}\;\frac{1}{p}\sum_\mu (\xi^\mu)^2,$$
up to O(N⁻¹) corrections; an average over training sets then gives p⁻¹Σ_μ(ξ^μ)² → σ². For n = t = 0, only the first term is nonzero. On the other hand, for nonzero learning time t (and values of the learning rate such that convergence occurs, i.e., 0 < η < 2/m²), only the second term survives, because n = Nt → ∞ for N → ∞. We therefore have for the average value of the first contribution to the generalization error:
$$\left\langle \epsilon_{g,1}(t = 0) \right\rangle = \bar{y}^2 = m^2, \qquad \left\langle \epsilon_{g,1}(t > 0) \right\rangle = \frac{1}{2}\,\sigma^2\,\frac{\eta m^2}{2 - \eta m^2}. \tag{4.4}$$
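The asymptotic value for t > 0 can be checked directly by iterating the scalar recursion (4.3); here is a minimal sketch (parameter values and all names are ours; agreement holds up to finite-p fluctuations in p⁻¹Σ(ξ^μ)²):

```python
# Iterate the recursion delta_{n+1} = (1 - eta*m2)*delta_n + eta*xi*m2 and
# compare the long-time average of eps_g1 = delta^2 / 2 with equation 4.4.
import numpy as np

rng = np.random.default_rng(1)
eta, m2, sigma2, p = 0.5, 0.3, 0.1, 500
xi = np.sqrt(sigma2) * rng.standard_normal(p)    # fixed noise values xi^mu

delta = -np.sqrt(m2)                             # delta_0 = -ybar, ybar^2 = m^2
acc, count = 0.0, 0
for n in range(200_000):
    delta = (1 - eta * m2) * delta + eta * xi[rng.integers(p)] * m2
    if n > 1_000:                                # discard the transient
        acc += delta ** 2
        count += 1

print("simulated <eps_g1> =", 0.5 * acc / count)
print("predicted <eps_g1> =", 0.5 * sigma2 * eta * m2 / (2 - eta * m2))
```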
The discontinuous change at t = 0 reflects the fact that ⟨δn²⟩ changes from its initial to its asymptotic value after a number of updates n that does not increase with system size N.⁷

We still have to calculate the evolution of the second component εg,2 = (1 − m²)vn²/(2N) of the generalization error (see equation 4.1) for the case of online learning. At first sight, the O(N) eigenvalue of A appears to complicate this task. However, the component of vn along x̄, the corresponding eigenvector, contributes only negligibly to εg,2:
$$\frac{1}{2N}\left(\frac{1}{|\bar{x}|}\,\bar{x}^T v_n\right)^2 = \frac{1}{N m^2}\,\epsilon_{g,1} = O(N^{-1}).$$
Thus, only components of vn along directions corresponding to the O(1) eigenvalues of A need to be considered; their evolution can be calculated

⁷ Note also that we have written the selection average in equation 4.4 explicitly, because εg,1 is no longer self-averaging. Each weight update (see equation 4.3) causes a change in δn and εg,1 of order unity, and hence the fluctuations of εg,1 remain nonzero even for N → ∞.
Figure 5: Critical learning rate ηc for online learning versus input bias m², for weight decay λ = 0, 0.1, and 1 (one panel each) and training set size α = 0, 1, . . . , 5 (bottom to top). Compare Figure 2 for the case of unbiased inputs.
exactly as in section 3. The only change is the rescaled eigenvalue spectrum of A; in fact, one finds that εg,2/(1 − m²) is exactly the same as εg = vn²/2N for unbiased inputs of length x² = N(1 − m²). It is easily checked that this change of effective input vector length can be effected by replacing λ, σ², and η in the expressions for εg by the rescaled values λ′ = λ/(1 − m²), (σ′)² = σ²/(1 − m²), and η′ = η(1 − m²), and so no new calculations need to be carried out.

4.2 Discussion. We have already mentioned that the critical learning rate for offline learning is drastically reduced to ηc = 2/(Nαm²) by the presence of input bias. For online learning, ηc is affected in two ways: first, through the "rescaling" of η and λ explained above for the calculation of εg,2, and, second, through the presence of the term εg,1; equation 4.4 shows that for the latter to remain finite, one requires ηc < 2/m². Figure 5 illustrates the resulting variation of ηc with m² for several values of α and λ. As the bias increases from 0, the critical learning rate first increases until it reaches the value 2/m²; from that point onward, it follows the curve ηc = 2/m² (independently of α and λ) until it reaches ηc = 2 at⁸ m² = 1. In marked contrast to the case of offline learning, the critical learning rate ηc for online learning therefore never decreases below values of order unity, and it can actually be increased by the presence of input bias.

The different effects of input bias on the critical learning rates of online and offline learning are also reflected in the generalization performance for optimal values of η at given final learning time. For offline learning, equation 4.2 shows that the optimal η = 1/(Nαm²), whatever the (integer) value of
⁸ This is the maximal bias in our scenario, since ⟨x²⟩ = N > x̄² = Nm².
Figure 6: (a) Optimal learning rate η versus final learning time t for online learning in the presence of input bias m² = 0, 0.2, 0.4 (α = 1, σ² = 0.1, λ = 0.001). (b) Resulting generalization error εg, with results for offline learning shown for comparison (thin lines). Note that while offline learning performs (marginally) better than online learning for unbiased inputs (m² = 0), it is far worse as soon as the input bias is nonzero.
r = t/α. This reduces the first contribution to the offline generalization error to zero for any r ≥ 1, but still leaves a nonzero term εg = (1 − m²)/2 (which, as explained above, would start to decay only for extremely long learning times t = O(N)). For online learning, on the other hand, the optimal learning rate remains of order one even in the presence of input bias. This was to be expected from the analogous results for the critical learning rate and can be seen explicitly in Figure 6a. Figure 6b shows the resulting generalization error, which is seen to decrease as the input bias increases. Online learning therefore successfully exploits the presence of the input bias to achieve better generalization performance.⁹ This contrasts markedly with the case of offline learning, where generalization performance (at finite learning times t) deteriorates as soon as an input bias is present.¹⁰
⁹ Wendemuth et al. (1993) view the input bias as "additional information," which leads to improved generalization. In our case, the same conclusion can be arrived at by considering the extreme limit of maximal bias, m² = 1. In this case, the distribution of input vectors collapses to the point x = x̄, and so perfect generalization is obtained after only one training example has been presented. (For noisy training outputs, more examples would be needed; the generalization error then decays roughly as εg ∼ σ²/n, which, however, still gives perfect generalization εg = 0 for any finite learning time t.)
¹⁰ For biased inputs, we found an offline generalization error of εg = (1 − m²)/2 for optimally chosen η, which is arbitrarily close to ½ for m² sufficiently small. For unbiased inputs, on the other hand, εg for optimal η is generally significantly smaller than a half, as illustrated by Figure 4, for example; it can never be greater than ½, since otherwise η = 0 would give a lower εg.
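The contrast described in this section is easy to see in a small simulation; below is a rough sketch (our code, scaling conventions, and parameter choices; the bias vector x̄ is an arbitrary illustration):

```python
# Online learning with biased inputs: a learning rate of order one still
# converges, whereas offline learning would need eta < 2/(N*alpha*m^2)
# (here 0.05). Scalings are our guess at the paper's conventions.
import numpy as np

rng = np.random.default_rng(2)
N, alpha, sigma2, lam, eta, m2 = 200, 1.0, 0.1, 0.001, 0.5, 0.2
p = int(alpha * N)

xbar = np.full(N, np.sqrt(m2))                      # bias vector, xbar^2 = N*m2
w_star = rng.standard_normal(N)
w_star *= np.sqrt(N) / np.linalg.norm(w_star)       # teacher, |w*|^2 = N

dX = np.sqrt(1 - m2) * rng.standard_normal((p, N))  # isotropic deviations
X = xbar + dX
y = X @ w_star / np.sqrt(N) + np.sqrt(sigma2) * rng.standard_normal(p)

w = np.zeros(N)
for n in range(20 * N):                             # learning time t = 20
    mu = rng.integers(p)
    err = y[mu] - X[mu] @ w / np.sqrt(N)
    w += eta * (err * X[mu] / np.sqrt(N) - lam * w / N)

v = w - w_star
eps_g = ((xbar @ v) ** 2 + (1 - m2) * (v @ v)) / (2 * N)   # equation 4.1
print("eps_g at t = 20:", eps_g)                    # well below eps_g(0) = 0.5
```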
5 Conclusions

In this article, we have obtained exact results for the generalization error achieved by online learning from finite training sets at noninfinitesimal learning rates. These apply directly only to the simple linear model that we have considered, but they also exhibit generic features that we expect to be of general relevance. For example, the calculated dependence on η of the asymptotic generalization error ε∞ and of the convergence speed shows that, in general, sizable values of η can be used for training sets of limited size (α ≈ 1), while for larger α it is important to keep learning rates small.

More important from a practical point of view is probably the explicit comparison between online and offline learning that our results allow us to make. To make this comparison fair, we considered the generalization performance of both algorithms for the respective optimal values of the learning rate at a given final learning time t. For unbiased inputs, we found in this way that online learning performs only marginally worse than offline learning, whereas it is in fact vastly superior as soon as there is any kind of input bias. This strongly suggests that online learning should generally be preferred over offline learning in problems where biased inputs cannot be excluded a priori.

In the future, we hope to extend our analysis to dynamic (t-dependent) optimization of η; based on the results of Luo (1991) and Heskes and Wiegerinck (1996), however, one may suspect that performance improvements over the optimal fixed η will be small. More important, more complicated network architectures need to be studied, in which the crucial question of local minima can be addressed. We speculate that the superiority of online learning may be even more pronounced there, due to its stochastic weight updates, which should facilitate the escape from local minima. We have done some exploratory work along those lines for soft-committee machine architectures, using a fairly simple approximation scheme (Sollich & Barber, 1998). Considerable challenges remain, however, and there is much scope for future work in this direction.

References

Biehl, M., & Schwarze, H. (1995). Learning by online gradient descent. Journal of Physics A, 28, 643–656.
Halkjær, S., & Winther, O. (1997). The effect of correlated input data on the dynamics of learning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 169–175). Cambridge, MA: MIT Press.
Hertz, J. A., Krogh, A., & Thorbergsson, G. I. (1989). Phase transitions in simple learning. Journal of Physics A, 22, 2133–2150.
Heskes, T., & Kappen, B. (1991). Learning processes in neural networks. Physical Review A, 44, 2718–2762.
Heskes, T., & Wiegerinck, W. (1996). A theoretical comparison of batch-mode, online, cyclic, and almost-cyclic learning. IEEE Transactions on Neural Networks, 7(4), 919–925.
Krogh, A., & Hertz, J. A. (1992). Generalization in a linear perceptron in the presence of noise. Journal of Physics A, 25, 1135–1147.
LeCun, Y., Kanter, I., & Solla, S. A. (1991). Eigenvalues of covariance matrices—application to neural-network learning. Physical Review Letters, 66(18), 2396–2399.
Luo, Z. (1991). On the convergence of the LMS algorithm with adaptive learning rate for linear feedforward networks. Neural Computation, 3(2), 226–245.
Saad, D., & Solla, S. A. (1995). Online learning in soft committee machines. Physical Review E, 52, 4225.
Sollich, P. (1994). Finite size effects in learning and generalization in linear perceptrons. Journal of Physics A, 27, 7771–7784.
Sollich, P. (1995). Learning unrealizable tasks from minimum entropy queries. Journal of Physics A, 28, 6125–6142.
Sollich, P., & Barber, D. (1997a). On-line learning from finite training sets. Europhysics Letters, 38, 477–482.
Sollich, P., & Barber, D. (1997b). Online learning from finite training sets: An analytical case study. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 274–280). Cambridge, MA: MIT Press.
Sollich, P., & Barber, D. (1998). Online learning from finite training sets in nonlinear networks. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Wendemuth, A., Opper, M., & Kinzel, W. (1993). The effect of correlations in neural networks. Journal of Physics A, 26(13), 3165–3185.

Received September 19, 1997; accepted March 4, 1998.
LETTER

Communicated by Steven Nowlan

Antipredictable Sequences: Harder to Predict Than Random Sequences

Huaiyu Zhu
Santa Fe Institute, Santa Fe, NM 87501, U.S.A.

Wolfgang Kinzel
Santa Fe Institute, Santa Fe, NM 87501, U.S.A., and Institut für Theoretische Physik, Universität Würzburg, D-97074 Würzburg, Germany
For any discrete-state sequence prediction algorithm A, it is always possible, using an algorithm B no more complicated than A, to generate a sequence for which A's prediction is always wrong. For any prediction algorithm A and sequence x, there exists a sequence y no more complicated than x, such that if A performs better than random on x, then it will perform worse than random on y by the same margin. An example of a simple neural network predicting a bit sequence is used to illustrate this very general but not widely recognized phenomenon. This implies that any predictor with good performance must rely on some (usually implicitly) assumed prior distributions of the problem.

Neural Computation 10, 2219–2230 (1998) © 1998 Massachusetts Institute of Technology

1 Introduction

Many sequence prediction algorithms have been studied that can predict with reasonable accuracy the next state of a sequence based on a short history of previous states (cf. Weigend & Gershenfeld, 1994, for example). At least that was what we believed before performing the experiments described in this article. It is the general belief in the machine learning community that these algorithms will make good predictions on sequences with obvious statistical structures, while on sequences with little or very subtle structure, they will perform as well as they would on a purely random sequence. This turns out not to be true unless some additional severe restrictions are imposed.

In section 2 we show that for any prediction algorithm, we can construct a time series for which this algorithm fails completely. This and several related general results are exemplified for a simple perceptron. The computer experiments explained in section 3 show that such anticorrelated bit sequences have a complex structure with extremely long periods. The perceptron that generates this sequence (the teacher) has maximal prediction error, by construction. Nevertheless, another perceptron (student) learning this sequence obtains some overlap with the generating one. We calculate the
overlap and the prediction rate analytically. Further experiments in section 4 show that to predict this complex anticorrelated sequence without error, the student perceptron has to be much larger than the teacher. In section 5 we discuss the implications of our findings for general prediction algorithms.

2 Predictable, Unpredictable, and Even Less Predictable

Suppose we are to predict a sequence x = [xi : i ∈ Z], where xi ∈ {−1, 1}, using a single-layer binary perceptron P of n nodes with the Rosenblatt learning rule (Hertz, Krogh, & Palmer, 1991),
$$P:\quad x'_i = \mathrm{sign}(w_i^T x_i), \qquad w_{i+1} - w_i = \frac{c}{n}\,(x_{i+1} - x'_i)\, x_i, \tag{2.1}$$
where xi = [x_{i−n+1}, . . . , xi] ∈ {−1, 1}ⁿ is a sliding window of past data, wi ∈ Rⁿ is the weight vector, x′i is the prediction of the next state x_{i+1}, and c is the learning speed. For simplicity, we omit explicit reference to the time step i and write the above equations as
$$P:\quad x' = \mathrm{sign}(w^T x), \qquad \Delta w = \frac{c}{n}\,(x_+ - x')\, x. \tag{2.2}$$
The performance of the predictor P may be evaluated by the correlation
$$R(P, x) = \left\langle x' x_+ \right\rangle = \lim_{m\to\infty} \frac{1}{m} \sum_{i=1}^{m} x'_i\, x_{i+1}, \tag{2.3}$$
which equals 1 for completely correct prediction and 0 for completely random prediction. Denote ρ = R(P, x). The error rate is ε = (1 − ρ)/2, while (1 + ρ)/2 = 1 − ε is the rate of correct prediction.

We digress a little to discuss our framework and assumptions. From now on we shall let the step length c = 1, as its only effect is to rescale w without affecting any external behavior. In this article, only sequences of infinite length are considered, for which the effect of initial conditions generally fades with the passage of time, so initial conditions will not be made explicit except where they matter. We may call this the vanishing initial condition assumption. Note that with this assumption, all of the following discussion applies to hidden Markov models as well. We also point out that allowing x to contain an infinite past does not compromise generality. For special cases such as the perceptron learning example used here, in which only a finite number n of steps into the past are used, one can always place restrictions on the predictors instead of the sequences to achieve the same effect. Such assumptions are not considered in this article. They are not likely to affect our main conclusions.

The above simple predictor P is already powerful enough to detect some interesting structures. For example, it is well known that humans are generally not good at producing random numbers. The "random sequences"
x casually typed by university students are predicted by P (with n = 10) with performance 0.1 < R(P, x) < 0.6 (Kinzel & Reents, 1998). On the other hand, such predictors are not good at detecting more subtle regularities. Tested on most random number generators z in use today, the performance is exactly R(P, z) = 0.

The main topic of this article starts from the observation that there exist sequences x that are more difficult to predict than a random sequence. For example, x can be generated by assigning the opposite of P's prediction as its target; that is, it is generated by the algorithm (Eisenstein, Kanter, Kessler, & Kinzel, 1995)
$$A(-P):\quad x_+ = -x' = -\mathrm{sign}(w^T x), \qquad \Delta w = \frac{1}{n}\,(x_+ - x')\, x = \frac{2}{n}\, x_+ x. \tag{2.4}$$
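Equations 2.2 and 2.4 translate directly into code. Below is a minimal sketch (all function and variable names are ours) that generates s(−P) with c = 1:

```python
# Perceptron predictor P (eq. 2.2) driven by the generator A(-P) (eq. 2.4):
# the next bit of the sequence always contradicts P's own prediction.
import numpy as np

def generate_antipredictable(n=10, steps=10_000, seed=0):
    """Generate s(-P): each new bit contradicts the prediction of P."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)                  # initial weights of P
    x = rng.choice([-1.0, 1.0], size=n)         # initial window of past bits
    seq = np.empty(steps)
    for i in range(steps):
        x_pred = 1.0 if w @ x >= 0 else -1.0    # x' = sign(w^T x)
        x_next = -x_pred                        # A(-P): contradict P
        w += (x_next - x_pred) * x / n          # Rosenblatt rule with c = 1
        x = np.append(x[1:], x_next)            # slide the window
        seq[i] = x_next
    return seq

s = generate_antipredictable()
print("mean bit:", s.mean())   # close to 0; by construction R(P, s(-P)) = -1
```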
For such a sequence, the prediction of P is wrong all the time, so ρ = −1. We coined the term antipredictable to describe sequences with ρ < 0, because they are even more difficult to predict than an "unpredictable" sequence (a random sequence) with ρ = 0.

More generally, for an arbitrary time-independent predictor P and any ρ ∈ [−1, 1], we can generate a sequence x = s(ρP) with R(P, x) = ρ by the time-independent algorithm
$$A(\rho P):\quad x_+ = \begin{cases} x', & \text{with probability } p, \\ -x', & \text{with probability } 1 - p, \end{cases} \tag{2.5}$$
where p = (1 + ρ)/2 ∈ [0, 1]. The performance R(P, s(ρP)) = ρ ranges between ρ = −1 for the most difficult sequence s(−P) and ρ = 1 for the easiest sequence s(P), with the random sequence right in the middle at ρ = 0. Clearly there is a one-to-one correspondence between the sequences s(ρP) and s(−ρP) in the sense of probability distributions of random processes. If the randomness in these algorithms is generated by pseudo–random number generators, then the algorithms A(±P) are significantly simpler than the A(ρP) with intermediate ρ, in the sense of Turing machine algorithms.

The above construction also shows that for any predictor P, sequence x, and number ρ ∈ [−1, 1], there exists another predictor ρP such that R(ρP, x) = ρR(P, x). Furthermore, under the assumption of vanishing initial conditions, any (random) sequence is generated by a time-dependent stochastic mapping x → x_+. In particular, a single deterministic sequence is simply a deterministic function of time. Given an arbitrary predictor P, a random sequence x, and a number ρ ∈ [−1, 1], we can generate a new sequence y = s(ρP, x)
with R(P, y) = ρR(P, x), by the time-dependent algorithm
$$A(\rho P, x):\quad y_+ = \begin{cases} y' x' x_+, & \text{with probability } p, \\ -y' x' x_+, & \text{with probability } 1 - p, \end{cases} \tag{2.6}$$
where x′, y′ are the predictions of P on x and y.

One might object to calling s(−P) "the most difficult sequence to predict" by arguing that it is difficult only for one particular predictor P with a particular initial setting and learning parameters. This objection is not valid, for several reasons. First, it is well known (Knuth, 1981) that if the concept of "difficulty of prediction" is to have any meaning at all, it must be measured against a certain predictor. If we are allowed to choose a predictor for each sequence, then any deterministic sequence can be regarded as completely predictable, if we choose its own generator as the predictor. Second, contrary to what might have been suspected, for any given predictor, these types of sequences are not exceptional. As we have seen, half of all possible binary sequences are harder to predict than a random one. It is also easy to see by a similar construction that if the state space has m elements, then a proportion (m − 1)/m of sequences will be difficult while a proportion 1/m will be easy, when the difficulty is compared to that of predicting a random sequence. Finally, as we shall see in sections 3 and 4, the given sequence is difficult for a large class of predictors, not just P, even if the internal parameters are different. The above construction of ±ρP already showed that for any given sequence x, there are at least as many predictors that fail on x as there are successful predictors, when the internal state of the predictor is identical to that of the generator.

In practice, if the generator of x is unknown, it would be very puzzling indeed to see that x_+ is opposite to each x′. There is an interesting explanation of this phenomenon. We can regard x = s(P) and y = s(−P) as the bets placed by two opponents in a "matching-penny game," in which x is trying to predict the same as y while y is trying to predict the opposite of x. Because this is a zero-sum game and both players are using the same algorithm with opposite objectives, the algorithm can be good in one circumstance only if it is bad in another, by the same amount.

3 Learning from a Confused Teacher

The two predictors P and −P have the same complexity but drastically different performance on the sequences s(P) and s(−P):
$$R(P, s(P)) = R(-P, s(-P)) = 1,$$
$$R(P, s(-P)) = R(-P, s(P)) = -1. \tag{3.1}$$
Although this situation looks symmetric, the sequences s(P) and s(−P) are quite different. The former is generated by a perceptron with a fixed weight
and is periodic with short periods. The latter is generated by a learning algorithm whose predictions are always contradicted, and it looks completely random (even for n = 10). Two questions naturally arise from this situation. First, is s(−P) a random sequence? This seems plausible from the argument that any pattern in it would fade away in the learning process. Second, we know that s(−P) is the most difficult sequence to predict by P, given that they share the same initial weight, but is it also difficult for the same P with a different initial weight, or for a different P? These two questions are answered by the following experiments in a teacher-student setting.

Let P1 be the above perceptron (the teacher) with a particular initial setting and P2 be another perceptron (the student) with possibly different initial settings. They might differ in the memory length n, initial weight w, and initial input x. By definition, ρ1 = R(P1, s(−P1)) = −1. In the experiment we study ρ2 = R(P2, s(−P1)). The variables nk, xk, wk, and εk, k = 1, 2, refer to those of the perceptrons Pk, respectively.

In our first experiments we use the same length nk and the same initial input xk, but different initial weights wk. Denote cos θ = w1 · w2/|w1||w2|. The results are:

• For small n, there are significant fluctuations in ρ2 and θ depending on the initial values, which decrease with increasing n. The fluctuations in ρ2 are smaller than those of cos θ.
• For n sufficiently large (e.g., n > 15), in most runs approximately ρ2 ≈ −0.55 and cos θ ≈ 0.71 (correspondingly θ ≈ π/4).
• The average values do not depend on the initial θ, except for the trivial case of θ = 0, in which P1 and P2 remain locked forever.
• The weight w1 performs a random walk on a certain sphere centered at the origin.
• The weight w2 walks on a sphere with a slightly smaller radius, also centered at the origin.

The first implication of these results is that the sequence s(−P1) is only chaotic but certainly not random, insofar as a simple perceptron P2 of the same complexity as P1 can tell. More important, to the predictor P2, the structure in s(−P1) is detrimental rather than beneficial, even though s(−P1) is generated without knowledge of the actual weight values of P2.

Second, without taking the weights into account, the behavior of P2 is similar to that of 0.55P1. In other words, the prediction of P2 is distributed as if it were a random mixture of the prediction of P1 with probability 0.55 and of P0 with probability 0.45, where P0 is a purely random predictor.

Third, the process in the weight space can be described as follows: the "confused teacher" w1 makes a random walk, while the "student" w2 tries to learn the opposite of the teacher's prediction, −x′, based on its past behavior x. It succeeds partially but is still more confused than trying to learn a
random sequence. Further experiments also show that if the student tries to learn the prediction of the teacher, x′, instead of its negative, still based on the teacher's past behavior x, it obtains exactly the same success rate.

The following heuristic analysis explains most of these observations, provided that certain randomness assumptions are satisfied. The main assumption is that, as far as w1 and w2 are concerned, x can be regarded as an n-dimensional standard gaussian. More precisely, we assume that the length of the projections of x onto the subspaces constructed from w1 and w2 is distributed in the same way as the corresponding projections of a standard n-variate gaussian. This assumption may be justified by the observation that wᵀx is a sum of independent variables, so that the central limit theorem applies (Biehl & Riegler, 1994). A secondary assumption is that the fluctuations of |w1|, |w2|, and θ can be ignored, because they are of the order 1/√n. These lead to (1 + ρ2)/2 = θ/π (Opper & Kinzel, 1996), and
$$\left\langle |x|^2 \right\rangle = n, \qquad \left\langle |x_w| \right\rangle = \sqrt{\frac{2}{\pi}}, \tag{3.2}$$
where x_w is the projection of x on w. The updating of w1 implies
$$w_{1+} = w_1 - \frac{2}{n}\, x\,\mathrm{sign}(w_1^T x), \tag{3.3}$$
$$|w_{1+}|^2 = |w_1|^2 + \frac{4}{n^2}\,|x|^2 - \frac{4}{n}\,|w_1^T x|, \tag{3.4}$$
$$\left\langle \Delta |w_1|^2 \right\rangle = \frac{4}{n} - \frac{4}{n}\,\langle |w_1| \rangle \sqrt{\frac{2}{\pi}}. \tag{3.5}$$
Therefore ⟨|w1|⟩ will converge to √(π/2). Similar arguments apply to w2, except that Δw2 is nonzero only for a proportion ε2 = 1 − θ/π of the time steps, when sign(w1ᵀx) = sign(w2ᵀx). To fix the argument, let us define a two-dimensional coordinate system on the plane spanned by w1 and w2, such that the phase angle of w1 is zero and that of w2 is in (0, π). Denote by φ the phase angle of the projection of x on this plane. We have
w2 − n2 xsign(wT2 x), w2 , ½
2
|w2+ | = D
2
1|w2 |
E
|w2 |2 + |w2 |2 ,
4 |x|2 n2
φ ∈ [− π2 , θ − π2 ], φ ∈ [θ − π2 , π2 ],
− n4 |wT2 x|,
φ ∈ [− π2 , θ − π2 ], φ ∈ [θ − π2 , π2 ],
µ ¶ 1 + cos θ θ 4 4 . 1− − h|w2 |i √ = n π n 2π
(3.6)
(3.7)
(3.8)
Therefore ⟨|w2|⟩ will converge to
$$\frac{1 - \theta/\pi}{1 + \cos\theta}\,\sqrt{2\pi}. \tag{3.9}$$
A similar argument on ⟨Δ(w1ᵀw2)⟩ shows that θ will converge to a value determined by
$$1 - \frac{\theta}{\pi} = \frac{(1 + \cos\theta)^2}{4}. \tag{3.10}$$
Putting these together and solving for θ, |w1|, and |w2|, we get
$$\theta = 0.70590 = 0.22469\pi, \qquad \cos\theta = 0.76103, \tag{3.11}$$
$$|w_1| = 1.25331, \qquad |w_2| = 0.88051\,|w_1| = 1.10356, \tag{3.12}$$
$$\epsilon_2 = 0.77530, \qquad \rho_2 = -0.55061. \tag{3.13}$$
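Equation 3.10 has no closed-form solution, but a numerical root finder reproduces these values; here is a short check (our code, using SciPy's standard brentq root finder):

```python
# Solve 1 - theta/pi = (1 + cos theta)^2 / 4 (equation 3.10) and derive the
# remaining fixed-point quantities from (3.9), <|w1|> -> sqrt(pi/2), and
# (1 + rho2)/2 = theta/pi.
import numpy as np
from scipy.optimize import brentq

f = lambda th: (1 - th / np.pi) - (1 + np.cos(th)) ** 2 / 4
theta = brentq(f, 0.1, np.pi / 2)               # root of equation 3.10
w1 = np.sqrt(np.pi / 2)                         # convergence value of <|w1|>
w2 = (1 - theta / np.pi) / (1 + np.cos(theta)) * np.sqrt(2 * np.pi)  # eq. 3.9
eps2 = 1 - theta / np.pi                        # student error rate
rho2 = 2 * theta / np.pi - 1                    # from (1 + rho2)/2 = theta/pi
print(f"theta = {theta:.5f} = {theta / np.pi:.5f} pi, "
      f"cos theta = {np.cos(theta):.5f}")
print(f"|w1| = {w1:.5f}, |w2| = {w2:.5f}, eps2 = {eps2:.5f}, rho2 = {rho2:.5f}")
```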
Our further experimental results (using n = 100 and averaged over 4 × 10⁵ runs) are approximately
$$\cos\theta = 0.714, \qquad \rho_2 = -0.550, \tag{3.14}$$
$$|w_1| = 1.13, \qquad |w_2| = 1.01, \qquad |w_2|/|w_1| = 0.889, \tag{3.15}$$
which agree well with the derived values above, considering the rough assumptions made, especially the approximation of a distribution on a hypercube by a normal distribution.

The validity of these derivations seems to imply that the motion of P1 and P2 is such that there is no correlation within each of the pairs (w1, x), (w2, x), (x, x′), (w1, x′), and (w2, x′), even though there are correlations within the groups (w1, w2), (w1, x, x′), and (w2, x, x′). Note that the independence of (x, x′) does not imply that the sequence x is independent also on a longer time scale, so this does not contradict the fact that P2 performs worse on x than on a random sequence.

4 More Powerful Students

The sequence x = s(−P1) is produced by a deterministic algorithm, with regularities detectable by simple perceptrons. Therefore, it is natural to ask whether P2 might be able to learn x perfectly if its size n2 is much larger. We have performed numerical experiments with various n1 and n2. Apart from
the above results concerning the case n1 = n2, the following observations are made:

• For small n1 there are large fluctuations in ρ2 with respect to the random initial conditions.
• The sequence s(−P1) is usually periodic, and P2 can often learn it when n2 is only a fraction of the period L.
• The performance ρ2 generally increases with n2. For larger n2, it is also more likely that ρ2 = 1.
• For n1 = n2, the error is larger than for a random guess: ρ2 < 0, ε2 > 1/2.
• If n2 ≈ 3n1, there appears to exist a plateau at zero correlation, ρ2 = 0, ε2 = 1/2.
• For n2 sufficiently large, especially if 2n1 ≤ n2, it converges very quickly to ρ2 = 1, ε2 = 0.

The results of a single run of 10⁴ training and prediction steps for n1 = 10 and various n2 are shown in Figure 1. It is interesting to compare these observations with the fact that sequences generated by fixed-weight perceptrons are periodic with L ≤ 2ⁿ and that there are no more than 2^{n²} such perceptrons. It is a priori plausible that for perceptrons with learning rules, the period could be much longer, if not infinite. On average, any periodic sequence can be learned perfectly by a perceptron P2 that is sufficiently large, in particular if one has L < 1.7n2 for large values of L (Schröder, Kinzel, & Kanter, 1996).

The above experiments still use the same learning rule for P1 and P2, although possibly with different sizes. It might be argued that there are other algorithms capable of learning both sequences s(P1) and s(−P1) equally well. For example, one possibility is to incorporate cross-validation on a moving average of ρ2 to switch automatically between these two possible generators. However, all such arguments are self-defeating, as a sequence generated by "anti-cross-validation" would demonstrate in this particular case. In general, without any assumption concerning the sequences to be predicted, the only certain way to win is to have P2 vastly more powerful than P1, so that in the state space of P2, all the possible P1 are compressed into a small corner. Of course, such rote learning is neither desirable nor corresponds to either natural or artificial learning.

Useful learning algorithms in practice actually use assumed statistical structures in the problems, although such assumptions are rarely made explicit in the literature. For example, our experiment of predicting sequences typed in by humans suggests that humans use algorithms similar to 0.4P1 when they are trying to produce random sequences. If they happened to be similar to −0.4P1, then the performance of P1 would be even worse than a random guess. Let P2 be a predictor with an algorithm identical to that of P1 but independent initial values, and let P be the set of all predictors whose initial values are independent of P1.
Figure 1: Generalization error ε2 for different n2, on the sequence s(−P1) with fixed n1 = 10. Each dot is the result of a single run of 10⁴ steps. The solid line is a moving average over 20 data points.
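The experiment behind Figure 1 can be sketched in a few lines of code (our implementation and parameter choices; single runs fluctuate strongly, as the scatter in the figure shows):

```python
# Teacher P1 (window n1) generates s(-P1) while a student P2 (window n2)
# learns the sequence online; we record the student's error rate eps2.
import numpy as np

def student_error(n1=10, n2=30, steps=10_000, seed=0):
    """Teacher P1 generates s(-P1); student P2 learns it; return eps2."""
    rng = np.random.default_rng(seed)
    w1, w2 = rng.standard_normal(n1), rng.standard_normal(n2)
    hist = rng.choice([-1.0, 1.0], size=max(n1, n2))   # shared bit history
    errors = 0
    for _ in range(steps):
        x1, x2 = hist[-n1:], hist[-n2:]
        pred1 = 1.0 if w1 @ x1 >= 0 else -1.0
        pred2 = 1.0 if w2 @ x2 >= 0 else -1.0
        nxt = -pred1                        # the teacher is always contradicted
        w1 += (nxt - pred1) * x1 / n1       # teacher keeps (anti-)learning
        w2 += (nxt - pred2) * x2 / n2       # student learns the sequence
        errors += (pred2 != nxt)
        hist = np.append(hist[1:], nxt)
    return errors / steps

for n2 in (10, 30, 100):
    print(n2, student_error(n2=n2))         # eps2 tends to fall with n2
```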
It is an interesting open question whether P2 is always the optimal predictor in P for any sequence s(ρP1) with ρ > 0, while −P2 is always optimal for such sequences with ρ < 0. Note that this is trivially true if P is replaced by {rP2 : r ∈ [−1, 1]}.

Some future work on this particular perceptron learning example may provide further insight into the benefits and limitations of various learning algorithms. For example, it is plausible that if P2 is allowed to be multilayer, n2 may not need to grow exponentially with n1. The actual numerical relation itself would be instructive. Also, if the activation function sign is replaced by a sigmoidal function, then several interesting questions arise due to the Takens theorem (Takens, 1980), which in our case says that if the dynamics are smooth (C²), then there exists an algorithm based on 2(2n) + 1 past steps of x that will predict the future of x. The (not explicitly given) "Takens algorithm" depends on the original dynamics, unlike a learning algorithm, but the latter, by updating the weights, in general uses an infinite
number of steps of past information. It would be interesting to see whether learning algorithms approximate, or converge to, the Takens algorithm in a certain sense. On the other hand, although for any P there exists a finite-sized predictor T(P) that performs perfectly, it is likely that there is no general finite-sized algorithm L that will, for any P, map a sequence s(P) to T(P).

Another direction in which to change the current setting is to allow the two perceptrons to learn from (the opposite of) each other. Our preliminary results show that certain changes in the learning process resemble possible phase transitions, depending on factors such as the relative learning rates, sizes, and so on. The experiments show some intriguing phenomena, but this is outside the scope of this article.

5 Discussion

One deficiency of most published research in learning theory is the lack of clarity about the statistical assumptions concerning the sequences to be predicted. There is a general feeling that most methods used by practitioners are positively correlated with most problems encountered in real life. In a previous article (Zhu & Rohwer, 1996b), examples are given to show that such beliefs in statistics are not warranted theoretically, although they may happen to be true in practice. The new example here is directly in the form of sequence prediction, one of the most actively explored areas of neural computing and machine learning. There are good reasons to believe that such antipredictable sequences are often encountered in many areas where time-sequence predictions are demanded. For example, in financial prediction, if most dealers use similar prediction techniques that respond to financial indicators in similar ways, then most of them will find the opposite of their prediction to be true most of the time.

The results of the current example may be viewed as an instance of the so-called no-free-lunch (NFL) theorems (Wolpert, 1996, p. 1352, and private communication), which say that if an algorithm performs better than random guessing in some situations, it will perform worse than random in others. The proof of one version of NFL may be described informally as follows: All the possible problems and algorithms in a certain situation may be considered as points in a space. The inner product of an algorithm A with a problem B gives the performance of A on B, which in our case is R(A, B) ∈ [−1, 1] and obviously has as much chance of being positive as of being negative.

The implication of these observations is a mixed blessing. On the one hand, it makes life somewhat easier for machine learning. Although priors are needed to make a good learning algorithm, for two-state prediction problems, a prior needs to be specific only to the extent that the algorithm and the problem are positively correlated, if the purpose is simply to do better than random guessing on average. The performance of P on human-produced random sequences and many similar results demonstrate that some
assumption of this kind is valid in various practical situations. It is of much practical importance to specify such prior distributions of problems numerically. An example of how this might be done was given in Zhu and Rohwer (1996a), in the context of a regression algorithm competition over the Internet.

On the other hand, it casts grave doubts on priors specified according to the simplicity or complexity of the problems, unless the complexity itself is defined according to the actual frequency of practical problems. If complexity is defined algorithmically, relative to a repertoire of standard procedures such as logic gates, Turing machines, or programming languages, then the example given here demonstrates that the complexity of generating, predicting, or learning to predict a sequence can all be very different and unrelated to each other, and generally unrelated to the frequency with which such sequences occur in the practical problems to which the algorithm is to be applied. For any given algorithm and at any complexity level, there are at least as many bad problems as good problems. It is not the existence of structures (or the lack of them) that makes an algorithm good. Rather, it is the alignment of the structures in the problem with those in the algorithm that makes the particular algorithm good for the particular distribution of problems. Naturally, things without any structure cannot be learned, but the mere existence of strong structures only guarantees that an algorithm either wins or loses heavily in the long run, compared to random guessing, without favoring either outcome.

In fact, the existence of prior knowledge is well accepted, even though the indispensable role of priors in the performance of learning rules is not widely recognized and there is no general agreement on the types of structures that underlie many "general-purpose" prediction algorithms. Our examples show that the implicit priors generally used for problems of the type considered here are based not on complexity but on positive correlation. It remains to be seen to what extent this captures the salient properties of the majority of practical learning problems.

Acknowledgments

We are grateful for the critical and helpful comments of D. Wolpert, J. Schmidhuber, and the anonymous referees on an earlier version of this article. We thank the Santa Fe Institute for hospitality and support. H. Z. also thanks TXN, Inc. for financial support.

References

Biehl, M., & Riegler, P. (1994). On-line learning with a perceptron. Europhysics Letters, 28, 525–530.
Eisenstein, E., Kanter, I., Kessler, D. A., & Kinzel, W. (1995). Generation and prediction of time series by a neural network. Physical Review Letters, 74(1), 6–9.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Kinzel, W., & Reents, G. (1998). Physics by computer. Berlin: Springer-Verlag.
Knuth, D. E. (1981). The art of computer programming: Vol. 2. Seminumerical algorithms (2nd ed.). Reading, MA: Addison-Wesley.
Opper, M., & Kinzel, W. (1996). Statistical mechanics of generalization. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 151–209). Berlin: Springer-Verlag.
Schröder, M., Kinzel, W., & Kanter, I. (1996). Training a perceptron by a bit sequence: Storage capacity. Journal of Physics A, 29, 7965–7972.
Takens, F. (1980). Detecting strange attractors in turbulence. In D. A. Rand & Z.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer-Verlag.
Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Wolpert, D. H. (1996). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Zhu, H., & Rohwer, R. (1996a). Bayesian regression filters and the issue of priors. Neural Computing & Applications, 4(3), 130–142.
Zhu, H., & Rohwer, R. (1996b). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.

Received August 4, 1997; accepted April 3, 1998.